CN104112012A

CN104112012A - Score normalization method for diversity of information retrieval results

Info

Publication number: CN104112012A
Application number: CN201410340344.7A
Authority: CN
Inventors: 李洁玉; 黄春兰; 吴胜利
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2014-07-16
Filing date: 2014-07-16
Publication date: 2014-10-22

Abstract

The invention discloses a document score normalization method for the diversification of information retrieval results. Scores are normalized from document ranking positions: if the ranking position of a document is rank, its normalized score is 1-0.2*ln(rank+1). The method suits the goal of diversifying information retrieval results, makes document scores more comparable, and can be applied to data fusion of information retrieval results, distributed information retrieval and the like.

Description

A score normalization method for the diversification of information retrieval results

Technical field

The present invention relates to a score normalization method for the diversification of retrieval results, applicable to data fusion of information retrieval results, distributed information retrieval and the like.

Background technology

Many applications, such as data fusion of information retrieval system results and distributed information retrieval, need to use document scores to process data from several sources together. For these applications, score normalization is an indispensable step. The documents to be processed come from many different retrieval systems, and the scores those systems assign generally have different ranges and distributions, so scores from different sources are not directly comparable. Some retrieval systems provide no document scores at all, only a ranked list of documents. Such non-normalized or missing scores can greatly affect subsequent processing. Score normalization ensures that document scores are comparable and is a necessary preparatory step before data from different sources are combined.

At present there are several methods for score normalization. They fall broadly into two classes: normalization methods based on raw scores and normalization methods based on document ranking positions. A method based on raw scores takes the raw document scores provided by a retrieval system and applies some strategy to convert the raw score distribution into a new one, so that normalized scores from different systems are comparable. The strategies are mainly of two kinds, linear and non-linear. Among the linear methods, the classic zero-one linear normalization method [1] maps the raw scores of documents linearly onto the interval [0,1]; the fitting method [2] improves on zero-one normalization by mapping scores onto an interval [a,b]; and the Sum-to-One method [3] requires that all normalized scores sum to 1. Non-linear methods include a mixture model that accounts for the different score distributions of relevant and irrelevant documents [4] and a normalization method based on the CDF (Cumulative Density Function) [5].
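The three linear methods named above can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the function names are hypothetical, the handling of a list whose scores are all equal is an arbitrary choice, and shifting the minimum to 0 before scaling in `sum_to_one` is an assumption, since the text only states that the normalized scores must sum to 1.

```python
def zero_one(scores):
    """Linearly map raw scores onto [0, 1] (min -> 0, max -> 1)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate list: every score is equal. Returning 0.0 is an
        # arbitrary choice made for this sketch.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fitting(scores, a, b):
    """Linearly map raw scores onto [a, b], building on zero_one."""
    return [a + (b - a) * z for z in zero_one(scores)]


def sum_to_one(scores):
    """Scale scores so that they sum to 1.

    Assumption: the minimum is shifted to 0 first; the source text only
    requires that the normalized scores sum to 1.
    """
    lo = min(scores)
    shifted = [s - lo for s in scores]
    total = sum(shifted)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in shifted]


raw = [12.0, 9.0, 6.0, 3.0]  # made-up raw scores from one system
print(zero_one(raw))
print(fitting(raw, 0.06, 0.6))
print(sum_to_one(raw))
```

All three depend on the raw scores being present and meaningful, which is the limitation the rank-based methods address.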

A prerequisite for normalization based on raw scores is that the system provides genuine and valid raw score information. When a system provides only a ranked list of documents and no raw scores, some method is needed to convert the ranking information into scores. A well-known rank-based normalization method is the reciprocal rank method [6], which scores a document as 1/(rank+k) and reports that k=60 gives the best results. Logistic models have also been used for score normalization [7,8]. In [7,8], Le Calvé et al. use the logarithm ln(rank) of the document ranking position in place of the ranking position rank itself. A logistic curve with rank itself as the independent variable falls very quickly as rank increases: beyond rank 10 the normalized scores are all very close to 0, which makes scores at positions outside the top ten hard to compare, particularly for documents at positions 11-100. Other rank-based normalization methods include the cubic model [9] and the Borda count model [10].
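As a small illustration of the reciprocal rank method, the sketch below computes 1/(rank+k) with k=60. The function name is hypothetical, and ranks are assumed to start at 1.

```python
def reciprocal_rank_score(rank: int, k: int = 60) -> float:
    """Rank-based score 1 / (rank + k); rank is 1-based."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return 1.0 / (rank + k)


# Scores at a few positions; no raw scores from the system are needed.
for rank in (1, 10, 100):
    print(f"rank {rank}: {reciprocal_rank_score(rank):.6f}")
```

With k=60 the score at rank 1 is 1/61 and the score at rank 100 is 1/160, so the whole top 100 spans a narrow range of small values.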

These score normalization methods can give good retrieval performance in some cases, but they do not take the diversification of retrieval results into account. Whether they can support diversity of retrieval results remains to be investigated. Given that in practice some retrieval systems do not provide raw document scores, the present invention proposes a score normalization method based on document ranking positions, in particular one that uses the logarithm of the document ranking position, so that the normalized scores remain distinguishable across the top 100 positions.

References

[1] Lee, J.H.: Analysis of multiple evidence combination. In: Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, Pennsylvania, USA, pp. 267-275, 1997.

[2] Wu, S., Crestani, F., Bi, Y.: Evaluating Score Normalization Methods in Data Fusion. In: Ng, H.T., Leong, M.-K., M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 642-648. Springer, Heidelberg, 2006.

[3] Montague, M., Aslam, J.A.: Relevance score normalization for metasearch. In: Proceedings of ACM CIKM Conference, Berkeley, USA, pp. 427-433, 2001.

[4] Manmatha, R., Rath, T., Feng, F.: Modeling score distributions for combining the outputs of search engines. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001.

[5] Fernández, M., Vallet, D., Castells, P.: Probabilistic score normalization for rank aggregation. In: Advances in Information Retrieval. Springer Berlin Heidelberg, pp. 553-556, 2006.

[6] Cormack, G.V., Clarke, C.L.A., Buttcher, S.: Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 758-759. Boston, Massachusetts, 2009.

[7] Le Calvé, A., Savoy, J.: Database merging strategy based on logistic regression. Information Processing & Management 36.3, pp. 341-359, 2000.

[8] Savoy, J.: Report on CLEF-2003 multilingual tracks. In: Comparative Evaluation of Multilingual Information Access Systems. Springer Berlin Heidelberg, pp. 64-73, 2004.

[9] Wu, S., Bi, Y., McClean, S.: Regression relevance models for data fusion. In: Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on. IEEE, 2007.

[10] Aslam, J.A., Montague, M.H.: Models for Metasearch. SIGIR 2001: 275-284.

Summary of the invention

The object of the present invention is to provide a score normalization method for the diversification of retrieval results, so as to improve the diversity performance of retrieval results and make the scores that different systems assign to the same document more comparable.

To solve the above technical problem, the present invention adopts the following technical scheme:

A document score normalization method for the diversification of information retrieval results, characterized in that: with rank denoting the ranking position of a document, a non-linear score normalization model with the logarithm of rank at its core is used, computed as follows:

s = 1-0.2*ln(rank+1)

where rank denotes the ranking position of the document and s denotes the normalized score of the document.

The present invention has the following beneficial effects. It adopts a simple logarithmic model that suits the goal of retrieval result diversification and provides more comparable document scores, so that retrieval results show good relevance and diversity. It can be applied to data fusion of information retrieval results and to distributed information retrieval.

Embodiment

The technical scheme of the present invention is described in further detail below with reference to specific embodiments.

Embodiment 1

Suppose that, in a particular retrieval scenario, a system returns for some query a ranked list containing 5 retrieved documents, and the system does not provide a score for any of them. With the logarithmic model 1-0.2*ln(rank+1), the score of the document at each ranking position is:
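The scores for this 5-document list follow directly from the formula. The sketch below evaluates it for ranks 1 to 5; the function name is hypothetical, and ranks are assumed to start at 1.

```python
import math


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    return 1.0 - 0.2 * math.log(rank + 1)


# A ranked list of 5 documents for which the system gave no raw scores.
scores = [log_rank_score(rank) for rank in range(1, 6)]
for rank, s in enumerate(scores, start=1):
    print(f"rank {rank}: {s:.4f}")
```

The score decreases steadily with rank. It stays positive through rank 147 and turns negative from rank 148 onward, because ln(rank+1) exceeds 5 there.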

Experimental setup and results

The experiments use three groups of results submitted to TREC (the Text REtrieval Conference), taken from the diversity task of the web track from 2009 to 2011. Each group contains the top 8 results of its year (ranked by ERR-IA@20). Each retrieval result contains ranked document lists for 50 queries, and each ranked list contains the ranking and raw score information of a number of documents. Information about these results is given in Table 1.

Table 1 Three groups of results submitted to the TREC web track diversity task

The reciprocal model 1/(rank+60) and the logarithmic model 1-0.2*ln(rank+1) can be compared by how well each is linearly related to the relevance scores at document ranking positions. The relevance score at a ranking position is determined by the relevance scores, for the given queries, of the documents found at that position, and each document's relevance score for a query comes from human judgment. Take one group of retrieval results, such as the 8 member results of the 2009 group: each member result contains ranked document lists for 50 queries, so there are 8*50=400 documents in total at any given position. The relevance scores of these 400 documents are first summed: a document that is relevant to k (k>0) sub-topics of the query has relevance score k, and an irrelevant document has k=0 and so contributes nothing. The sum is then divided by 400 to give the relevance score at that ranking position. Table 2 gives the relevance scores at the first 5 ranking positions in the three groups of data. Besides the ranking positions and the corresponding relevance scores, it also gives the estimated scores computed by the reciprocal model and the logarithmic model.

Table 2 Distribution of relevance scores over document ranking positions
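The per-position relevance score described above can be sketched as follows. The function name and the input layout (one list of sub-topic counts k per ranked list) are assumptions made for illustration, and the toy input is invented and far smaller than the 400 lists per group used in the experiments.

```python
def position_relevance(subtopic_counts, depth):
    """Average relevance score at each of the first `depth` positions.

    subtopic_counts: one list per ranked list (member result x query).
    Entry i is k, the number of query sub-topics the document at
    position i + 1 is judged relevant to (k = 0 if irrelevant).
    """
    n_lists = len(subtopic_counts)
    if n_lists == 0:
        raise ValueError("need at least one ranked list")
    totals = [0] * depth
    for counts in subtopic_counts:
        for pos, k in enumerate(counts[:depth]):
            totals[pos] += k  # irrelevant documents add 0
    # Divide by the number of lists (400 in the text: 8 results x 50 queries).
    return [total / n_lists for total in totals]


# Invented toy input: 4 ranked lists, 3 positions each.
toy_lists = [[3, 0, 1], [1, 2, 0], [2, 0, 0], [0, 1, 1]]
print(position_relevance(toy_lists, depth=3))  # [1.5, 0.75, 0.5]
```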

Based on the above, the linear relationship between each of the two models and the relevance scores is examined by curve estimation (regression), with the model's estimated score as the independent variable and the relevance score as the dependent variable. Table 3 gives the regression statistics. The significance level is at the .000 level in every case, which shows that the regression estimates are valid. On both the R² and F statistics, the logarithmic model is better than the reciprocal model in all three groups of data.

Table 3 Linear regression statistics: model estimate as the independent variable, relevance score as the dependent variable
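The R² and F statistics of such a regression can be computed as in the sketch below. It is a plain least-squares fit with one predictor, written under the assumption that this matches the curve estimation used in the experiments; the function name is hypothetical and the input numbers are invented placeholders, not the patent's data.

```python
def linear_fit_stats(x, y):
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared, f_statistic) for a simple
    regression with one predictor and n - 2 residual degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / ss_total
    # F = explained variance / residual variance; infinite for a perfect fit.
    if ss_res == 0:
        f_stat = float("inf")
    else:
        f_stat = (ss_total - ss_res) / (ss_res / (n - 2))
    return slope, intercept, r_squared, f_stat


# Invented placeholder values: x stands for model estimates,
# y for observed relevance scores at the same positions.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2, f_stat = linear_fit_stats(x, y)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r2:.3f} F={f_stat:.3f}")
```

Running the same function once with the reciprocal model's estimates and once with the logarithmic model's estimates as x gives the pair of R² and F values to compare.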

In addition, the two models are examined in an application of score normalization under the result diversification goal; data fusion of information retrieval results is chosen here. The same 3 groups of experimental data are used: after document scores are normalized, the results are fused with the CombSum method. Besides the two rank-based score normalization methods above, the score-based zero-one linear normalization method is also included. Table 4 gives the evaluation of the fusion results, where best denotes the best result among the member systems of a group. Overall, the results fused by CombSum are better than the best member result in each group, which shows that CombSum is an effective fusion method for retrieval result diversification. Of the three score normalization methods, zero-one normalization gives the worst results, mainly because some member systems do not provide document score information. The rank reciprocal model and the rank logarithmic model both perform relatively well. The rank reciprocal model scores higher than the rank logarithmic model on the 2009 and 2011 data and lower on the 2010 data. Averaged over the three groups of data, the rank logarithmic model is slightly better than the rank reciprocal model overall, though not markedly so. This shows that rank-based score normalization is effective for data fusion under the retrieval result diversification goal. Although the rank logarithmic model does not outperform the rank reciprocal model in every case, it may be the better model in general.

Table 4 Evaluation of fusion results under the three score normalization methods, measured by ERR-IA@20
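The fusion step can be sketched as follows: each member list is normalized by rank with the logarithmic model, and CombSum adds up the normalized scores each document receives across lists. The function names and document identifiers are hypothetical, and a document absent from a list is assumed to contribute nothing for that list.

```python
import math
from collections import defaultdict


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    return 1.0 - 0.2 * math.log(rank + 1)


def combsum(ranked_lists, score_fn=log_rank_score):
    """Fuse ranked lists by summing each document's normalized scores.

    ranked_lists: list of lists of document ids, best document first.
    Returns (doc_id, fused_score) pairs sorted by descending score,
    with ties broken by document id.
    """
    fused = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            fused[doc_id] += score_fn(rank)
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Three hypothetical member systems answering the same query.
member_lists = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
for doc_id, score in combsum(member_lists):
    print(f"{doc_id}: {score:.4f}")
```

Passing a different `score_fn`, for example one computing 1/(rank+60), swaps in the reciprocal model without changing the fusion code.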

Claims

1. A document score normalization method for the diversification of information retrieval results, characterized in that: with rank denoting the ranking position of a document, a non-linear score normalization model with the logarithm of rank at its core is used, computed as follows:

s = 1-0.2*ln(rank+1)

where rank denotes the ranking position of the document and s denotes the normalized score of the document.