CN104112012A - Score normalization method for diversity of information retrieval results - Google Patents
- Publication number
- CN104112012A
- Authority
- CN
- China
- Prior art keywords
- document
- rank
- mark
- information retrieval
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document score normalization method for diversified information retrieval results. Scores are normalized from document ranking positions: if a document is ranked at position rank, its normalized score is 1-0.2*ln(rank+1). The method is suited to the goal of diversifying information retrieval results, makes document scores more comparable, and can be applied to data fusion of information retrieval results, distributed information retrieval, and similar tasks.
Description
Technical field
The present invention relates to a score normalization method for the diversification of retrieval results, applicable to data fusion of information retrieval results, distributed information retrieval, and similar tasks.
Background technology
Many applications, such as data fusion of results from information retrieval systems and distributed information retrieval, combine data using document scores. For these applications, score normalization is an indispensable step. They must process documents returned by many different retrieval systems, and the scores that different systems assign generally span different ranges, so scores from different sources are not directly comparable. Some retrieval systems provide no document scores at all, only a ranked list of documents. Unnormalized or missing scores can strongly affect subsequent processing. Score normalization ensures that document scores are comparable and is a necessary preparatory step before data from different sources are combined.
Several score normalization methods exist. They fall broadly into two classes: methods based on raw scores and methods based on document ranking positions. Methods based on raw scores take the raw document scores provided by a retrieval system and apply some strategy to transform the raw score distribution into a new one, so that normalized scores from different systems are comparable. The strategies are mainly linear or nonlinear. Among linear methods, the classic 0-1 linear normalization method [1] maps raw document scores linearly onto the interval [0,1]; the fitting method [2] refines the 0-1 method by mapping scores onto an interval [a,b]; and the Sum-to-One method [3] requires all normalized scores to sum to 1. Nonlinear methods include a mixture model that accounts for the different score distributions of relevant and irrelevant documents [4] and a normalization method based on the CDF (Cumulative Density Function) [5], among others.
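The three linear methods named above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' code: the function names, the handling of degenerate inputs, and the choice to shift the minimum to zero before rescaling in `sum_to_one` are assumptions.

```python
from typing import List


def zero_one(scores: List[float]) -> List[float]:
    """Linearly map raw scores onto [0, 1] (0-1 normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: assumed convention, map everything to 1.0
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fitting(scores: List[float], a: float, b: float) -> List[float]:
    """Linearly map raw scores onto [a, b]."""
    return [a + (b - a) * z for z in zero_one(scores)]


def sum_to_one(scores: List[float]) -> List[float]:
    """Shift the minimum to 0 (assumption), then scale so the scores sum to 1."""
    lo = min(scores)
    shifted = [s - lo for s in scores]
    total = sum(shifted)
    if total == 0:  # degenerate case: assumed convention, spread uniformly
        return [1.0 / len(scores) for _ in scores]
    return [s / total for s in shifted]


raw = [12.0, 9.0, 6.0, 3.0]  # made-up raw scores from one system
print(zero_one(raw))
print(fitting(raw, 0.2, 0.8))
print(sum_to_one(raw))
```

All three need raw scores as input, which is why they cannot be applied to a system that returns only a ranked list.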
Methods based on raw scores presuppose that the system provides genuine, valid raw scores. When a system provides only a ranked list without raw scores, the ranking information must be transformed into scores by some method. A well-known rank-based method is the reciprocal rank method [6], which converts ranking positions into scores with the formula 1/(rank+k) and reports that k=60 gives the best results. Logistic models have also been used for score normalization [7,8]. In [7,8], Le Calvé et al. use the logarithm ln(rank) of the ranking position in place of the position rank itself. A logistic curve with rank itself as the independent variable falls very quickly as rank grows: beyond rank 10, the normalized scores are all very close to 0, which makes scores outside the top ten poorly comparable, particularly for documents at positions 11-100. Other rank-based normalization methods include the cubic model [9] and the Borda count model [10].
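The contrast between a logistic curve on rank and one on ln(rank) can be seen numerically. The sketch below is illustrative only: the logistic coefficients `A` and `B` are hypothetical values chosen for the demonstration and are not taken from [7,8]; only the reciprocal formula 1/(rank+60) comes from the text.

```python
import math


def reciprocal_rank_score(rank: int, k: int = 60) -> float:
    """Reciprocal rank score 1 / (rank + k), with k = 60 as reported in [6]."""
    return 1.0 / (rank + k)


def logistic(x: float, a: float, b: float) -> float:
    """Standard logistic function 1 / (1 + exp(-(a + b * x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


# Hypothetical coefficients, for illustration only.
A, B = 2.0, -0.5

for rank in (1, 10, 11, 50, 100):
    on_rank = logistic(rank, A, B)           # rank itself as the variable
    on_log = logistic(math.log(rank), A, B)  # ln(rank) as the variable
    print(f"rank {rank:3d}: logistic(rank)={on_rank:.6f}  "
          f"logistic(ln rank)={on_log:.6f}  "
          f"reciprocal={reciprocal_rank_score(rank):.6f}")
```

With these assumed coefficients, the curve on rank collapses toward zero well before rank 100, while the curve on ln(rank) still separates positions 11 through 100.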
These score normalization methods can yield good retrieval performance in some cases, but they do not consider retrieval result diversification. Whether they can support diverse retrieval results remains to be investigated. Because some retrieval systems in practice do not provide raw document scores, the present invention proposes a score normalization method based on document ranking positions. In particular, it uses the logarithm ln(rank) of the ranking position so that the normalized scores remain well differentiated across the top 100 positions.
References
[1] Lee, J.H.: Analysis of multiple evidence combination. In: Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, Pennsylvania, USA, pp. 267-275, 1997.
[2] Wu, S., Crestani, F., Bi, Y.: Evaluating Score Normalization Methods in Data Fusion. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 642-648. Springer, Heidelberg, 2006.
[3] Montague, M., Aslam, J.A.: Relevance score normalization for metasearch. In: Proceedings of ACM CIKM Conference, Berkeley, USA, pp. 427-433, 2001.
[4] Manmatha, R., Rath, T., Feng, F.: Modeling score distributions for combining the outputs of search engines. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001.
[5] Fernández, M., Vallet, D., Castells, P.: Probabilistic score normalization for rank aggregation. In: Advances in Information Retrieval. Springer Berlin Heidelberg, pp. 553-556, 2006.
[6] Cormack, G.V., Clarke, C.L.A., Buttcher, S.: Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 758-759. Boston, Massachusetts, 2009.
[7] Le Calvé, A., Savoy, J.: Database merging strategy based on logistic regression. Information Processing & Management 36(3), pp. 341-359, 2000.
[8] Savoy, J.: Report on CLEF-2003 multilingual tracks. In: Comparative Evaluation of Multilingual Information Access Systems. Springer Berlin Heidelberg, pp. 64-73, 2004.
[9] Wu, S., Bi, Y., McClean, S.: Regression relevance models for data fusion. In: Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on. IEEE, 2007.
[10] Aslam, J.A., Montague, M.H.: Models for Metasearch. SIGIR 2001: 275-284.
Summary of the invention
The object of the present invention is to provide a score normalization method for retrieval result diversification that improves the diversity performance of retrieval results and makes the scores that different systems assign to the same document more comparable.
To solve the above technical problem, the present invention adopts the following technical scheme:
A document score normalization method for information retrieval result diversification, characterized in that, given the ranking position rank of a document, a nonlinear score normalization model built on the logarithm of rank is used, calculated as follows:
s=1-0.2*ln(rank+1)
where rank denotes the ranking position of the document and s denotes the document's normalized score.
The present invention has the following beneficial effects. It adopts a simple logarithmic model suited to the goal of retrieval result diversification and provides more comparable document scores, so that retrieval results show good relevance and diversity. It can be applied to data fusion of information retrieval results and to distributed information retrieval.
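A short worked calculation shows the range that the formula s=1-0.2*ln(rank+1) covers. The values follow from the formula alone; the choice of sample ranks is ours.

```latex
\begin{align*}
s(1)   &= 1 - 0.2\ln 2   \approx 0.861, \\
s(100) &= 1 - 0.2\ln 101 \approx 0.077, \\
s(\mathit{rank}) = 0 &\iff \ln(\mathit{rank}+1) = 5
                      \iff \mathit{rank} = e^{5} - 1 \approx 147.4 .
\end{align*}
```

The score therefore decreases strictly with rank, stays positive through rank 147, and turns negative from rank 148 onward. The text does not say how longer lists should be handled; truncating the list or clipping scores at zero would be one possible choice.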
Embodiment
The technical scheme of the present invention is described in further detail below with reference to specific embodiments.
Embodiment 1
Suppose that, in a particular retrieval scenario, a system returns a ranked list of documents for some query. The list contains 5 retrieved documents, and the system does not provide a score for any of them. Applying the logarithmic model 1-0.2*ln(rank+1) to ranks 1 through 5 gives the score of the document at each ranking position.
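The five scores in this embodiment follow directly from the formula. A minimal Python sketch that computes them is given below; the function name `log_rank_score` is our own, and ranks are assumed to start at 1.

```python
import math


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return 1.0 - 0.2 * math.log(rank + 1)


# Scores for the 5 documents of the embodiment, rounded to 4 decimals.
scores = {rank: round(log_rank_score(rank), 4) for rank in range(1, 6)}
for rank, s in scores.items():
    print(f"rank {rank}: {s:.4f}")
# rank 1: 0.8614
# rank 2: 0.7803
# rank 3: 0.7227
# rank 4: 0.6781
# rank 5: 0.6416
```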
Experimental setup and results
The experiments use three groups of results submitted to TREC (the Text REtrieval Conference), selected from the diversity task of the Web track for 2009 to 2011. Each group contains that year's top 8 results, ranked by ERR-IA@20. Each result contains ranked document lists for 50 queries, and each ranked list gives the ranking and raw score of a number of documents. Table 1 gives information about these results.
Table 1. Three groups of results submitted to the TREC Web track diversity task
To compare the quality of the reciprocal model 1/(rank+60) and the logarithmic model 1-0.2*ln(rank+1), we compare how linearly each relates to the relevance score at each document ranking position. The relevance score at a ranking position is determined by the relevance scores, for the specific queries, of the documents at that position, and each document's relevance score for a query comes from human judgment. Take one group of results, for example the 8 member results of the 2009 group: each member result contains ranked lists for 50 queries, so 8*50=400 documents occupy any given position. First, the relevance scores of these 400 documents are summed. If a document is relevant to k (k>0) subtopics of the query, its relevance score is k; if it is not relevant, k=0, so non-relevant documents contribute nothing. The sum is then divided by 400 to give the relevance score at that ranking position. Table 2 gives the relevance scores at the first 5 ranking positions in the three groups of data. Besides the ranking positions and their relevance scores, it also gives the estimated scores calculated by the reciprocal and logarithmic models.
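The averaging step described above can be sketched as follows. This is an illustration under stated assumptions: each ranked list is represented as a sequence of subtopic counts k (one per position), lists shorter than the requested depth are treated as contributing 0 at the missing positions, and the toy input is made up rather than taken from TREC data.

```python
from typing import List, Sequence


def position_relevance(ranked_lists: Sequence[Sequence[int]],
                       depth: int) -> List[float]:
    """Average subtopic count k at each of the first `depth` positions.

    ranked_lists[i][p] is k for the document at position p+1 of list i
    (k = 0 for a non-relevant document). The divisor is the number of
    lists, e.g. 8 * 50 = 400 for one group in the experiments.
    """
    n_lists = len(ranked_lists)
    totals = [0] * depth
    for subtopic_counts in ranked_lists:
        for pos, k in enumerate(subtopic_counts[:depth]):
            totals[pos] += k
    return [t / n_lists for t in totals]


# Toy input: 4 ranked lists, 3 positions each (made-up judgments).
toy = [[2, 0, 1], [1, 1, 0], [0, 3, 0], [1, 0, 0]]
print(position_relevance(toy, 3))  # [1.0, 1.0, 0.25]
```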
Table 2. Distribution of relevance scores at document ranking positions
Based on the above, with the model's estimated score as the independent variable and the relevance score as the dependent variable, the linear relationship between each model and the relevance scores is examined by curve estimation (regression). Table 3 gives the regression results. The significance level is at the .000 level in all cases, which indicates that the regression estimates are valid. On both the R² and F statistics, the logarithmic model is better than the reciprocal model in all three groups of data.
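For readers who want to reproduce this kind of comparison, the R² and F statistics of a simple linear regression can be computed as in the sketch below. It is a generic least-squares fit written with the standard library only; the function name and the toy data are ours, and the numbers are not the patent's experimental values.

```python
from typing import Sequence, Tuple


def simple_linear_fit(x: Sequence[float],
                      y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared, f_statistic), where the F
    statistic has 1 and n - 2 degrees of freedom. Assumes n >= 3, that x
    is not constant, and that the fit is not exact (nonzero residuals).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot
    f_statistic = (ss_tot - ss_res) / (ss_res / (n - 2))
    return slope, intercept, r_squared, f_statistic


# Toy data (made up): x plays the role of the model's estimated score,
# y the observed relevance score at the same ranking position.
slope, intercept, r2, f = simple_linear_fit([1, 2, 3, 4], [1, 3, 2, 4])
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f} F={f:.4f}")
```

Running the fit once with the reciprocal model's estimates as x and once with the logarithmic model's estimates gives the two R² and F values to compare.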
Table 3. Linear regression statistics: the model estimate is the independent variable and the relevance score is the dependent variable
In addition, the two models are examined in an application of score normalization under the result diversification goal; data fusion of information retrieval results is chosen here. Using the same 3 groups of experimental data, document scores are normalized and the results are then fused with the CombSum method. Besides the two rank-based score normalization methods above, the score-based 0-1 linear normalization method is also considered. Table 4 gives the evaluation of the fusion results, where best denotes the best result among the member systems of each group. Overall, the results fused with CombSum are all better than the best result in each group, which shows that CombSum is an effective fusion method for retrieval result diversification. The 0-1 score normalization method gives the poorest results of the three methods, mainly because some member systems do not provide document scores. The rank reciprocal model and the rank logarithmic model both perform relatively well. The reciprocal model scores higher than the logarithmic model on the 2009 and 2011 data and lower on the 2010 data. Averaged over the three groups, the logarithmic model is slightly better than the reciprocal model overall, though not markedly so. This shows that rank-based score normalization is effective for data fusion under the retrieval result diversification goal. Although the rank logarithmic model does not outperform the rank reciprocal model in every case, it may be the better model in general.
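A minimal sketch of CombSum fusion over rank-normalized lists is shown below. CombSum sums each document's normalized scores across the component lists; everything else here is an assumption made for illustration: the function names, the toy document identifiers, the convention that a document missing from a list contributes 0, and the tie-break by document identifier.

```python
import math
from collections import defaultdict
from typing import Dict, List


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    return 1.0 - 0.2 * math.log(rank + 1)


def combsum(ranked_lists: List[List[str]]) -> List[str]:
    """Fuse ranked lists by summing rank-normalized scores per document."""
    fused: Dict[str, float] = defaultdict(float)
    for docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):
            fused[doc_id] += log_rank_score(rank)
    # Highest fused score first; ties broken by document id (assumption).
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy ranked lists from three hypothetical member systems.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d1"]
run_c = ["d2", "d1", "d5"]
print(combsum([run_a, run_b, run_c]))  # ['d2', 'd1', 'd4', 'd3', 'd5']
```

Because the scores depend only on rank, the same code works for member systems that report no raw scores, which is the case where 0-1 normalization cannot be applied.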
Table 4. Evaluation of fusion results under the three score normalization methods, measured by ERR-IA@20
Claims (1)
1. A document score normalization method for information retrieval result diversification, characterized in that, given the ranking position rank of a document, a nonlinear score normalization model built on the logarithm of rank is used, calculated as follows:
s=1-0.2*ln(rank+1)
where rank denotes the document ranking position and s denotes the document's normalized score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410340344.7A CN104112012A (en) | 2014-07-16 | 2014-07-16 | Score normalization method for diversity of information retrieval results |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410340344.7A CN104112012A (en) | 2014-07-16 | 2014-07-16 | Score normalization method for diversity of information retrieval results |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104112012A true CN104112012A (en) | 2014-10-22 |
Family
ID=51708803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410340344.7A Pending CN104112012A (en) | 2014-07-16 | 2014-07-16 | Score normalization method for diversity of information retrieval results |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104112012A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5590317A (en) * | 1992-05-27 | 1996-12-31 | Hitachi, Ltd. | Document information compression and retrieval system and document information registration and retrieval method |
US20050055366A1 (en) * | 2003-09-08 | 2005-03-10 | Oki Electric Industry Co., Ltd. | Document collection apparatus, document retrieval apparatus and document collection/retrieval system |
CN103744984A (en) * | 2014-01-15 | 2014-04-23 | 北京理工大学 | Method of retrieving documents by semantic information |
-
2014
- 2014-07-16 CN CN201410340344.7A patent/CN104112012A/en active Pending
Non-Patent Citations (1)
Title |
---|
ANNE LE CALVÉ;JACQUES SAVOY: "database merging strategy based on logistic regression", 《INFORMATION PROCESSING AND MANAGEMENT》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Karniouchina et al. | Extending the firm vs. industry debate: Does industry life cycle stage matter? | |
US9798797B2 (en) | Cluster method and apparatus based on user interest | |
EP2407897A1 (en) | Device for determining internet activity | |
Haiduc et al. | Automatic query performance assessment during the retrieval of software artifacts | |
Liang et al. | Formal language models for finding groups of experts | |
CN104866554B (en) | A kind of individuation search method and system based on socialization mark | |
CN103123653A (en) | Search engine retrieving ordering method based on Bayesian classification learning | |
Winkler | Approximate string comparator search strategies for very large administrative lists | |
CN110634027A (en) | First-order user refined loss prediction method based on transfer learning | |
Zhang et al. | A pomdp model for content-free document re-ranking | |
Zurell et al. | Effects of functional traits on the prediction accuracy of species richness models | |
Nia et al. | A cross-country analysis of macroeconomic responses to COVID-19 pandemic using Twitter sentiments | |
Hirschberg et al. | Clusters of attributes and well‐being in the USA | |
He et al. | Identifying user behavior on Twitter based on multi-scale entropy | |
CN104112012A (en) | Score normalization method for diversity of information retrieval results | |
Tsai et al. | Risk ranking from financial reports | |
Le et al. | Top-k best probability queries and semantics ranking properties on probabilistic databases | |
Dietz et al. | Time-aware evaluation of cumulative citation recommendation systems | |
Chen et al. | LDA based semi-supervised learning from streaming short text | |
Li et al. | An improved slope one algorithm for collaborative filtering | |
Deb et al. | A correlation based imputation method for incomplete traffic accident data | |
Barnett | Comments on “Chaotic monetary dynamics with confidence” | |
Daoud et al. | Mining query-driven contexts for geographic and temporal search | |
Kubota et al. | How many ground truths should we insert? having good quality of labeling tasks in crowdsourcing | |
McCrea et al. | Conditional modelling of ring‐recovery data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20141022 |