CN104112012A - Score normalization method for diversity of information retrieval results - Google Patents

Info

Publication number
CN104112012A
Authority
CN
China
Prior art keywords
document
rank
mark
information retrieval
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410340344.7A
Other languages
Chinese (zh)
Inventor
李洁玉
黄春兰
吴胜利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201410340344.7A priority Critical patent/CN104112012A/en
Publication of CN104112012A publication Critical patent/CN104112012A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document score normalization method for the diversification of information retrieval results. Scores are normalized using a method based on document ranking position: if a document's ranking position is rank, its normalized score is 1-0.2*ln(rank+1). The method suits the goal of diversifying information retrieval results, makes document scores more comparable, and can be applied to data fusion of information retrieval results, distributed information retrieval, and similar tasks.

Description

A score normalization method for the diversification of information retrieval results
Technical field
The present invention relates to a score normalization method for diversified retrieval results, applicable to data fusion of information retrieval results, distributed information retrieval, and similar tasks.
Background art
Many applications, such as data fusion of results from information retrieval systems and distributed information retrieval, combine data using the scores assigned to documents. For these applications, score normalization is an indispensable step. They must handle many documents returned by different retrieval systems, and the scores that different systems assign generally span different ranges, so document scores from different sources are not directly comparable. Some retrieval systems provide no document scores at all, only a document ranking. Such unnormalized or missing scores strongly affect subsequent processing. Score normalization ensures that document scores are comparable and is a necessary preparatory step before combining data from different sources.
Several score normalization methods exist. They fall broadly into two classes: methods based on raw scores and methods based on document ranking position. Raw-score methods take the scores assigned by the retrieval system and apply some strategy to transform the raw score distribution into a new one, so that normalized scores from different systems are comparable. The strategies are mainly linear or nonlinear. Among the linear methods, the classic one is 0-1 linear normalization [1], which maps raw scores linearly onto the interval [0,1]; the fitting method [2] refines the 0-1 method by mapping scores onto an interval [a,b]; the Sum-to-One method [3] requires that all normalized scores sum to 1. Nonlinear methods include a mixture model that accounts for the different score distributions of relevant and irrelevant documents [4] and a normalization method based on the CDF (cumulative distribution function) [5].
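The three linear methods named above can be sketched in a few lines of Python. This is an illustration only, not code from the cited papers: the function names, the interval endpoints a=0.1 and b=0.9, and the sample scores are all assumptions, and `sum_to_one` simply divides by the total, which presumes non-negative raw scores.

```python
from typing import List


def zero_one(scores: List[float]) -> List[float]:
    """0-1 linear normalization: the minimum maps to 0, the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate list with no spread; returning zeros is an arbitrary choice.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fitting(scores: List[float], a: float, b: float) -> List[float]:
    """Fitting method: the same linear map, but onto [a, b] instead of [0, 1]."""
    return [a + (b - a) * z for z in zero_one(scores)]


def sum_to_one(scores: List[float]) -> List[float]:
    """Scale scores so that they sum to 1 (assumes non-negative raw scores)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


raw = [12.0, 9.0, 6.0, 3.0]          # made-up raw scores from one system
print(zero_one(raw))                 # first value 1.0, last value 0.0
print(fitting(raw, 0.1, 0.9))        # hypothetical interval [0.1, 0.9]
print(sum_to_one(raw))               # values add up to 1
```

All three need real raw scores as input, which is the limitation that motivates rank-based normalization.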
Raw-score normalization presupposes that the system supplies genuine and valid raw scores. When a system provides only a document ranking and no raw scores, the ranking information must be converted into scores by some method. A well-known rank-based method is the reciprocal rank method [6], which assigns each document the score 1/(rank+k) and reports that k=60 gives the best results. The logistic model has also been used for score normalization [7,8]. In [7,8], Le Calvé et al. use the logarithm of the ranking position, ln(rank), in place of the ranking position rank itself. The reason is that a logistic curve with rank as the independent variable falls very quickly as rank increases: beyond rank 10 the normalized scores are all very close to 0, which makes scores at positions outside the top ten hard to compare, particularly for documents at ranks 11-100. Other rank-based normalization methods include the cubic model [9] and the Borda count model [10].
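As a small illustration of the reciprocal rank formula, the following Python sketch evaluates 1/(rank+k) with k=60 at a few positions. The function name is an assumption made for this example; only the formula and the value of k come from the text.

```python
def reciprocal_rank_score(rank: int, k: int = 60) -> float:
    """Score 1/(rank + k) for a 1-based ranking position."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return 1.0 / (rank + k)


for rank in (1, 10, 11, 100):
    print(rank, round(reciprocal_rank_score(rank), 5))
# 1 0.01639
# 10 0.01429
# 11 0.01408
# 100 0.00625
```

With k=60 the scores decrease slowly and stay within a narrow band between ranks 1 and 100.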
These score normalization methods can yield good retrieval performance in some cases, but they do not consider result diversification. Whether they can deliver diverse retrieval results remains to be investigated. Because in practice some retrieval systems do not provide raw document scores, the present invention proposes a score normalization method based on document ranking position, specifically one that uses the logarithm ln(rank) of the ranking position, so that the normalized scores remain well differentiated over the first 100 positions.
References
[1] Lee, J.H.: Analysis of multiple evidence combination. In: Proceedings of the 20th Annual International ACM SIGIR Conference, Philadelphia, Pennsylvania, USA, pp. 267-275, 1997.
[2] Wu, S., Crestani, F., Bi, Y.: Evaluating Score Normalization Methods in Data Fusion. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 642-648. Springer, Heidelberg, 2006.
[3] Montague, M., Aslam, J.A.: Relevance score normalization for metasearch. In: Proceedings of ACM CIKM Conference, Berkeley, USA, pp. 427-433, 2001.
[4] Manmatha, R., Rath, T., and Feng, F.: Modeling score distributions for combining the outputs of search engines. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001.
[5] Fernández, M., Vallet, D., and Castells, P.: Probabilistic score normalization for rank aggregation. Advances in Information Retrieval. Springer Berlin Heidelberg, pp. 553-556, 2006.
[6] Cormack, G.V., Clarke, C.L.A., and Buttcher, S.: Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 758-759. Boston, Massachusetts, 2009.
[7] Le Calvé, A., and Savoy, J.: Database merging strategy based on logistic regression. Information Processing & Management 36(3), pp. 341-359, 2000.
[8] Savoy, J.: Report on CLEF-2003 multilingual tracks. In: Comparative Evaluation of Multilingual Information Access Systems. Springer Berlin Heidelberg, pp. 64-73, 2004.
[9] Wu, S., Bi, Y., and McClean, S.: Regression relevance models for data fusion. In: 18th International Workshop on Database and Expert Systems Applications (DEXA'07). IEEE, 2007.
[10] Aslam, J.A., Montague, M.H.: Models for Metasearch. SIGIR 2001: 275-284.
Summary of the invention
The object of the present invention is to provide a score normalization method for retrieval result diversification that improves the diversity performance of retrieval results and makes the scores that different systems assign to the same document more comparable.
To solve the above technical problem, the present invention adopts the following technical solution:
A document score normalization method for information retrieval result diversification, characterized in that it is a nonlinear score normalization method based on the document ranking position rank, with the logarithm of rank at the core of the model, computed as follows:
s=1-0.2*ln(rank+1)
where rank denotes the ranking position of the document and s denotes the document's normalized score.
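A short worked evaluation of the formula shows the range it covers. The values below follow from the formula alone; the choice of ranks 1 and 100 as checkpoints is made here for illustration, and the text does not say how ranks past the zero crossing are treated.

```latex
\begin{align*}
s(1)   &= 1 - 0.2\ln 2   \approx 0.861 \\
s(100) &= 1 - 0.2\ln 101 \approx 0.077 \\
s(\mathrm{rank}) = 0
  &\iff \ln(\mathrm{rank}+1) = 5
   \iff \mathrm{rank} = e^{5} - 1 \approx 147.4
\end{align*}
```

The score therefore stays positive and strictly decreasing over the first 100 positions and becomes negative from rank 148 onward.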
Beneficial effects of the present invention: the invention uses a simple logarithmic model that suits the goal of result diversification and yields more comparable document scores, so that retrieval results have good relevance and diversity. It can be applied to data fusion of information retrieval results and to distributed information retrieval.
Detailed description
The technical solution of the present invention is described in further detail below with reference to a specific embodiment.
Embodiment 1
Suppose that, in a specific retrieval scenario, a system returns a ranked list of documents for some query. The list contains 5 retrieved documents, and the system does not provide a score for any of them. Applying the logarithmic model 1-0.2*ln(rank+1), the scores at ranking positions 1 to 5 are approximately 0.861, 0.780, 0.723, 0.678 and 0.642.
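The per-position scores for the five documents of this embodiment can be reproduced with a minimal Python sketch. The function name `log_rank_score` is an assumption made for illustration; the formula is the one stated in the text.

```python
import math


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return 1.0 - 0.2 * math.log(rank + 1)


for rank in range(1, 6):
    print(rank, round(log_rank_score(rank), 4))
# 1 0.8614
# 2 0.7803
# 3 0.7227
# 4 0.6781
# 5 0.6416
```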
Experimental setup and results
The experiments use three groups of runs submitted to TREC (Text REtrieval Conference), taken from the diversity task of the Web track from 2009 to 2011. Each group contains the 8 best runs of its year (ranked by ERR-IA@20). Each run contains ranked document lists for 50 queries, and each ranked list gives the ranking and the raw score of a number of documents. Details of these runs are given in Table 1.
Table 1. Three groups of runs submitted to the TREC Web track diversity task
To compare the quality of the reciprocal model 1/(rank+60) and the logarithmic model 1-0.2*ln(rank+1), the linear relationship between each model's estimates and the relevance scores observed at the document ranking positions is examined. The relevance score at a ranking position is determined by the relevance scores, with respect to the given query, of the documents that appear at that position, and each document's relevance score for a query comes from human judgment. Take one group of retrieval results, such as the 8 member runs of the 2009 group: each member run contains ranked lists for 50 queries, so 8*50=400 documents in total occupy any given position. The relevance scores of these 400 documents are first summed. A document that is relevant to k (k>0) subtopics of the query has relevance score k; an irrelevant document has k=0 and therefore contributes nothing. The sum is then divided by 400 to give the relevance score at that ranking position. Table 2 lists the relevance scores at the first 5 ranking positions in the three data sets. Besides the ranking position and its relevance score, the table also gives the estimated scores computed by the reciprocal model and the logarithmic model.
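The averaging procedure described above can be sketched as follows. The data layout (nested dictionaries keyed by run and query), the function name, and the toy runs and judgments are all assumptions made for illustration; they are not the TREC data. A list shorter than the requested position is assumed to contribute 0.

```python
from typing import Dict, List

Runs = Dict[str, Dict[str, List[str]]]   # run id -> query id -> ranked doc ids
Judgments = Dict[str, Dict[str, int]]    # query id -> doc id -> subtopic count k


def position_relevance(runs: Runs, judgments: Judgments, position: int) -> float:
    """Average subtopic count k of the documents found at a 1-based position."""
    total = 0
    count = 0
    for lists_by_query in runs.values():
        for query_id, ranking in lists_by_query.items():
            count += 1                      # one ranked list per (run, query) pair
            if position <= len(ranking):
                doc_id = ranking[position - 1]
                # Unjudged or irrelevant documents have k = 0.
                total += judgments.get(query_id, {}).get(doc_id, 0)
    if count == 0:
        raise ValueError("no ranked lists supplied")
    return total / count


# Toy data: 2 runs x 2 queries = 4 lists (the text has 8 x 50 = 400).
runs = {
    "runA": {"q1": ["d1", "d2"], "q2": ["d5", "d6"]},
    "runB": {"q1": ["d2", "d3"], "q2": ["d6", "d5"]},
}
judgments = {"q1": {"d1": 2, "d2": 1}, "q2": {"d5": 3}}

print(position_relevance(runs, judgments, 1))  # 1.5
print(position_relevance(runs, judgments, 2))  # 1.0
```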
Table 2. Distribution of relevance scores over document ranking positions
Based on the above, with the model's estimated score as the independent variable and the relevance score as the dependent variable, curve estimation (regression) is used to examine the linear relationship between each of the two models and the relevance scores. Table 3 gives the regression statistics. The significance level is at the .000 level in every case, indicating that the regression estimates are valid. On both R² and the F statistic, the logarithmic model is better than the reciprocal model in all three data sets.
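The R² and F statistics mentioned here are those of an ordinary least-squares fit with a single predictor. The Python sketch below computes them from scratch; the function name and the four sample points are arbitrary numbers chosen for illustration and are not the regression data of this application.

```python
from typing import Dict, Sequence


def linear_fit_stats(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Least-squares fit y = intercept + slope * x, with R^2 and F (1, n-2 df)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = syy - ss_res
    r_squared = ss_reg / syy
    f_stat = float("inf") if ss_res == 0 else ss_reg / (ss_res / (n - 2))
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "f": f_stat}


# x: model estimates, y: observed relevance scores (arbitrary toy values).
stats = linear_fit_stats([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(stats)
```

A higher R² and a larger F for one model's estimates indicate that those estimates track the observed relevance scores more closely in a linear sense.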
Table 3. Linear regression statistics: model estimate as independent variable, relevance score as dependent variable
In addition, the two models are examined in an application of score normalization under a result diversification objective; data fusion of information retrieval results is chosen here. Using the same 3 groups of experimental data, document scores are normalized and the results are then fused with the CombSum method. Besides the two rank-based normalization methods above, the score-based 0-1 linear normalization method is also considered. Table 4 gives the evaluation of the fusion results, where best denotes the best result among the member systems of a group. Overall, the results fused with CombSum are better than the best member result in every group, which shows that CombSum is an effective fusion method for diversified retrieval results. Of the three normalization methods, 0-1 normalization performs worst, mainly because some member systems do not provide document scores. The reciprocal rank model and the logarithmic rank model both perform comparatively well. The reciprocal rank model scores higher than the logarithmic rank model on the 2009 and 2011 data and lower on the 2010 data. Averaged over the three data sets, the logarithmic rank model is slightly better than the reciprocal rank model, though the difference is not pronounced. This shows that rank-based score normalization is effective for data fusion under a result diversification objective. Although the logarithmic rank model does not outperform the reciprocal rank model in every case, it may be the better model in general.
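A minimal sketch of CombSum fusion over rank-normalized lists is given below. CombSum sums each document's normalized scores across the member lists; here a document missing from a list is assumed to contribute 0, scores are not clamped for very deep ranks, and the function names and the three toy lists are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def log_rank_score(rank: int) -> float:
    """Normalized score s = 1 - 0.2 * ln(rank + 1) for a 1-based rank."""
    return 1.0 - 0.2 * math.log(rank + 1)


def combsum_by_rank(ranked_lists: List[List[str]]) -> List[Tuple[str, float]]:
    """Fuse ranked lists by summing each document's rank-based scores."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += log_rank_score(rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]  # toy member results
for doc_id, score in combsum_by_rank(lists):
    print(doc_id, round(score, 4))
```

Because the normalization needs only ranks, the same code applies to member systems that report no scores, which is the case where 0-1 normalization cannot be used.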
Table 4. Evaluation of fusion results under the three score normalization methods, measured by ERR-IA@20

Claims (1)

1. A document score normalization method for information retrieval result diversification, characterized in that it is a nonlinear score normalization method based on the document ranking position rank, with the logarithm of rank at the core of the model, computed as follows:
s=1-0.2*ln(rank+1)
where rank denotes the document ranking position and s denotes the document's normalized score.
CN201410340344.7A 2014-07-16 2014-07-16 Score normalization method for diversity of information retrieval results Pending CN104112012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410340344.7A CN104112012A (en) 2014-07-16 2014-07-16 Score normalization method for diversity of information retrieval results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410340344.7A CN104112012A (en) 2014-07-16 2014-07-16 Score normalization method for diversity of information retrieval results

Publications (1)

Publication Number Publication Date
CN104112012A true CN104112012A (en) 2014-10-22

Family

ID=51708803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410340344.7A Pending CN104112012A (en) 2014-07-16 2014-07-16 Score normalization method for diversity of information retrieval results

Country Status (1)

Country Link
CN (1) CN104112012A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590317A (en) * 1992-05-27 1996-12-31 Hitachi, Ltd. Document information compression and retrieval system and document information registration and retrieval method
US20050055366A1 (en) * 2003-09-08 2005-03-10 Oki Electric Industry Co., Ltd. Document collection apparatus, document retrieval apparatus and document collection/retrieval system
CN103744984A (en) * 2014-01-15 2014-04-23 北京理工大学 Method of retrieving documents by semantic information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANNE LE CALVÉ;JACQUES SAVOY: "database merging strategy based on logistic regression", 《INFORMATION PROCESSING AND MANAGEMENT》 *

Similar Documents

Publication Publication Date Title
Karniouchina et al. Extending the firm vs. industry debate: Does industry life cycle stage matter?
US9798797B2 (en) Cluster method and apparatus based on user interest
EP2407897A1 (en) Device for determining internet activity
Haiduc et al. Automatic query performance assessment during the retrieval of software artifacts
Liang et al. Formal language models for finding groups of experts
CN104866554B (en) A kind of individuation search method and system based on socialization mark
CN103123653A (en) Search engine retrieving ordering method based on Bayesian classification learning
Winkler Approximate string comparator search strategies for very large administrative lists
CN110634027A (en) First-order user refined loss prediction method based on transfer learning
Zhang et al. A pomdp model for content-free document re-ranking
Zurell et al. Effects of functional traits on the prediction accuracy of species richness models
Nia et al. A cross-country analysis of macroeconomic responses to COVID-19 pandemic using Twitter sentiments
Hirschberg et al. Clusters of attributes and well‐being in the USA
He et al. Identifying user behavior on Twitter based on multi-scale entropy
CN104112012A (en) Score normalization method for diversity of information retrieval results
Tsai et al. Risk ranking from financial reports
Le et al. Top-k best probability queries and semantics ranking properties on probabilistic databases
Dietz et al. Time-aware evaluation of cumulative citation recommendation systems
Chen et al. LDA based semi-supervised learning from streaming short text
Li et al. An improved slope one algorithm for collaborative filtering
Deb et al. A correlation based imputation method for incomplete traffic accident data
Barnett Comments on “Chaotic monetary dynamics with confidence”
Daoud et al. Mining query-driven contexts for geographic and temporal search
Kubota et al. How many ground truths should we insert? having good quality of labeling tasks in crowdsourcing
McCrea et al. Conditional modelling of ring‐recovery data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141022