CN101582085B

CN101582085B - Collection selection method based on a distributed information retrieval system

Info

Publication number: CN101582085B
Application number: CN2009101460707A
Authority: CN
Inventors: 王秀红; 鞠时光
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2008-09-19
Filing date: 2009-06-05
Publication date: 2011-11-16
Anticipated expiration: 2029-06-05
Also published as: CN101582085A

Abstract

The invention relates to a collection selection method for distributed information retrieval and aims to provide a collection selection method, based on a distributed information retrieval system, that is both efficient and effective. The technical solution is a collection selection method based on a distributed information retrieval system, comprising the steps of calculating the degree to which each candidate database covers the data to be retrieved and determining the order in which databases are selected according to the degree of coverage. The method greatly reduces the time and space costs of computation during distributed information retrieval, maintains the recall and precision of the query results, and improves the efficiency and effectiveness of distributed information retrieval.

Description

A collection selection method based on a distributed information retrieval system

Technical field

The present invention relates to the field of computer information retrieval, and in particular to a collection selection method for distributed information retrieval.

Background technology

Distributed information retrieval is an important research direction in information retrieval. Its main research topics include collection selection, retrieval within a single collection, and result merging. A collection selection algorithm identifies the most relevant collections and restricts retrieval to them, so that querying only part of the document collections still yields good retrieval results. The quality of collection selection directly determines the quality of the final retrieval results. In the distributed information retrieval field, collection selection is also called database selection or resource selection.

For collection selection, three algorithms are best known. (1) The CORI (Collection Retrieval Inference Network) algorithm, proposed by Callan et al. in 1995, applies a Bayesian inference network originally used for judging document relevance. Each database is treated as one huge document, and databases are ranked in much the same way as documents are ranked in a conventional IR system. (2) The gGlOSS (Generalized Glossary of Servers' Server) algorithm, proposed by Gravano L. et al. in 1999, ranks databases by their goodness with respect to the input query. It estimates the number of documents in each database that exceed a certain threshold and determines the score of the database from that number, in an attempt to decide which sources to choose when several sources match. Vector-space and Boolean versions were developed. (3) The CVV (Cue-Validity Variance) algorithm, proposed by Yuwono and Lee in 1997, takes the characteristics of Internet queries into account and improves on the vector-space algorithm.

Kirsch, Steven T. and Chang, William I. (2000, US Patent 6,018,733) disclose a collection selection method for document retrieval based on specific queries. Kirsch (1999, US Patent 5,983,21) discloses automatic collection selection according to statistics on how many specific words a document collection contains. In addition, D'Souza (2004), Si (2004), and MinJie Zhang et al. (2006) proposed language-model-based collection selection methods that outperform the CORI algorithm. Sergey Chernov et al. (2007) proposed a collection selection algorithm based on word-frequency statistics and a pseudo-relevance language model. Wu and Crestani (2003) proposed a multi-objective collection selection method that jointly considers the relevance to the query, the computation time cost, and the chance of obtaining the same data from several resources. Zhang Gang (2007) converted the collection selection problem into a document retrieval problem and tried several document retrieval methods to solve it.

Summary of the invention

The object of the present invention is to provide an efficient and effective collection selection method based on a distributed information retrieval system.

The technical solution that achieves this object is as follows:

A collection selection method based on a distributed information retrieval system, comprising: calculating the degree to which each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage.

Calculating the degree to which a candidate database covers the data to be retrieved further comprises the following steps:

1. The importance score of each candidate database is calculated as a weighted sum over the retrieved data contained in that database (a greedy computation).

2. If a given retrieved item already appears in a previously selected database, the overlap between databases is taken into account when the importance scores of later databases are calculated: the item is no longer counted toward the score of any later database in which it appears again.

The greedy algorithm of step 1 is further specified as follows. Suppose there is a query, and the top n items (n is a natural number) of the merged and ranked retrieval results are denoted d1, d2, ..., dk, ..., dn. When the k-th item dk appears in a database C_i, its contribution to the importance score of that database is 1/k^β, where β is a positive rational number. The importance score of a database is the sum of the contribution scores of all such items that it contains.
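The weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `contribution` and `importance` are hypothetical and do not appear in the patent, and a database is represented simply as the set of ranks of the top-n documents it contains.

```python
def contribution(k: int, beta: float = 1.0) -> float:
    """Contribution 1/k^beta of the document ranked k-th (1-based) to a database holding it."""
    if k < 1:
        raise ValueError("rank k must be a positive integer")
    return 1.0 / k ** beta


def importance(ranks, beta: float = 1.0) -> float:
    """Importance score of a database: sum of the contributions of the ranked documents it holds."""
    # set() guards against a rank being listed twice (hypothetical input hygiene).
    return sum(contribution(k, beta) for k in set(ranks))


# A database holding the documents ranked 1, 2, 3 and 4, with beta = 1:
score = importance({1, 2, 3, 4})
print(round(score, 4))  # 2.0833, i.e. 1 + 1/2 + 1/3 + 1/4
```

A larger β concentrates the weight on the highest-ranked documents; for example, with β = 2 the second-ranked document contributes 1/4 instead of 1/2.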

For a given query, suppose M databases containing answers to the query are available for retrieval. The database set is C = (C_1, C_2, ..., C_M), where C_i is the i-th candidate database, i = 1, 2, ..., M. n is the number of top-ranked documents taken from the merged and ranked retrieval results. C' is the set of all m selected databases, ordered by relevance: C' = (C'_1, C'_2, ..., C'_m), where C'_j is the j-th selected database, j = 1, 2, ..., m. SC'_j is the score of the j-th selected database, j = 1, 2, ..., m. β is a constant parameter of the weight function and is a positive rational number; β = 1 in the examples herein. If document k is in database C_i, this is formally written kC_i, and its contribution score is denoted 1/(kC_i)^β.

If document k is in a selected database C'_j, this is formally written kC'_j, and its contribution score is denoted 1/(kC'_j)^β.

The importance score of a database is calculated as:

SC'_{1} = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_{i})^{\beta}}    (a)

SC'_{l} = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_{i} - \sum_{j=1}^{l-1} kC'_{j})^{\beta}}, \quad 2 \le l \le m    (b)

In both formulas the sum runs only over the documents k that database C_i contains. The subtraction in (b) indicates that documents already contained in the previously selected databases C'_1, ..., C'_{l-1} are excluded from the sum.

The scheme specifically comprises the following selection steps:

(1) Calculate the importance score of every database C_i that contains at least one of the n items above, select the database with the highest score as the first choice of the database set, and denote it C'_1.

(2) Exclude the databases already selected. For each remaining database C_i that contains at least one of the n items, calculate its importance score, omitting from the calculation any items contained in the databases already selected. Select the database with the highest importance score and denote it the second selected database, C'_2.

(3) Repeat step (2) until the m-th selection, C'_m. Database selection ends when the m selected databases together cover all n items.
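The three selection steps can be read as a weighted greedy set-cover loop. The following Python sketch is one possible reading, not the patented implementation. The function name `select_collections`, the dictionary input format, the tie-breaking rule (first database in insertion order wins), and the early exit when no remaining database adds anything are all assumptions made for illustration.

```python
def select_collections(collections, n, beta=1.0):
    """Greedy, coverage-aware ordering of databases.

    collections: dict mapping a database name to the set of ranks (1..n)
                 of the merged top-n documents that it contains.
    Returns a list of (name, score) pairs in selection order.
    """
    target = set(range(1, n + 1))
    remaining = {name: set(docs) & target for name, docs in collections.items()}
    covered = set()
    selected = []
    # Step (3): stop once the selected databases jointly cover all n items.
    while remaining and covered != target:
        # Steps (1)/(2): score each unselected database on uncovered items only.
        scores = {
            name: sum(1.0 / k ** beta for k in docs - covered)
            for name, docs in remaining.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:  # assumption: stop if nothing new can be covered
            break
        selected.append((best, scores[best]))
        covered |= remaining.pop(best)
    return selected


# Toy input (hypothetical): three databases over the top 3 documents.
order = select_collections({"A": {1, 2}, "B": {2, 3}, "C": {1}}, n=3)
print([name for name, _ in order])  # ['A', 'B']
```

In the toy input, database C is never selected: its only document is already covered by A, so its score drops to zero in the second round.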

Compared with the prior art, the method of the invention has the following advantages:

1. It uses set covering and a greedy algorithm rather than earlier information retrieval techniques such as digital fingerprints, word-frequency statistics, or language models, which greatly reduces the computational cost of retrieval.

2. It takes into account the overlap that actually exists between databases, which optimizes the distributed collection selection result and improves retrieval efficiency and effectiveness.

3. It assigns each item a weight according to its position in the merged and ranked retrieval results and uses that weight to compute the item's contribution score to a database, which improves the recall and precision of distributed information retrieval.

Description of drawings

Fig. 1 is a structural diagram of the distributed information retrieval system;

Fig. 2 is a flow diagram of the collection selection process.

Embodiment

The invention is further described below with reference to the accompanying drawings.

As shown in Fig. 1, a distributed information retrieval system comprises client 1, client 2, ..., client n and information retrieval server 1, information retrieval server 2, ..., information retrieval server n. The client group and the information retrieval server group are connected by a network. The client group supplies the queries, and the information retrieval server group provides all the databases that contain answers to the queries. The invention provides a collection selection method for information retrieval based on this system.

For the following embodiments, the relevant symbols and their meanings are first defined:

Table 1. Symbols and their meanings

If document k is in C_i, this is formally written kC_i, and its contribution score is denoted 1/(kC_i)^β. If document k is in C'_j, this is formally written kC'_j, and its contribution score is denoted 1/(kC'_j)^β. The importance score of a database is calculated as:

SC'_{1} = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_{i})^{\beta}}    (a)

SC'_{l} = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_{i} - \sum_{j=1}^{l-1} kC'_{j})^{\beta}}, \quad 2 \le l \le m    (b)

Embodiment 1

As shown in Fig. 2, suppose n = 10, that is, the top 10 documents of the merged and ranked results are taken, and each document is identified by its rank. Suppose the candidate database set is as follows: C_1 contains documents 1, 2, 3, 4; C_2 contains documents 2, 3, 7, 8; C_3 contains documents 1, 5, 6, 7; C_4 contains documents 4, 5, 6, 9; C_5 contains documents 9, 10. No other database contains any of the desired retrieval results, so none of them is considered as a candidate. The contribution score of the k-th document to a database that contains it is 1/k (β = 1 is assumed here). The databases are selected as follows:

First selection: for every database, sum the contribution scores of the documents it contains, and take the maximum. From the contribution score of each document k to each database, SC'_1 = max{1 + 1/2 + 1/3 + 1/4, 1/2 + 1/3 + 1/7 + 1/8, 1 + 1/5 + 1/6 + 1/7, 1/4 + 1/5 + 1/6 + 1/9, 1/9 + 1/10} = 1 + 1/2 + 1/3 + 1/4. The most important database, and therefore the first one selected, is the one containing documents 1, 2, 3, 4, denoted C'_1 = C_1 = {1, 2, 3, 4}.

Second selection: C_1 has already been selected. For each of the 4 remaining databases, remove the documents that appear in the first selected database, then recalculate the importance score: SC'_2 = max{1/7 + 1/8, 1/5 + 1/6 + 1/7, 1/5 + 1/6 + 1/9, 1/9 + 1/10} = 1/5 + 1/6 + 1/7.

The second database selected is therefore the one containing documents 1, 5, 6, 7, denoted C'_2 = C_3 = {1, 5, 6, 7}.

Third selection: C_1 and C_3 have already been selected. For each of the 3 remaining databases, remove the documents that appear in the first and second selected databases, then recalculate the importance score: SC'_3 = max{1/8, 1/9, 1/9 + 1/10} = 1/9 + 1/10.

The third database selected is therefore the one containing documents 9, 10, denoted C'_3 = C_5 = {9, 10}.

Fourth selection: C_1, C_3, and C_5 have already been selected. For each of the 2 remaining databases, remove the documents that appear in the first, second, and third selected databases, then recalculate the importance score: SC'_4 = max{1/8, 0} = 1/8.

The fourth database selected is therefore the one containing documents 2, 3, 7, 8, denoted C'_4 = C_2 = {2, 3, 7, 8}.

Fifth selection: C_1, C_2, C_3, and C_5 have already been selected. For the 1 remaining database, C_4, remove the documents that appear in the first, second, third, and fourth selected databases, then recalculate the importance score. All of its documents (4, 5, 6, 9) have already been covered, so its score is 0.

The selection from the remaining databases is therefore empty, and collection selection ends here. A total of 4 databases are selected, so m = 4.
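The arithmetic of Embodiment 1 can be checked with exact fractions. The sketch below uses the five databases given in this embodiment and β = 1. The function name `trace_selection`, the dictionary layout, and the use of Python's `fractions` module are illustrative assumptions rather than part of the patent.

```python
from fractions import Fraction

# Embodiment 1 data: database -> ranks of the top-10 documents it contains.
DATABASES = {
    "C1": {1, 2, 3, 4},
    "C2": {2, 3, 7, 8},
    "C3": {1, 5, 6, 7},
    "C4": {4, 5, 6, 9},
    "C5": {9, 10},
}


def trace_selection(databases, beta=1):
    """Return [(chosen, {db: exact score})] for each greedy round (integer beta only)."""
    remaining = dict(databases)
    covered, rounds = set(), []
    while remaining:
        # Score every unselected database on the documents not yet covered.
        scores = {
            name: sum((Fraction(1, k ** beta) for k in docs - covered), Fraction(0))
            for name, docs in remaining.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:  # the "empty" fifth selection: nothing left to gain
            break
        rounds.append((best, scores))
        covered |= remaining.pop(best)
    return rounds


rounds = trace_selection(DATABASES)
for number, (best, scores) in enumerate(rounds, start=1):
    shown = ", ".join(f"{name}={float(s):.4f}" for name, s in scores.items())
    print(f"round {number}: {shown} -> select {best}")
```

For this data the loop selects C1, C3, C5, and C2 in that order and then stops, because the only remaining database, C4, scores 0. This agrees with m = 4 above.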

Embodiment 2

Suppose the top 100 documents are taken, and each document is again identified by its rank in the retrieval results. Suppose 60 databases are available for retrieval, and a random-number method is used so that each of the 60 databases contains a random subset of these 100 documents. Applying the method of the invention to the 60 databases produced by one random draw selects an optimally combined set of 24 databases. When all 100 documents are retrieved from this set, the overlap among the selected databases is minimal. The experimental results show how the 24 selected databases complement one another to cover the 100 documents. The first database selected, which is also the most important one, covers documents 1, 9, 11, 33, 46, 53, 55, 62, 64, 82, 83, 86, 87, 91, 94, and 96.
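An experiment of this kind can be imitated with a small simulation. The sketch below is hypothetical throughout: the patent does not state how the random subsets were drawn, so the subset sizes (5 to 20 documents), the seed, and the function names `random_databases` and `greedy_cover` are assumptions. The run is not expected to reproduce the 24 databases reported above.

```python
import random


def random_databases(num_dbs=60, n=100, min_size=5, max_size=20, seed=0):
    """Build num_dbs databases, each holding a random subset of ranks 1..n (assumed sizes)."""
    rng = random.Random(seed)
    return {
        f"C{i}": set(rng.sample(range(1, n + 1), rng.randint(min_size, max_size)))
        for i in range(1, num_dbs + 1)
    }


def greedy_cover(databases, beta=1.0):
    """Select databases greedily by the 1/k^beta weight of their still-uncovered documents."""
    remaining = dict(databases)
    covered, order = set(), []
    while remaining:
        def gain(name):
            return sum(1.0 / k ** beta for k in remaining[name] - covered)

        best = max(remaining, key=gain)
        if gain(best) == 0:  # every remaining database is fully overlapped
            break
        order.append(best)
        covered |= remaining.pop(best)
    return order, covered


dbs = random_databases()
order, covered = greedy_cover(dbs)
print(len(order), "databases selected, covering", len(covered), "of 100 documents")
```

Changing the seed or the size bounds changes how many databases are needed, which is one way to explore how overlap between databases affects the length of the selected list.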

The present invention selects an appropriate subset of information resources from a large, dispersed, and heterogeneous set of information resources in order to satisfy the user's query. It finds suitable information and, while guaranteeing a given level of recall and precision, reduces computational cost and improves retrieval efficiency.

Claims

1. A collection selection method based on a distributed information retrieval system, characterized in that the method comprises: using set covering, calculating the degree to which each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage; wherein calculating the degree to which a candidate database covers the data to be retrieved comprises calculating the importance score of each candidate database as a weighted sum over the retrieved data contained in that database, specifically comprising:

(1) supposing there is a query, the top n items of the merged and ranked retrieval results, n being a natural number, are denoted d1, d2, ..., dk, ..., dn; when the k-th item dk appears in a database C_i, C_i being the i-th candidate database, i = 1, 2, ..., M, its contribution to the importance score of that database is 1/k^β, β being a positive rational number; the importance score of a database is the sum of the contribution scores of all such items that it contains;

(2) calculating the importance score of every database C_i, i = 1, 2, ..., M, that contains at least one of the above top n items, selecting the database with the highest score as the first choice of the database set, and denoting it C'_1;

(3) excluding the databases already selected, and calculating the importance score of each remaining database C_i that contains at least one of the above top n items, wherein the items contained in the databases already selected are removed from the calculation and are no longer counted toward the importance score of database C_i; selecting the database with the highest importance score and denoting it the second selected database, C'_2;

(4) repeating step (3) until the m-th selection, C'_m, and ending the database selection step when the m selected databases together cover all of the above top n items.