CN101582085A - Set option method based on distributed information retrieval system - Google Patents

Set option method based on distributed information retrieval system

Info

Publication number
CN101582085A
Authority
CN
China
Prior art keywords
database
data
score value
importance
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2009101460707A
Other languages
Chinese (zh)
Other versions
CN101582085B (en)
Inventor
王秀红
鞠时光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN2009101460707A
Publication of CN101582085A
Application granted
Publication of CN101582085B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a set selection method for distributed information retrieval and aims to provide a set selection method, based on a distributed information retrieval system, that is both efficient and effective. The technical solution is a set selection method based on a distributed information retrieval system, comprising the steps of calculating how well each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage. The method greatly reduces the time and space costs of computation during distributed information retrieval, maintains the recall and precision of query results, and improves the efficiency and effectiveness of distributed information retrieval.

Description

A set selection method based on a distributed information retrieval system
Technical field
The present invention relates to the field of computer information retrieval, and specifically to a set selection method in distributed information retrieval.
Background
Distributed information retrieval is an important research direction within information retrieval. Its main research topics include set selection, retrieval within a single data collection, and result merging. A set selection algorithm finds the most relevant data collections and searches only those, so that querying a subset of the document collections still returns good retrieval results. The quality of set selection directly determines the quality of the final retrieval results. In the distributed information retrieval field, set selection is also called database selection or resource selection.
The best-known set selection algorithms are the following three. (1) CORI (Collection Retrieval Inference Network): proposed by Callan et al. in 1995, this method applies to collections the Bayesian inference network originally used to judge the relevance of documents. Each information collection is treated as one huge document, and collections are ranked in much the same way as documents are ranked in a conventional IR system. (2) gGlOSS (Generalized Glossary of Servers Server): proposed by Gravano L. et al. in 1999, this method ranks information collections according to how well each collection suits the input query. It estimates the number of documents in each collection that exceed a given threshold and scores the collection by that number, in an attempt to choose suitable sources when many sources match. Vector-space and Boolean versions were developed. (3) CVV (Cue-Validity-Variance): proposed by Yuwono and Lee in 1997, this method takes the characteristics of Internet queries into account and improves on the vector-space algorithm.
Kirsch, Steven T. and Chang, William I. (2000, US Patent 6,018,733) disclose a set selection method for document retrieval based on a specific query. Kirsch (1999, US Patent 5,983,21) discloses a method that selects collections automatically according to statistics on the specific words the document collections contain. In addition, D'Souza (2004), Si (2004), and MinJie Zhang et al. (2006) proposed set selection methods based on language models that outperform the CORI algorithm. Sergey Chernov et al. (2007) proposed a set selection algorithm based on word-frequency statistics and a pseudo-relevance language model. Wu and Crestani (2003) proposed a multi-objective set selection method that jointly considers relevance to the query, computation time cost, and the likelihood of obtaining the same data from several resources. Zhang Gang (2007) converted the set selection problem into a document retrieval problem and tried several document retrieval methods to solve it.
Summary of the invention
The object of the present invention is to provide an efficient and effective set selection method based on a distributed information retrieval system.
The technical scheme that achieves this object is as follows.
A set selection method based on a distributed information retrieval system, comprising: calculating the degree to which each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to that degree of coverage.
Calculating the degree to which a candidate database covers the data to be retrieved further comprises the following steps:
1. Calculate the importance score of each candidate database as a weighted sum over the retrieved data contained in that database (a greedy computation).
2. Take the overlap between databases into account: if a retrieved data item already appears in a previously selected database, it is no longer counted when it appears again in the score of a later database.
The greedy computation in step 1 is as follows. Suppose there is a query, and denote the top n data items (n a natural number) of the merged and ranked retrieval results as d1, d2, ..., dk, ..., dn. When the k-th data item dk appears in a database C_j, its contribution to the importance score of that database is 1/k^β, where β is a positive rational number. The importance score of a database is the sum of the contributions of all such data items it contains.
For a given query, suppose M databases contain answers to the query. The set of candidate databases is C = (C_1, C_2, ..., C_M), where C_i is the i-th candidate database, i = 1, 2, ..., M. n is the number of top-ranked documents taken from the merged and ranked retrieval results. C' is the set of all m selected databases, ordered by relevance, C' = (C'_1, C'_2, ..., C'_m), where C'_j is the j-th selected database and SC'_j is its score, j = 1, 2, ..., m. β is a constant parameter of the weighting function and is a positive rational number; β = 1 in the examples here. If document k is in database C_i, this is written formally as kC_i; if document k is in database C'_j, this is written formally as kC'_j. In either case the contribution score of document k is 1/k^β, as defined above.
The importance score of a database is calculated as:
$$SC'_1 = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_i)^{\beta}} \qquad (a)$$
$$SC'_l = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{\left(kC_i - \sum_{j=1}^{l-1} kC'_j\right)^{\beta}}, \quad 2 \le l \le m \qquad (b)$$
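As a minimal sketch of the summand in formulas (a) and (b), the score of one database can be computed by summing 1/k^β over the ranks it holds, skipping ranks already supplied by earlier selections. The function name `importance_score` and its parameters are illustrative assumptions, not taken from the patent, and reading formula (b) as "leave out already covered documents" follows step 2 above.

```python
def importance_score(db_ranks, covered=frozenset(), beta=1.0):
    """Importance score of one database.

    db_ranks: ranks k (1-based) of the top-n documents the database holds.
    covered:  ranks already supplied by previously selected databases.
    beta:     positive weighting parameter (the patent's examples use 1).
    """
    new_ranks = sorted(set(db_ranks) - set(covered))
    # Each newly covered document at rank k contributes 1 / k**beta.
    return sum(1.0 / k ** beta for k in new_ranks)


# Hypothetical database holding the documents ranked 1, 2, 3 and 4.
print(importance_score({1, 2, 3, 4}))          # 1 + 1/2 + 1/3 + 1/4
print(importance_score({1, 2, 3, 4}, {1, 2}))  # only ranks 3 and 4 count
```

Passing the covered ranks explicitly keeps the scoring step separate from the selection loop, so the same function serves both formula (a), with nothing covered, and formula (b).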
The above scheme comprises the following selection steps:
(1) Calculate the importance score of every database C_i that contains any of the top n data items; select the database with the highest score as the first selected database, denoted C'_1.
(2) Set aside the databases already selected and calculate the importance score of each remaining database C_i that contains any of the n data items, excluding from the calculation the data items contained in the databases already selected; select the database with the highest importance score as the second selected database, denoted C'_2.
(3) Repeat step (2) until the m-th database C'_m has been selected; the database selection ends when the 1st to m-th selected databases together cover all n data items.
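The three selection steps can be sketched as a greedy loop. This is an illustrative sketch under assumptions: the function name `select_databases`, the dictionary input format, and the database names in the usage example are hypothetical; ties are broken by input order, which the patent does not specify; and the loop also stops early if no remaining database adds a new document.

```python
def select_databases(databases, n, beta=1.0):
    """Greedy selection; returns [(name, score), ...] in selection order.

    databases: dict mapping a database name to the set of ranks it holds.
    n:         number of top-ranked documents to cover (ranks 1..n).
    """
    top = set(range(1, n + 1))
    # Step (1): only databases holding at least one top-n document compete.
    remaining = {name: set(docs) & top for name, docs in databases.items()}
    remaining = {name: docs for name, docs in remaining.items() if docs}
    covered, picked = set(), []
    while remaining and covered != top:
        # Step (2): score each remaining database on uncovered documents only.
        scores = {name: sum(1.0 / k ** beta for k in docs - covered)
                  for name, docs in remaining.items()}
        best = max(scores, key=scores.get)
        if scores[best] == 0:      # nothing new can be added
            break
        picked.append((best, scores[best]))
        covered |= remaining.pop(best)
    # Step (3): the loop ends once the picked databases cover ranks 1..n.
    return picked


# Hypothetical toy input, not taken from the patent.
toy = {"A": {1, 2}, "B": {2, 3}, "C": {4}}
print([name for name, _ in select_databases(toy, n=4)])  # ['A', 'B', 'C']
```

In the toy input, A wins the first round with 1 + 1/2; B then scores 1/3 for document 3 alone, which beats C's 1/4, and C is taken last to cover document 4.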
Compared with the prior art, the method of the invention has the following advantages:
1. It uses set covering and a greedy algorithm rather than earlier information retrieval techniques such as "digital fingerprints", "word-frequency statistics", or "language models", which greatly reduces the computational cost of retrieval.
2. It takes into account the overlap that actually exists between databases, which optimizes the distributed set selection result and improves retrieval efficiency and effectiveness.
3. It assigns each data item a weight according to its position in the merged and ranked retrieval results and uses that weight to compute the item's contribution score to a database, which improves the recall and precision of distributed information retrieval.
Description of drawings
Fig. 1 is a structural diagram of the distributed information retrieval system;
Fig. 2 is a flow diagram of the set selection process.
Embodiment
The invention is further described below with reference to the drawings.
As shown in Fig. 1, a distributed information retrieval system comprises client 1, client 2, ..., client n and information retrieval server 1, information retrieval server 2, ..., information retrieval server n. The client group and the information retrieval server group are connected through a network. The client group supplies the queries, and the information retrieval server group provides all databases that contain answers to the queries. The invention provides a set selection method for performing information retrieval on this system.
For the following embodiments, the relevant symbols and their meanings are defined first:
Table 1 Symbols and their meanings
Symbol   Meaning
M        Number of databases that contain answers to the query
C_i      The i-th candidate database, i = 1, 2, ..., M
C        Set of databases available for retrieval for a query, C = (C_1, C_2, ..., C_M)
m        Number of databases finally selected
C'_j     The j-th selected database, in order of relevance, j = 1, 2, ..., m
SC'_j    Score of the j-th selected database, j = 1, 2, ..., m
C'       Set of all selected databases, C' = (C'_1, C'_2, ..., C'_m)
n        Number of top-ranked documents taken from the merged and ranked retrieval results
kC_i     The document ranked k-th in the retrieval results appears in the i-th candidate database C_i
kC'_j    The document ranked k-th in the retrieval results appears in the j-th selected database C'_j
β        Constant parameter of the weighting function, a positive rational number; β = 1 in the examples here
If document k is in C_i, this is written formally as kC_i; if document k is in C'_j, this is written formally as kC'_j. In either case the contribution score of document k is 1/k^β, as defined above.
The importance score of a database is calculated as:
$$SC'_1 = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_i)^{\beta}} \qquad (a)$$
$$SC'_l = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{\left(kC_i - \sum_{j=1}^{l-1} kC'_j\right)^{\beta}}, \quad 2 \le l \le m \qquad (b)$$
Embodiment 1
As shown in Fig. 2, suppose n = 10, that is, take the top 10 documents of the merged and ranked results, and identify each document by its rank number. Suppose the candidate databases are as follows: C_1 contains documents 1, 2, 3, 4; C_2 contains documents 2, 3, 7, 8; C_3 contains documents 1, 5, 6, 7; C_4 contains documents 4, 5, 6, 9; C_5 contains documents 9, 10. The other databases contain none of the desired retrieval results and are not considered as candidates. The contribution score of the k-th document to a database containing it is 1/k (β = 1 here). The databases are selected as follows.
First database: for every database, sum the contribution scores of the documents it contains, then take the maximum. From the contribution score of each document k, SC'_1 = max{1+1/2+1/3+1/4, 1/2+1/3+1/7+1/8, 1+1/5+1/6+1/7, 1/4+1/5+1/6+1/9, 1/9+1/10} = 1+1/2+1/3+1/4. The most important database, and therefore the first one selected, is the database containing documents 1, 2, 3, 4, denoted C'_1 = C_1 = {1, 2, 3, 4}.
Second database: excluding the already selected C_1, for each of the 4 remaining databases remove the documents that already appear in the first selected database, then calculate the importance score.
[Formula images A20091014607000081 and A20091014607000082: second-round score calculation]
The second selected database is therefore the database containing documents 1, 5, 6, 7, denoted C'_2 = C_3 = {1, 5, 6, 7}.
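The formula images for this round are not available in this text. Assuming β = 1 and the database contents stated at the start of this embodiment, the second-round scores can be recomputed as follows; this is a reconstruction from the stated data, not a copy of the original figures. Documents 1 to 4 are already covered by the first selected database and are left out.

```latex
\begin{aligned}
C_2 \setminus C'_1 &= \{7, 8\}:    & \tfrac17 + \tfrac18              &= \tfrac{15}{56}   \approx 0.268 \\
C_3 \setminus C'_1 &= \{5, 6, 7\}: & \tfrac15 + \tfrac16 + \tfrac17   &= \tfrac{107}{210} \approx 0.510 \\
C_4 \setminus C'_1 &= \{5, 6, 9\}: & \tfrac15 + \tfrac16 + \tfrac19   &= \tfrac{43}{90}   \approx 0.478 \\
C_5 \setminus C'_1 &= \{9, 10\}:   & \tfrac19 + \tfrac1{10}           &= \tfrac{19}{90}   \approx 0.211
\end{aligned}
```

The largest of the four values belongs to C_3, which agrees with the selection stated in the text.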
Third database: excluding the already selected C_1 and C_3, for each of the 3 remaining databases remove the documents that already appear in the first and second selected databases, then calculate the importance score.
[Formula image A20091014607000083: third-round score calculation]
The third selected database is therefore the database containing documents 9, 10, denoted C'_3 = C_5 = {9, 10}.
Fourth database: excluding the already selected C_1, C_3 and C_5, for each of the 2 remaining databases remove the documents that already appear in the first, second and third selected databases, then calculate the importance score.
[Formula image A20091014607000084: fourth-round score calculation]
The fourth selected database is therefore the database containing documents 2, 3, 7, 8, denoted C'_4 = C_2 = {2, 3, 7, 8}.
Fifth database: excluding the already selected C_1, C_2, C_3 and C_5, for the 1 remaining database remove the documents that already appear in the first, second, third and fourth selected databases, then calculate the importance score.
[Formula image A20091014607000086: fifth-round score calculation]
This shows that the remaining database contributes no new documents, so set selection ends here. A total of 4 databases are selected, m = 4.
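The formula images for rounds 3 to 5 are not available in this text. Assuming β = 1 and the database contents stated at the start of this embodiment, those rounds can be recomputed as below; this is a reconstruction from the stated data, not a copy of the original figures. The arrow shows which documents of each database are still uncovered.

```latex
\begin{aligned}
\text{Round 3 (covered } \{1,\dots,7\}\text{):}\quad
  & C_2 \to \{8\}: \tfrac18 = 0.125, \quad
    C_4 \to \{9\}: \tfrac19 \approx 0.111, \quad
    C_5 \to \{9,10\}: \tfrac19 + \tfrac1{10} = \tfrac{19}{90} \approx 0.211
  && \Rightarrow C'_3 = C_5 \\
\text{Round 4 (covered } \{1,\dots,7,9,10\}\text{):}\quad
  & C_2 \to \{8\}: \tfrac18 = 0.125, \quad
    C_4 \to \varnothing: 0
  && \Rightarrow C'_4 = C_2 \\
\text{Round 5 (covered } \{1,\dots,10\}\text{):}\quad
  & C_4 \to \varnothing: 0
  && \Rightarrow \text{stop, } m = 4
\end{aligned}
```

After round 4 the selected databases C_1, C_3, C_5 and C_2 together hold all ten documents, so C_4 has nothing left to contribute.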
Embodiment 2
Suppose the top 100 documents are taken, each again identified by its rank number in the retrieval results. Suppose 60 databases are available for retrieval, and a random-number method is used so that each of these 60 databases contains a random subset of the 100 documents. For one such random draw, the method of the invention selects an optimal combination of 24 of the 60 databases; these suffice to retrieve all 100 documents while keeping the overlap among the selected databases to a minimum. The experimental result shows how the 24 selected databases complement one another to cover the 100 documents. The first database selected, which is also the most important one, covers documents 1, 9, 11, 33, 46, 53, 55, 62, 64, 82, 83, 86, 87, 91, 94, and 96.
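An experiment of this shape can be approximated with a short simulation. Everything below is an assumption made for illustration: the patent does not state how many documents each random database holds (5 to 20 is used here), the seed is arbitrary, and the names `greedy_select` and `DB1` ... `DB60` are hypothetical. The number of databases selected depends on the random draw and will generally differ from the 24 reported in the text.

```python
import random


def greedy_select(databases, n, beta=1.0):
    """Pick databases greedily by rank-weighted gain on uncovered documents."""
    top, covered, picked = set(range(1, n + 1)), set(), []
    remaining = {name: docs & top for name, docs in databases.items()}
    while covered != top:
        gains = {name: sum(1.0 / k ** beta for k in docs - covered)
                 for name, docs in remaining.items()}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] == 0:   # no database adds anything new
            break
        picked.append(best)
        covered |= remaining.pop(best)
    return picked, covered


rng = random.Random(0)            # fixed seed so the run is repeatable
n_docs, n_dbs = 100, 60
# Each database receives a random subset of the 100 ranked documents.
databases = {
    f"DB{i}": set(rng.sample(range(1, n_docs + 1), rng.randint(5, 20)))
    for i in range(1, n_dbs + 1)
}

picked, covered = greedy_select(databases, n_docs)
reachable = set().union(*databases.values())
print(len(picked), "of", n_dbs, "databases selected")
print(len(covered), "of", len(reachable), "reachable documents covered")
```

Because a random draw may leave some documents in no database at all, the sketch compares coverage against the documents that are actually reachable rather than against all 100.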
From a large, dispersed, and heterogeneous set of information resources, the invention selects an appropriate subset of resources in which suitable information can be found to satisfy the user's query. It reduces computational cost while maintaining recall and precision, and so improves retrieval efficiency.

Claims (5)

1. A set selection method based on a distributed information retrieval system, characterized in that the method comprises: using set covering, calculating the degree to which each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to that degree of coverage.
2. The set selection method according to claim 1, characterized in that calculating the degree to which a candidate database covers the data to be retrieved comprises the following steps:
(1) calculating the importance score of each candidate database as a weighted sum over the retrieved data contained in that database;
(2) for a given retrieved data item, if it already appears in a previously selected database, no longer counting it when it appears again in the importance score of a later database.
3. The set selection method according to claim 2, characterized in that
the algorithm of step (1) is further as follows: suppose there is a query, and denote the top n data items (n a natural number) of the merged and ranked retrieval results as d1, d2, ..., dk, ..., dn; when the k-th data item dk appears in a database C_j, its contribution to the importance score of that database is 1/k^β, where β is a positive rational number; the importance score of a database is the sum of the contributions of all such data items it contains.
4. The set selection method according to claim 1, characterized in that the method comprises the following selection steps:
(1) calculating the importance score of every database C_i that contains any of the top n data items, and selecting the database with the highest score as the first selected database, denoted C'_1;
(2) setting aside the databases already selected and calculating the importance score of each remaining database C_i that contains any of the n data items, excluding from the calculation the data items contained in the databases already selected; selecting the database with the highest importance score as the second selected database, denoted C'_2;
(3) repeating step (2) until the m-th database C'_m has been selected, the database selection ending when the 1st to m-th selected databases together cover all n data items.
5. The set selection method according to claim 4, characterized in that, for a given query, M databases contain answers to the query, and the set of candidate databases is C = (C_1, C_2, ..., C_M), where C_i is the i-th candidate database, i = 1, 2, ..., M; n is the number of top-ranked documents taken from the merged and ranked retrieval results; C' is the set of all m selected databases, ordered by relevance, C' = (C'_1, C'_2, ..., C'_m), where C'_j is the j-th selected database and SC'_j is its score, j = 1, 2, ..., m; β is a constant parameter of the weighting function and is a positive rational number; when document k appears in database C_i, this is written formally as kC_i, and its contribution score is denoted
[Formula image A2009101460700003C1]
when document k appears in database C'_j, this is written formally as kC'_j, and its contribution score is denoted
[Formula image A2009101460700003C2]
The importance score of a database is calculated as:
$$SC'_1 = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_i)^{\beta}} \qquad (a)$$
$$SC'_l = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{\left(kC_i - \sum_{j=1}^{l-1} kC'_j\right)^{\beta}}, \quad 2 \le l \le m \qquad (b).$$
CN2009101460707A 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system Expired - Fee Related CN101582085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101460707A CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN200810156053 2008-09-19
CN200810156053.7 2008-09-19
CN2009101460707A CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Publications (2)

Publication Number Publication Date
CN101582085A true CN101582085A (en) 2009-11-18
CN101582085B CN101582085B (en) 2011-11-16

Family

ID=41364232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101460707A Expired - Fee Related CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Country Status (1)

Country Link
CN (1) CN101582085B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521350A (en) * 2011-12-12 2012-06-27 浙江大学 Selection method of distributed information retrieval sets based on historical click data
CN102521350B (en) * 2011-12-12 2014-07-16 浙江大学 Selection method of distributed information retrieval sets based on historical click data
CN104050235A (en) * 2014-03-27 2014-09-17 浙江大学 Distributed information retrieval method based on set selection
CN104050235B (en) * 2014-03-27 2017-02-22 浙江大学 Distributed information retrieval method based on set selection
CN104699733A (en) * 2014-10-28 2015-06-10 电信科学技术第十研究所 Method and device for computing full-text retrieval recall ratio
CN104699733B (en) * 2014-10-28 2018-07-24 电信科学技术第十研究所 A kind of method and device calculating full-text search recall ratio
CN105956010A (en) * 2016-04-20 2016-09-21 浙江大学 Distributed information retrieval set selection method based on distributed representation and local ordering
CN105956010B (en) * 2016-04-20 2019-03-26 浙江大学 Distributed information retrieval set option method based on distributed characterization and partial ordering
CN110781204A (en) * 2019-09-09 2020-02-11 腾讯大地通途(北京)科技有限公司 Identification information determination method, device, equipment and storage medium of target object
CN110781204B (en) * 2019-09-09 2024-02-20 腾讯大地通途(北京)科技有限公司 Identification information determining method, device, equipment and storage medium of target object

Also Published As

Publication number Publication date
CN101582085B (en) 2011-11-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111116

Termination date: 20130605