CN101582085B - Set option method based on distributed information retrieval system - Google Patents


Info

Publication number
CN101582085B
Authority
CN
China
Prior art keywords
database
data
score value
information retrieval
distributed information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009101460707A
Other languages
Chinese (zh)
Other versions
CN101582085A (en)
Inventor
王秀红
鞠时光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN2009101460707A
Publication of CN101582085A
Application granted
Publication of CN101582085B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a set selection (set option) method for distributed information retrieval and aims to provide a set selection method, based on a distributed information retrieval system, that offers high retrieval efficiency and good results. The technical solution is a set selection method based on a distributed information retrieval system, comprising the steps of calculating how well each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage. The method greatly reduces the time and space costs of computation during distributed information retrieval, guarantees the recall and precision of the query results, and improves the efficiency and effectiveness of distributed information retrieval.

Description

A set selection method based on a distributed information retrieval system
Technical field
The present invention relates to the field of computer information retrieval, and specifically to a set selection method for distributed information retrieval.
Background technology
Distributed information retrieval is an important research direction in information retrieval. Its main research topics include set selection, retrieval within a single data set, and result merging. A set selection algorithm finds the most relevant data sets and retrieves from them, so that only part of the document collections is queried while good retrieval results are still returned. The quality of set selection directly determines the quality of the final retrieval results. In the distributed information retrieval field, set selection is also called database selection or resource selection.
Three well-known set selection algorithms are the following. (1) CORI (Collection Retrieval Inference Network): proposed by Callan et al. in 1995, it applies to whole collections the Bayesian inference network originally used to judge document relevance. Each collection is treated as one huge document, and collections are ranked in much the same way as documents are ranked in a conventional IR system. (2) gGlOSS (Generalized Glossary of Servers' Server): proposed by Gravano et al. in 1999, it ranks collections according to how well each collection suits the input query. It estimates the number of documents in each collection that exceed a certain threshold and determines the collection's score from that number, in an attempt to solve the problem of choosing suitable sources when several sources match. Vector space and Boolean versions were developed. (3) CVV (Cue-Validity Variance): proposed by Yuwono and Lee in 1997, it takes the characteristics of Internet queries into account and improves on the vector space algorithm.
Steven T. Kirsch and William I. Chang (2000, US Patent 6,018,733) disclose a set selection method for document retrieval based on ad hoc queries. Kirsch (1999, US Patent 5,983,21) discloses automatic selection of collections according to statistics on how often specific words occur in each document collection. In addition, D'Souza (2004), Si (2004), and MinJie Zhang et al. (2006) propose set selection methods based on language models that outperform the CORI algorithm. Sergey Chernov et al. (2007) propose a set selection algorithm based on word frequency statistics and pseudo-relevance language models. Wu and Crestani (2003) propose a multi-objective set selection method that jointly considers relevance to the query, computation time cost, and the chance of obtaining the same data from several resources. Zhang Gang (2007) converts the set selection problem into a document retrieval problem and tries several document retrieval methods to solve it.
Summary of the invention
The object of the present invention is to provide an efficient and effective set selection method based on a distributed information retrieval system.
The technical solution that achieves this object is as follows:
A set selection method based on a distributed information retrieval system, comprising: calculating how well each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage.
Calculating the degree to which a candidate database covers the data to be retrieved further comprises the following steps:
1. Calculate the importance score of each candidate database as a weighted sum over the retrieved data contained in that database (a greedy calculation).
2. Take the overlap between databases into account: if a retrieved data item has already appeared in a previously selected database, it is no longer counted when it appears again in the score of a later database.
The greedy algorithm of step 1 is further specified as follows. Suppose that, for a given query, the first n data items of the merged and sorted retrieval results (n being a natural number) are denoted d1, d2, ..., dk, ..., dn. When the k-th data item dk appears in a database C_i, it contributes 1/k^β to the importance score of that database, where β is a positive rational number. The importance score of a database is the sum of the contribution scores of all the specified data items it contains.
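The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `importance_score` and the representation of a database as a collection of 1-based ranks are assumptions made for the example.

```python
def importance_score(ranks, beta=1.0):
    """Sum of 1 / k**beta over the 1-based ranks k of the documents a database holds.

    `ranks` is any iterable of ranks; this representation is an assumption
    made for illustration and is not prescribed by the source text.
    """
    return sum(1.0 / k ** beta for k in ranks)


# Hypothetical database holding the documents ranked 1, 2, 3 and 4 (beta = 1).
score = importance_score({1, 2, 3, 4})
print(score)
```

Because the weight 1/k^β falls as k grows, a database holding a few top-ranked documents can outscore one holding many low-ranked documents.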
For a given retrieval query, suppose M databases contain answers to the query. The candidate database set is C = (C_1, C_2, ..., C_M), where C_i is the i-th candidate database, i = 1, 2, ..., M. n is the number of top-ranked documents taken from the merged and sorted retrieval results. C' is the set of all m databases selected, in order of relevance: C' = (C'_1, C'_2, ..., C'_m), where C'_j is the j-th selected database, j = 1, 2, ..., m, and SC'_j is the score of the j-th selected database. β is a constant parameter of the weight function and a positive rational number; β = 1 in the examples here. If document k is in database C_i, this is written formally as kC_i, and its contribution score is written
$$\frac{1}{(kC_i)^{\beta}}$$
If document k is in the selected database C'_j, this is written formally as kC'_j, and its contribution score is written
$$\frac{1}{(kC'_j)^{\beta}}$$
The importance scores of the databases are calculated as:
$$SC'_1 = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_i)^{\beta}} \tag{a}$$
$$SC'_l = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{\left(kC_i - \sum_{j=1}^{l-1} kC'_j\right)^{\beta}}, \quad 2 \le l \le m \tag{b}$$
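Formulas (a) and (b) use a shorthand in which kC_i stands for "document k is in C_i". One possible reading, offered here as an interpretation consistent with steps 1 and 2 rather than as the patent's own notation, identifies each document with its rank k and treats each database as a set of ranks. The symbol U_l (the ranks not yet covered before the l-th selection) is introduced only for this sketch.

```latex
U_l = \{1, \dots, n\} \setminus \bigcup_{j=1}^{l-1} C'_j ,
\qquad
SC'_l = \max_{C_i \notin \{C'_1, \dots, C'_{l-1}\}} \;
        \sum_{k \in C_i \cap U_l} \frac{1}{k^{\beta}} ,
\qquad 1 \le l \le m .
```

For l = 1 the union is empty, so U_1 = {1, ..., n} and the expression reduces to formula (a). For l ≥ 2, documents already covered by earlier selections drop out of the sum, which is the effect formula (b) describes.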
In the above scheme, selection proceeds through the following steps:
(1) Calculate the importance score of every database C_i that contains at least one of the n data items above, and select the database with the largest score as the first database of the set, denoted C'_1.
(2) Excluding the databases already selected, calculate the importance score of each remaining database C_i that contains at least one of the n data items, leaving out of the calculation any data items contained in the databases already selected. Select the database with the largest importance score as the 2nd selected database, denoted C'_2.
(3) Repeat step (2) until the m-th selection, C'_m. The database selection ends when the 1st to m-th selected databases together cover all n data items.
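The three selection steps can be sketched as a short greedy loop. This is a hedged illustration under stated assumptions: the function name `select_databases`, the dict-of-sets input, the tie-breaking rule (first database listed wins), and the extra stop condition when no remaining database adds anything are all choices made for the example and are not specified in the source.

```python
def select_databases(databases, n, beta=1.0):
    """Greedily order databases by the rank-weighted, not-yet-covered data they add.

    databases: dict mapping a database name to the set of ranks (1..n) it contains.
    Returns a list of (name, score) pairs in selection order.
    """
    uncovered = set(range(1, n + 1))
    remaining = dict(databases)
    selected = []
    while uncovered and remaining:
        # Score each unselected database on the still-uncovered documents only.
        scores = {
            name: sum(1.0 / k ** beta for k in docs & uncovered)
            for name, docs in remaining.items()
        }
        best = max(scores, key=scores.get)  # ties go to the first database listed
        if scores[best] == 0:
            break  # assumption: stop once no remaining database adds anything
        selected.append((best, scores[best]))
        uncovered -= remaining.pop(best)
    return selected


# Hypothetical toy input: three databases over the top 3 documents.
toy = {"A": {1, 2}, "B": {2, 3}, "C": {3}}
print(select_databases(toy, n=3))
```

Recomputing the scores in every round is what removes already covered documents from later scores, so a database that mostly duplicates an earlier selection is pushed down the order.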
Compared with the prior art, the method of the invention has the following advantages:
1. It uses set covering and a greedy algorithm rather than earlier information retrieval techniques such as digital fingerprints, word frequency statistics, or language models, which greatly reduces the computational cost of information retrieval.
2. It takes into account the overlap that actually exists between databases, which optimizes the distributed set selection result and improves retrieval efficiency and effectiveness.
3. It assigns each data item a weight according to its position in the merged and sorted retrieval results and uses that weight to calculate the item's contribution score to a database, which improves the recall and precision of distributed information retrieval.
Description of drawings
Fig. 1 is a structural diagram of the distributed information retrieval system;
Fig. 2 is a schematic flow diagram of the set selection process.
Embodiment
The invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, a distributed information retrieval system comprises client 1, client 2, ..., client n and information retrieval server 1, information retrieval server 2, ..., information retrieval server n. The client group and the information retrieval server group are connected through a network. The client group supplies the retrieval queries, and the information retrieval server group provides all the databases that contain answers to the queries. The invention provides a set selection method for information retrieval on such a system.
For the following embodiments, the symbols used and their meanings are defined first:
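Table 1. Symbols and their meanings
Symbol    Meaning
C         Set of candidate databases, C = (C_1, C_2, ..., C_M)
C_i       The i-th candidate database, i = 1, 2, ..., M
M         Number of databases containing answers to the query
n         Number of top-ranked documents taken from the merged and sorted retrieval results
C'        Set of the m selected databases, C' = (C'_1, C'_2, ..., C'_m)
C'_j      The j-th selected database, j = 1, 2, ..., m
SC'_j     Score of the j-th selected database
β         Constant parameter of the weight function, a positive rational number (β = 1 in the examples)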
If document k is in C_i, this is written formally as kC_i, and its contribution score is written
$$\frac{1}{(kC_i)^{\beta}}$$
If document k is in C'_j, this is written formally as kC'_j, and its contribution score is written
$$\frac{1}{(kC'_j)^{\beta}}$$
The importance scores of the databases are calculated as:
$$SC'_1 = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{(kC_i)^{\beta}} \tag{a}$$
$$SC'_l = \max_{i=1}^{M} \sum_{k=1}^{n} \frac{1}{\left(kC_i - \sum_{j=1}^{l-1} kC'_j\right)^{\beta}}, \quad 2 \le l \le m \tag{b}$$
Embodiment 1
As shown in Fig. 2, suppose n = 10, that is, the top 10 documents of the merged and sorted results are taken, and each document is identified by its rank. Suppose the candidate databases are as follows: C_1 contains documents 1, 2, 3, 4; C_2 contains documents 2, 3, 7, 8; C_3 contains documents 1, 5, 6, 7; C_4 contains documents 4, 5, 6, 9; C_5 contains documents 9, 10. The other databases contain none of the desired retrieval results and are not considered as candidates. The k-th document contributes 1/k to the score of each database that contains it (assuming β = 1 here). The databases are selected as follows.
Selecting the 1st database: for each database, sum the contribution scores of the specified documents it contains, then take the maximum over all databases. From the contribution score of each document k to each database,
$$SC'_1 = \max\left\{1+\tfrac12+\tfrac13+\tfrac14,\ \tfrac12+\tfrac13+\tfrac17+\tfrac18,\ 1+\tfrac15+\tfrac16+\tfrac17,\ \tfrac14+\tfrac15+\tfrac16+\tfrac19,\ \tfrac19+\tfrac1{10}\right\} = 1+\tfrac12+\tfrac13+\tfrac14$$
So the most important database, the 1st to be selected, is the one containing documents 1, 2, 3, 4, denoted C'_1 = C_1 = {1, 2, 3, 4}.
Selecting the 2nd database: excluding the already selected C_1, remove from each of the 4 remaining databases the documents that appear in the 1st selected database, then calculate the importance scores.
$$SC'_2 = \max\left\{\tfrac17+\tfrac18,\ \tfrac15+\tfrac16+\tfrac17,\ \tfrac15+\tfrac16+\tfrac19,\ \tfrac19+\tfrac1{10}\right\} = \tfrac15+\tfrac16+\tfrac17$$
So the 2nd selected database is the one containing documents 1, 5, 6, 7, denoted C'_2 = C_3 = {1, 5, 6, 7}.
Selecting the 3rd database: excluding the already selected C_1 and C_3, remove from each of the 3 remaining databases the documents that appear in the 1st and 2nd selected databases, then calculate the importance scores.
$$SC'_3 = \max\left\{\tfrac18,\ \tfrac19,\ \tfrac19+\tfrac1{10}\right\} = \tfrac19+\tfrac1{10}$$
So the 3rd selected database is the one containing documents 9, 10, denoted C'_3 = C_5 = {9, 10}.
Selecting the 4th database: excluding the already selected C_1, C_3, and C_5, remove from each of the 2 remaining databases the documents that appear in the 1st, 2nd, and 3rd selected databases, then calculate the importance scores.
$$SC'_4 = \max\left\{\tfrac18,\ 0\right\} = \tfrac18$$
So the 4th selected database is the one containing documents 2, 3, 7, 8, denoted C'_4 = C_2 = {2, 3, 7, 8}.
Selecting the 5th database: excluding the already selected C_1, C_2, C_3, and C_5, remove from the 1 remaining database the documents that appear in the 1st, 2nd, 3rd, and 4th selected databases, then calculate the importance score.
$$SC'_5 = 0$$
This shows that the remaining database contributes nothing (null), and set selection ends here. 4 databases are selected in total, so m = 4.
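The arithmetic of Embodiment 1 can be checked with exact fractions. The sketch below is an illustration only: the helper name `trace_selection` is hypothetical, β is fixed at 1 as in the embodiment, and the loop stops as soon as all 10 documents are covered, so it does not evaluate the empty 5th round described above.

```python
from fractions import Fraction

# The five candidate databases of Embodiment 1, keyed by name, as sets of ranks.
databases = {
    "C1": {1, 2, 3, 4},
    "C2": {2, 3, 7, 8},
    "C3": {1, 5, 6, 7},
    "C4": {4, 5, 6, 9},
    "C5": {9, 10},
}


def trace_selection(databases, n):
    """Return [(selected_name, {name: exact_score}), ...], one entry per round (beta = 1)."""
    uncovered = set(range(1, n + 1))
    remaining = dict(databases)
    rounds = []
    while uncovered and remaining:
        scores = {
            name: sum((Fraction(1, k) for k in docs & uncovered), Fraction(0))
            for name, docs in remaining.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # no remaining database adds any uncovered document
        rounds.append((best, scores))
        uncovered -= remaining.pop(best)
    return rounds


for l, (best, scores) in enumerate(trace_selection(databases, 10), start=1):
    print(l, best, {name: str(s) for name, s in scores.items()})
```

Run on these inputs, the trace selects C1, C3, C5, and C2 in that order, and C4 scores 0 in the last round, which agrees with m = 4 in the embodiment.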
Embodiment 2
Suppose the top 100 documents are taken, each again identified by its rank in the retrieval results. Suppose 60 databases are available for retrieval, and a random number method is used so that each of the 60 databases contains a random subset of the 100 documents. For one such random draw, the method of the invention selects an optimized combination of 24 of the 60 databases that together retrieve all 100 documents while the overlap between the selected databases reaches a minimum. The experimental results show how the 24 selected databases complement one another to cover the 100 documents. The first database selected, which is also the most important, covers documents 1, 9, 11, 33, 46, 53, 55, 62, 64, 82, 83, 86, 87, 91, 94, and 96.
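An experiment of this shape can be sketched as follows. The source does not say how the random databases were generated, so everything about the generation here is an assumption: the seed, the database sizes (5 to 20 documents each), and the names `DB1` to `DB60` are hypothetical, and the run is not expected to reproduce the 24 databases or the document list reported above.

```python
import random


def greedy_select(databases, n, beta=1.0):
    """Return database names in greedy selection order (rank-weighted coverage)."""
    uncovered = set(range(1, n + 1))
    remaining = dict(databases)
    order = []

    def gain(name):
        # Weighted value of the documents this database would newly cover.
        return sum(1.0 / k ** beta for k in remaining[name] & uncovered)

    while uncovered and remaining:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # assumption: stop once nothing new can be covered
        order.append(best)
        uncovered -= remaining.pop(best)
    return order


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n, num_db = 100, 60
databases = {
    f"DB{i}": set(rng.sample(range(1, n + 1), rng.randint(5, 20)))
    for i in range(1, num_db + 1)
}

selected = greedy_select(databases, n)
covered = set().union(*(databases[name] for name in selected))
print(len(selected), "databases selected, covering", len(covered), "documents")
```

With random contents, some of the 100 documents may appear in no database at all, which is why the sketch stops when no remaining database adds coverage rather than insisting on full coverage.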
The invention selects a suitable subset of information resources from a large, dispersed, and heterogeneous collection of information resources, so that suitable information can be found to satisfy the user's query. It reduces computational cost and improves retrieval efficiency while guaranteeing a certain level of recall and precision.

Claims (1)

1. A set selection method based on a distributed information retrieval system, characterized in that the method comprises: using set covering, calculating how well each candidate database covers the data to be retrieved, and determining the order in which databases are selected according to the degree of coverage; wherein calculating the degree to which a candidate database covers the data to be retrieved comprises calculating the importance score of each candidate database as a weighted sum over the retrieved data contained in that database, and specifically comprises:
(1) supposing that, for a given query, the first n data items of the merged and sorted retrieval results, n being a natural number, are denoted d1, d2, ..., dk, ..., dn; when the k-th data item dk appears in a database C_i, C_i being the i-th candidate database, i = 1, 2, ..., M, its contribution to the importance score of that database is 1/k^β, β being a positive rational number; the importance score of a database is the sum of the contribution scores of all the specified data items it contains;
(2) calculating the importance score of every database C_i, i = 1, 2, ..., M, that contains at least one of the above n data items, and selecting the database with the largest score as the first database of the set, denoted C'_1;
(3) excluding the databases already selected, calculating the importance score of each remaining database C_i that contains at least one of the above n data items, leaving out of the calculation the data items contained in the databases already selected, so that these items are no longer counted in the importance score of C_i; selecting the database with the largest importance score as the 2nd selected database, denoted C'_2;
(4) repeating step (3) until the m-th selection, C'_m; the database selection ends when the 1st to m-th selected databases together cover all of the above n data items.
CN2009101460707A 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system Expired - Fee Related CN101582085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101460707A CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN200810156053 2008-09-19
CN200810156053.7 2008-09-19
CN2009101460707A CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Publications (2)

Publication Number Publication Date
CN101582085A CN101582085A (en) 2009-11-18
CN101582085B true CN101582085B (en) 2011-11-16

Family

ID=41364232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101460707A Expired - Fee Related CN101582085B (en) 2008-09-19 2009-06-05 Set option method based on distributed information retrieval system

Country Status (1)

Country Link
CN (1) CN101582085B (en)


Also Published As

Publication number Publication date
CN101582085A (en) 2009-11-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111116

Termination date: 20130605