CN102609536B - Resource selection method in non-cooperative environment - Google Patents

Info

Publication number
CN102609536B
Authority
CN
China
Prior art keywords
resource
statistical information
degree
inquiry
fingerprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201210035195
Other languages
Chinese (zh)
Other versions
CN102609536A (en)
Inventor
任祖杰
徐向华
万健
张纪林
蒋从锋
任永坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN 201210035195
Publication of CN102609536A
Application granted
Publication of CN102609536B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a resource selection method for a non-cooperative environment. The method comprises the following steps: first, computing the relevance of each resource with a relevance-based resource selection method and sorting the resources by relevance to obtain a relevance-ordered resource list; second, extracting coverage statistics for each resource using a fingerprint extraction technique and compressing them with Bloom filters; third, storing and retrieving the coverage statistics efficiently using a distribution strategy based on the semantics of the query keywords; fourth, comparing the Bloom filters to determine the overlap between the corresponding fingerprint sets; fifth, computing the novelty of each resource and reordering the candidate resources according to novelty; and finally, combining relevance and novelty in a weighted computation to obtain an optimal resource list. The method takes both resource relevance and resource overlap into account during resource selection, thereby improving query efficiency.

Description

Resource selection method in a non-cooperative environment
Technical field
The present invention relates to a resource selection method in a non-cooperative environment and, in particular, to a resource selection method for a non-cooperative environment that takes both resource relevance and resource overlap into account.
Background
Resource selection is an active research topic in distributed information retrieval. Given a query Q, a distributed search engine uses a resource selection method to determine the list of resources relevant to the query and then forwards the query to the resources in that list. With a good resource selection method, each query needs to involve only a small number of resources to obtain results close to those obtained when all resources participate. The quality of resource selection therefore directly determines the efficiency of query execution and the quality of the query results.
Most traditional resource selection methods focus on the relevance between a resource and the query. They usually assume that the document collections of different resources do not overlap, or that the overlap is small enough to be ignored. In a P2P search engine operating in a non-cooperative environment, however, each resource maintains its document collection independently, so a considerable number of identical or very similar documents inevitably exist across resources. For example, well-known digital libraries such as ACM and IEEE hold many similar papers, and news sites such as NetEase and Sina carry a large number of similar news pages.
If a resource selection method ignores the overlap between resource document collections, a query may be forwarded to two resources with a very high degree of overlap (such as two mirror sites), which wastes network resources and reduces query efficiency. A resource selection method that takes both the overlap and the relevance of resources into account is therefore needed.
Summary of the invention
To address the above problem, the invention discloses a resource selection method in a non-cooperative environment. When selecting resources, the method takes both resource overlap and resource relevance into account and maximizes the expected total number of novel results, which improves the effectiveness of resource selection and thereby the efficiency of queries.
The technical solution adopted by the invention comprises the following steps:
A resource selection method in a non-cooperative environment takes both resource relevance and resource overlap into account during resource selection, thereby improving query efficiency. The method is implemented by the following steps:
Step 1: First, use a relevance-based resource selection method to compute the relevance of each resource and sort the resources, obtaining a resource list ordered by relevance.
Step 2: Obtain the fingerprint set of the result documents from the query results. Suppose there is a group of resources <P1, P2, ..., Pi, ..., Pn> and a node issues a query Q. After the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document.
Step 3: Manage the coverage statistics. This process comprises three sub-processes: extracting coverage statistics from the result fingerprint set, storing the coverage statistics, and retrieving the coverage statistics. Management involves two kinds of operations, storage and retrieval. After a group of coverage statistics is produced, the system distributes it among the resources of the system for storage according to the semantics of the query in the coverage statistics, which facilitates later retrieval.
Step 4: Compute the novelty of each resource. Given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results.
Step 5: Using the resource relevance computed in Step 1 together with the novelty, adjust the ordering of the resource list so that the number of novel results is maximized.
Beneficial effects of the invention:
1. The invention extracts coverage statistics from query results. These coverage statistics can be used in subsequent queries to compute the overlap between resources, so that resource selection maximizes the expected total number of novel results and becomes more effective.
2. The invention stores coverage statistics in a Chord network according to the semantic vector space of their queries, so that semantically similar queries can share coverage statistics. This greatly reduces the storage space the system needs for coverage statistics, increases the hit rate of coverage statistics, and addresses the problem of synonymy (different words with the same meaning).
3. When resources overlap, the invention wastes fewer query messages than other resource selection methods and effectively improves search efficiency.
Description of drawings
Fig. 1 shows the steps of the resource selection method of the invention in a non-cooperative environment.
Embodiment
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. The concrete steps are shown in Fig. 1:
Step 1. Generate the initial resource list. Use a relevance-based resource selection method to compute the relevance of each resource and sort the resources, obtaining a list ordered by relevance.
Step 2. Obtain the fingerprint set of the result documents from the query results. This comprises two sub-steps:
1) Extract the fingerprint set from the results. For each result document, a fingerprint extraction technique extracts a fixed-length string of digits that represents the title content of that result document. Two titles with very similar content can be represented by the same fingerprint. For a given resource and query, the set of all fingerprints is the coverage statistics of that resource. To better handle fingerprint extraction from short text, the invention adopts an efficient and robust fingerprinting technique that needs no global statistics, the Shingle-based Discrete Cosine Transform (S-DCT). To filter out noise words, S-DCT deletes stop words and punctuation; it then generates a group of shingles from the word sequence and uses the DCT to convert each shingle into a fingerprint. Specifically, the S-DCT method comprises the following steps:
(1) Obtain the title content of a result.
(2) Delete stop words and punctuation.
(3) Stem each word.
(4) Sort the remaining words in lexicographic order to generate a word sequence.
(5) Use a sliding window over the word sequence to generate a group of shingles.
(6) For each shingle, compute the hash values within the shingle.
(7) Shift all hash values vertically so that their mean is 0.
(8) Normalize all hash values by the maximum hash value.
(9) Apply the DCT to the normalized hash values.
(10) Quantize each DCT coefficient to a small number of bits.
(11) Concatenate all the bits to create the fingerprint.
(12) The fingerprints of all shingles are used to represent this result.
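The twelve S-DCT steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the crude `stem` helper, the MD5-based 32-bit `word_hash`, the window size of 3, the 4-bit quantization, and the choice to hash each word of a shingle separately are all hypothetical choices made only for the example.

```python
import hashlib
import math
import re

# Tiny illustrative stop-word list (assumption, not from the patent).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
HASH_MAX = 2**32 - 1  # largest value the 32-bit word hash below can take


def word_hash(word):
    """Deterministic 32-bit hash of a word (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")


def stem(word):
    """Crude suffix stripping that stands in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def dct2(values):
    """Unnormalised DCT-II of a short sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def sdct_fingerprints(title, window=3, bits=4):
    # Steps 1-4: tokenise, drop stop words and punctuation, stem, sort.
    words = [w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOP_WORDS]
    words = sorted(stem(w) for w in words)
    if not words:
        return set()
    # Step 5: sliding window over the sorted word sequence.
    shingles = [words[i:i + window] for i in range(max(1, len(words) - window + 1))]
    levels = 2 ** bits
    fingerprints = set()
    for shingle in shingles:
        hashes = [word_hash(w) for w in shingle]       # step 6
        mean = sum(hashes) / len(hashes)
        shifted = [h - mean for h in hashes]           # step 7: mean becomes 0
        normalised = [s / HASH_MAX for s in shifted]   # step 8
        coeffs = dct2(normalised)                      # step 9
        fp = 0
        for c in coeffs:
            # Step 10: map [-1, 1) onto `levels` buckets, clamping outliers.
            q = min(levels - 1, max(0, int((c + 1.0) / 2.0 * levels)))
            fp = (fp << bits) | q                      # step 11: concatenate bits
        fingerprints.add(fp)
    return fingerprints                                # step 12
```

Because the words are sorted before shingling, titles that differ only in word order, case, or punctuation yield the same fingerprint set in this sketch.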
2) Compress the fingerprint set. To save bandwidth and storage space, a Bloom filter is used to store the fingerprint set. The coverage statistics of a resource are thus represented as:
[Formula image not reproduced in the source text]
In general, the fingerprint of a document should be generated from the entire content of the document.
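As an illustration of the compression step, the sketch below stores a fingerprint set in a small Bloom filter. The filter size (1024 bits), the number of hash functions (3), and the use of seeded SHA-256 are assumptions chosen for the example; the patent does not state these parameters.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def compress_fingerprints(fingerprints, num_bits=1024, num_hashes=3):
    """Build the Bloom filter that stands for one resource's fingerprint set."""
    bf = BloomFilter(num_bits, num_hashes)
    for fp in fingerprints:
        bf.add(fp)
    return bf
```

A Bloom filter never reports a stored fingerprint as absent, but it can report an absent fingerprint as present, so overlap estimates derived from it are approximate.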
Step 3. Manage the coverage statistics.
After a group of coverage statistics is produced, the system distributes it in the P2P network according to the semantics of the query terms in the statistics. Coverage statistics belonging to semantically similar queries can be placed on the same resource. Correspondingly, given a query, the coverage statistics relevant to it are retrieved in the semantic space by means of the query's semantic vector. Coverage statistics can thus be stored and retrieved efficiently, which reduces the storage overhead of the system and improves its scalability.
To this end, the invention adopts a distribution strategy based on the semantics of the query keywords: latent semantic indexing maps each query vector to its semantic vector, and the semantic vector is then mapped to an integer value within the Chord ID range, which determines the resource on which the coverage statistics should be placed. The process comprises three sub-processes: extracting coverage statistics from the result fingerprint set, storing the coverage statistics, and retrieving the coverage statistics.
1) Extracting coverage statistics from the result fingerprint set. The algorithm flow is:
[Algorithm figure not reproduced in the source text]
The steps for building the latent semantic index and mapping to the semantic space are as follows:
(1) Analyze the document collection and construct the corresponding term-document matrix;
(2) Apply singular value decomposition (SVD) to the term-document matrix;
(3) Reduce the dimensionality of the decomposed matrix;
(4) Use the reduced matrix to build the latent semantic space.
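In the standard formulation of latent semantic indexing, these four steps correspond to a rank-k truncated SVD followed by folding the query into the reduced space. The symbols below (A for the term-document matrix, k for the number of retained dimensions, q for the query's term vector) are notation introduced here for illustration; the patent does not give the formula.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Here $U_k$ and $V_k$ hold the first k left and right singular vectors, $\Sigma_k$ is the diagonal matrix of the k largest singular values, and $\hat{q}$ plays the role of the semantic vector of a query.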
2) Storing the coverage statistics. The algorithm proceeds as follows:
(1) After resource A obtains the coverage statistics CV(Q) of a query Q, resource A uses latent semantic indexing to obtain the semantic vector VQ of the query.
(2) VQ is then mapped to the Chord ID space, and routing leads to resource B.
(3) Finally, CV(Q) is sent to its destination, resource B.
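The storing procedure can be sketched as below. The patent does not specify how a semantic vector is mapped to a Chord ID, so `semantic_vector_to_id` is a hypothetical mapping that takes the sign of each leading component as one ID bit, chosen because nearby vectors then tend to share an ID. The 8-bit ID space and the dictionary-based `stores` are likewise assumptions, and real Chord routing through finger tables is replaced by a direct successor lookup.

```python
import bisect

M_BITS = 8  # illustrative identifier space of 2**8 IDs (assumption)


def semantic_vector_to_id(vector, m_bits=M_BITS):
    """Map a semantic vector to an ID from the signs of its leading components."""
    chord_id = 0
    for i in range(m_bits):
        component = vector[i] if i < len(vector) else 0.0
        chord_id = (chord_id << 1) | (1 if component >= 0 else 0)
    return chord_id


def successor(node_ids, chord_id):
    """Return the first node ID at or clockwise after chord_id on the ring."""
    ring = sorted(node_ids)
    index = bisect.bisect_left(ring, chord_id)
    return ring[index % len(ring)]  # wrap around past the largest ID


def store_coverage(stores, vector, query, coverage):
    """Place CV(Q) on the node responsible for the semantic vector of Q."""
    node = successor(stores.keys(), semantic_vector_to_id(vector))
    stores[node][query] = (vector, coverage)
    return node
```

With nodes at IDs 10, 100, and 200, two queries whose semantic vectors have the same sign pattern are stored on the same node, which is the property the distribution strategy relies on.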
3) Retrieving the coverage statistics. When a resource (say A) issues a query Q, the query is converted to its semantic vector VQ, which is then mapped to a Chord ID that points to resource B. If resource B holds the coverage statistics CV(Q) corresponding to query Q, it sends CV(Q) to resource A. Otherwise, resource B looks for a query Q′ similar to Q that satisfies
[Formula image not reproduced in the source text: similarity condition between Q and Q′]
If such a query is found, the coverage statistics CV(Q′) are returned. If no similar query is found either, a coverage statistics lookup failure is returned, and resource A is notified that it needs to extract the coverage statistics of query Q after the results come back.
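The lookup on resource B, including the fallback to a similar query, might look like the following sketch. The similarity condition is an image in the original and is not recoverable, so cosine similarity with a threshold of 0.9 is an assumption made purely for illustration, as is the layout of `store`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_coverage(store, query, vector, threshold=0.9):
    """Look up CV(Q); fall back to the most similar stored query Q'.

    `store` maps query text -> (semantic vector, coverage statistics).
    Returns None on a miss, in which case the caller has to extract the
    coverage statistics itself once the query results arrive.
    """
    if query in store:
        return store[query][1]
    best_score, best_cv = 0.0, None
    for other_vector, cv in store.values():
        score = cosine(vector, other_vector)
        if score >= threshold and score > best_score:
            best_score, best_cv = score, cv
    return best_cv
```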
Step 4. Estimate the novelty of each resource.
Compare the Bloom filter $BF_i$ of resource $P_i$ with the Bloom filter $BF_S$, where S is the set of resources already selected and $BF_S$ represents the document space already covered. The novelty of a resource $P_i$ is defined as:
$$N[i] = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 0 \,\}\right|$$
that is, the number of bit positions that are set to 1 in $BF_i$ and are 0 in $BF_S$. Similarly, the overlap between Bloom filter $BF_S$ and Bloom filter $BF_i$ is defined as:
$$O[i] = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 1 \,\}\right|$$
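With Bloom filters held as integer bit arrays, both counts reduce to bitwise operations. The sketch below assumes that the filter of the selected set S is the bitwise OR of the filters of its members; the patent text only says that it represents the document space already covered. The function names are illustrative.

```python
def covered_filter(selected_filters):
    """Bitwise OR of the selected resources' filters (assumed form of the covered space)."""
    covered = 0
    for bf in selected_filters:
        covered |= bf
    return covered


def novel_bits(bf_resource, bf_covered):
    """Count bits that are 1 in the resource's filter and 0 in the covered filter."""
    return bin(bf_resource & ~bf_covered).count("1")


def overlap_bits(bf_resource, bf_covered):
    """Count bits that are 1 in both filters."""
    return bin(bf_resource & bf_covered).count("1")
```

For example, with a resource filter `0b101101` and a covered filter `0b100101`, one bit is novel and three bits overlap, which together account for all four set bits of the resource filter.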
Step 5. Rank the resources by combining relevance and novelty. This comprises three sub-processes:
(1) Use the CORI method to compute the relevance of each resource to the query, and sort the resources from high to low relevance to obtain the candidate resource list. The algorithm flow for computing the relevance Relevance[i] of all resources is:
[Algorithm figure not reproduced in the source text]
Here s_max is the maximum relevance score over all candidate resources. To obtain the normalized relevance score Relevance[i], the raw relevance score S[i] of each resource is divided by s_max.
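The normalization and ranking described here amount to a few lines of Python. The sketch takes the raw scores S[i] as given, since the CORI scoring itself is not reproduced in the text; the function names and the handling of an all-zero score list are assumptions.

```python
def normalise_relevance(scores):
    """Divide each raw score S[i] by s_max to obtain Relevance[i]."""
    s_max = max(scores)
    if s_max == 0:
        return [0.0 for _ in scores]  # assumption: no resource is relevant
    return [s / s_max for s in scores]


def rank_by_relevance(scores):
    """Return resource indices ordered from highest to lowest relevance."""
    relevance = normalise_relevance(scores)
    return sorted(range(len(scores)), key=lambda i: relevance[i], reverse=True)
```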
(2) Compute the novelty of each resource and readjust the order of the candidate resources according to novelty. The computation of the resource novelty Novelty[i] is:
[Algorithm figure not reproduced in the source text]
It calls two functions, NovelDocs() and OverlapDocs(), to compute, for each resource relative to the list of resources already selected, the number of novel documents N[i] and the number of overlapping documents O[i]. It then computes the ratio C[i] of N[i] to O[i] and finally normalizes C[i] to obtain the novelty Novelty[i].
Let $BF_i$ denote the Bloom filter of resource i and $BF_S$ the Bloom filter of the document space already covered. The function NovelDocs() returns the number of bit positions that are 1 in $BF_i$ and 0 in $BF_S$. Its formula is:
$$\mathrm{NovelDocs}(BF_i, BF_S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 0 \,\}\right|$$
The function OverlapDocs() returns the number of bit positions that are 1 in both $BF_i$ and $BF_S$. Its formula is:
$$\mathrm{OverlapDocs}(BF_i, BF_S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 1 \,\}\right|$$
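The computation of Novelty[i] from N[i] and O[i] could be sketched as follows. Two details are not given in the text and are assumptions here: a zero overlap count is replaced by 1 to avoid division by zero, and C[i] is normalized by dividing by the largest ratio. Bloom filters are represented as integer bit arrays.

```python
def novelty_scores(filters, covered):
    """Compute Novelty[i] for each resource filter against the covered filter."""
    ratios = []
    for bf in filters:
        n = bin(bf & ~covered).count("1")  # N[i]: bits new to the covered space
        o = bin(bf & covered).count("1")   # O[i]: bits already covered
        ratios.append(n / max(o, 1))       # C[i]; max(o, 1) is an assumed guard
    c_max = max(ratios, default=0.0)
    if c_max == 0:
        return [0.0 for _ in ratios]
    return [c / c_max for c in ratios]     # assumed normalisation by the maximum
```

With a covered filter `0b0011` and resource filters `0b1111`, `0b0011`, and `0b1100`, the ratios are 1.0, 0.0, and 2.0, so the resource that shares nothing with the covered space receives the highest novelty.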
(3) Compute the optimal resource list. The detailed procedure for combining relevance and novelty is:
The relevance score S[i] and the Bloom filter Bf[i] of each resource are the inputs of the algorithm. The algorithm first takes the resource with the highest relevance as the seed, and then repeatedly selects the best resource among the remaining ones and adds it to the new resource list. The best resource is determined by a weighted combination of the relevance and novelty of each resource:
[Formula image not reproduced in the source text]
where the weighting coefficient is a parameter in the range [0, 1].
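The greedy procedure can be sketched as below. The weighting formula is an image in the original, so the convex combination `lam * relevance + (1 - lam) * novelty` and the parameter name `lam` are assumptions; so are the stopping rule (`k` resources), the guard against a zero overlap count, and the representation of Bloom filters as integer bit arrays.

```python
def select_resources(relevance, filters, k, lam=0.5):
    """Greedily pick k resources, trading relevance against novelty.

    relevance: normalised relevance per resource (values in [0, 1]).
    filters:   Bloom filter per resource, as an integer bit array.
    lam:       assumed weight in [0, 1]; 1.0 ranks by relevance alone.
    """
    remaining = set(range(len(relevance)))
    seed = max(remaining, key=lambda i: relevance[i])  # most relevant resource
    selected, covered = [seed], filters[seed]
    remaining.discard(seed)
    while remaining and len(selected) < k:
        order = sorted(remaining)
        ratios = {}
        for i in order:
            n = bin(filters[i] & ~covered).count("1")  # novel bits
            o = bin(filters[i] & covered).count("1")   # overlapping bits
            ratios[i] = n / max(o, 1)
        c_max = max(ratios.values()) or 1.0            # avoid dividing by zero
        best = max(
            order,
            key=lambda i: lam * relevance[i] + (1 - lam) * ratios[i] / c_max,
        )
        selected.append(best)
        covered |= filters[best]                       # extend the covered space
        remaining.discard(best)
    return selected
```

In a small case with two near-mirror resources (identical filters, relevance 1.0 and 0.95) and a third, less relevant resource with a disjoint filter (relevance 0.6), `lam=0.5` selects the first and the third, whereas `lam=1.0` falls back to pure relevance ranking and selects the two mirrors.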

Claims (3)

1. A resource selection method in a non-cooperative environment, characterized in that both resource relevance and resource overlap are taken into account during resource selection, thereby improving query efficiency, the method being implemented by the following steps:
Step 1: first, use a relevance-based resource selection method to compute the relevance of each resource and sort the resources, obtaining a resource list ordered by resource relevance;
Step 2: obtain the fingerprint set of the result documents from the query results, specifically: suppose there is a group of resources <P1, P2, ..., Pi, ..., Pn> and a node issues a query Q; after the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document;
Step 3: manage the coverage statistics, this process comprising three sub-processes: extracting coverage statistics from the fingerprint set of the result documents, storing the coverage statistics, and retrieving the coverage statistics; the management comprises two kinds of operations, storage and retrieval; after a group of coverage statistics is produced, the system distributes the coverage statistics among the resources of the system for storage according to the semantics of the query in the coverage statistics, facilitating retrieval of the coverage statistics;
Step 4: compute the novelty of each resource, specifically: given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results;
Step 5: using the resource relevance computed in step 1 together with the novelty, adjust the resource list ordered by resource relevance so that the number of novel results is maximized;
In step 2, after the node receives the returned results, a fingerprint is extracted from the title content of each result, that is, each result document is represented by a fixed-length string of digits, so that the results returned by each resource correspond to a fingerprint set; a Bloom filter is then used to further compress this fingerprint set, thereby obtaining the result fingerprint set of each resource Pi for query Q;
In step 4, in the process of computing resource novelty, after the Bloom filters have formed the coverage statistics of query Q, the overlap between the corresponding fingerprint sets is computed by comparing the overlap between the Bloom filters, and finally the novelty of each resource is computed.
2. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in step 3, coverage statistics are extracted from the fingerprint set obtained in step 2 and are then distributed to the resources for storage; the distribution uses a strategy based on the semantics of the query keywords, so that the coverage statistics corresponding to semantically similar queries are grouped together and stored on the same resource; correspondingly, given a query, the coverage statistics relevant to the query are retrieved by means of the semantic vector of the query, which locates the resource storing those coverage statistics.
3. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in step 5, using the resource list ordered by resource relevance and the novelty of each resource computed in step 4, a weighted computation is performed on the relevance and novelty of each resource to obtain the optimal resource list.
CN 201210035195 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment Expired - Fee Related CN102609536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210035195 CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210035195 CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Publications (2)

Publication Number Publication Date
CN102609536A CN102609536A (en) 2012-07-25
CN102609536B true CN102609536B (en) 2013-09-18

Family

ID=46526908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210035195 Expired - Fee Related CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Country Status (1)

Country Link
CN (1) CN102609536B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164698B (en) * 2013-03-29 2016-01-27 华为技术有限公司 Text fingerprints library generating method and device, text fingerprints matching process and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809695B2 (en) * 2004-08-23 2010-10-05 Thomson Reuters Global Resources Information retrieval systems with duplicate document detection and presentation functions
CN101283357A (en) * 2005-10-11 2008-10-08 泰普有限公司 Search using changes in prevalence of content items on the web
CN101535945A (en) * 2006-04-25 2009-09-16 英孚威尔公司 Full text query and search systems and method of use
US8396873B2 (en) * 2010-03-10 2013-03-12 Emc Corporation Index searching using a bloom filter

Also Published As

Publication number Publication date
CN102609536A (en) 2012-07-25

Similar Documents

Publication Publication Date Title
CN100541495C (en) A kind of searching method of individual searching engine
Ma et al. Efficiently finding web services using a clustering semantic approach
KR102080362B1 (en) Query expansion
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
CN107391502B (en) Time interval data query method and device and index construction method and device
JP6216467B2 (en) Visual-semantic composite network and method for forming the network
CN101192235A (en) Method, system and equipment for delivering advertisement based on user feature
Cambazoglu et al. Scalability challenges in web search engines
CN104375992A (en) Address matching method and device
CN103518187A (en) Method and system for information modeling and applications thereof
CN101694670A (en) Chinese Web document online clustering method based on common substrings
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
CN102270241A (en) Image retrieving method based on sparse nonnegative matrix factorization
Elshater et al. godiscovery: Web service discovery made efficient
CN103871402A (en) Language model training system, a voice identification system and corresponding method
CN105404677A (en) Tree structure based retrieval method
CN101840438B (en) Retrieval system oriented to meta keywords of source document
CN102609536B (en) Resource selection method in non-cooperative environment
CN103559269A (en) Knowledge recommending method for mobile news subscription
CN102447737A (en) Service push method based on cloud platform
Bouhlel et al. Hypergraph learning with collaborative representation for image search reranking
KR101592670B1 (en) Apparatus for searching data using index and method for using the apparatus
Bai et al. An efficient skyline query algorithm in the distributed environment
JP2013105393A (en) Video additional information relationship learning device, method and program
CN101763441A (en) Technology organizing search results in active directory mode

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130918
