CN102609536B

CN102609536B - Resource selection method in non-cooperative environment

Info

Publication number: CN102609536B
Application number: CN 201210035195
Authority: CN
Inventors: 任祖杰; 徐向华; 万健; 张纪林; 蒋从锋; 任永坚
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2012-02-16
Filing date: 2012-02-16
Publication date: 2013-09-18
Anticipated expiration: 2032-02-16
Also published as: CN102609536A

Abstract

The invention discloses a resource selection method for a non-cooperative environment. The method comprises the following steps: first, a relevance-based resource selection method computes the relevance of each resource and ranks the resources, yielding a resource list ordered by relevance; second, coverage statistics are extracted from each resource using a fingerprint extraction technique and compressed with Bloom filters; third, the coverage statistics are stored and retrieved efficiently using a distribution strategy based on the semantics of the query keywords; fourth, the Bloom filters are compared to determine the degree of overlap between the corresponding fingerprint sets; fifth, the novelty of each resource is computed from this overlap and the candidate resources are reordered according to novelty; finally, relevance and novelty are combined by weighted computation to obtain an optimal resource list. The method takes both resource relevance and resource overlap into account during resource selection, and thereby improves query efficiency.

Description

Resource selection method in a non-cooperative environment

Technical field

The present invention relates to a resource selection method in a non-cooperative environment, and in particular to a resource selection method for a non-cooperative environment that takes both resource relevance and resource overlap into account.

Background art

Resource selection is an active research topic in the field of distributed information retrieval. For a given query Q, a distributed search engine uses a resource selection method to determine a list of resources relevant to the query, and then forwards the query to the resources in that list. With a good resource selection method, each query needs only a small number of participating resources to obtain results close to those obtained when all resources participate. The effectiveness of resource selection therefore directly determines the efficiency of query execution and the quality of the query results.

Most traditional resource selection methods focus on the relevance between a resource and the query. They usually assume that the document collections of the resources do not overlap, or that the overlap is small enough to be ignored. In a P2P search engine operating in a non-cooperative environment, however, each resource maintains its document collection independently, so a considerable number of identical or near-identical documents inevitably exist across resources. For example, well-known digital libraries such as ACM and IEEE hold many similar papers, and news sites such as NetEase and Sina carry large numbers of similar news pages.

Given this problem, if a resource selection method ignores the overlap among resource document collections, a query may be forwarded to two resources with a very high degree of overlap (for example, two mirror sites), which wastes network resources and lowers query efficiency. It is therefore necessary to develop a resource selection method that considers both the overlap and the relevance of resources.

Summary of the invention

To address the above problems, the invention discloses a resource selection method in a non-cooperative environment. When selecting resources, the method considers both resource overlap and resource relevance so as to maximize the expected total number of novel results, which improves the effectiveness of resource selection and thus the efficiency of queries.

The technical solution adopted by the invention to solve its technical problem is as follows:

A resource selection method in a non-cooperative environment takes both resource relevance and resource overlap into account during resource selection, thereby improving query efficiency. The method is implemented by the following steps:

Step 1: First, use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a resource list ordered by relevance.

Step 2: Obtain the fingerprint set of the result documents from the query results. Suppose there is a group of resources <P1, P2, ..., Pi, ..., Pn> and a node issues a query Q. After the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length numeric string that represents the title of that result document.

Step 3: Manage the coverage statistics. This process comprises three sub-processes: extracting coverage statistics from the result fingerprint sets, storing the coverage statistics, and retrieving the coverage statistics. Management involves two kinds of operations: storage and retrieval. After a group of coverage statistics has been produced, the system distributes it, according to the semantics of the query it belongs to, to the resources of the system for storage, so that the coverage statistics can be retrieved conveniently.

Step 4: Compute the novelty of each resource. Given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from this compute the novelty of each resource with respect to the query results.

Step 5: Starting from the resource relevance computed in Step 1, adjust the ranked resource list by taking novelty into account, so as to maximize the number of novel results.

Beneficial effects of the invention:

1. The invention extracts coverage statistics from query results. In subsequent queries, these coverage statistics can be used to compute the degree of overlap between resources, so that the expected total number of novel results is maximized during resource selection and the effectiveness of resource selection is improved.

2. The invention stores coverage statistics in a Chord network according to the semantic vector space of their queries, so that sets of semantically similar queries can share coverage statistics. This greatly reduces the storage space the system needs for coverage statistics, increases the hit rate of coverage statistics lookups, and addresses the problem of synonymy (multiple words with the same meaning).

3. When resources overlap, the invention wastes fewer query messages than other resource selection methods and effectively improves search efficiency.

Description of drawings

Fig. 1 shows the steps of the resource selection method of the invention in a non-cooperative environment.

Embodiment

Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. The concrete steps are shown in Fig. 1:

Step 1. Generate the initial resource list. Use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a list ordered by relevance.

Step 2. Obtain the fingerprint set of the result documents from the query results. This step comprises two sub-steps:

1) Extract the fingerprint set from the results. For each result document, a fingerprint extraction technique is used to extract a fixed-length numeric string that represents the title of that result document. Two titles with very similar content can be represented by the same fingerprint. For a given resource and query, the set of all fingerprints constitutes the coverage statistics of that resource. To better handle fingerprint extraction from short text, the invention adopts an efficient, robust fingerprinting technique that requires no global statistics: Shingle-based Discrete Cosine Transform (S-DCT). To filter out noise words, S-DCT deletes stop words and punctuation; it then generates a set of shingles from the word sequence and uses the DCT to turn each shingle into a fingerprint. Specifically, the S-DCT method comprises the following steps:

(1) Obtain the title of a result.

(2) Delete stop words and punctuation.

(3) Stem each word.

(4) Sort the remaining words in lexicographic order to generate a word sequence.

(5) Use a sliding window over the word sequence to generate a set of shingles.

(6) For each shingle, compute the hash values of the words in the shingle.

(7) Shift all hash values vertically so that their mean is 0.

(8) Normalize all hash values by the maximum hash value.

(9) Apply the DCT to all normalized hash values.

(10) Quantize each DCT coefficient to a small number of bits.

(11) Concatenate all the bits to create the fingerprint.

(12) Use the fingerprints of all shingles to represent the result.
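The twelve S-DCT steps can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the stop-word list `STOP_WORDS`, the crude `stem` helper, the MD5-based `word_hash`, the window size, the choice to normalize by the largest absolute shifted value, and the 2-bit quantizer are all hypothetical choices that the text does not specify.

```python
import hashlib
import math
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"}

def stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_hash(word):
    """Stable 16-bit hash of a word (the hash function is an assumption)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:2], "big")

def dct2(values):
    """Unnormalized DCT-II of a list of floats."""
    n = len(values)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(values))
        for k in range(n)
    ]

def shingle_fingerprint(shingle, bits_per_coeff=2):
    hashes = [float(word_hash(w)) for w in shingle]        # step 6
    mean = sum(hashes) / len(hashes)
    shifted = [h - mean for h in hashes]                   # step 7: zero mean
    peak = max(abs(v) for v in shifted) or 1.0
    normalized = [v / peak for v in shifted]               # step 8: values in [-1, 1]
    coeffs = dct2(normalized)                              # step 9
    levels = 1 << bits_per_coeff
    n = len(shingle)
    fingerprint = 0
    for c in coeffs:
        # Each coefficient lies in [-n, n]; map it to a level in [0, levels - 1].
        level = min(levels - 1, int((c + n) / (2 * n) * levels))  # step 10
        fingerprint = (fingerprint << bits_per_coeff) | level      # step 11
    return fingerprint

def sdct_fingerprints(title, window=3):
    words = re.findall(r"[a-z0-9]+", title.lower())          # steps 1-2: drop punctuation
    words = [stem(w) for w in words if w not in STOP_WORDS]  # steps 2-3
    words.sort()                                             # step 4
    if not words:
        return set()
    if len(words) <= window:
        shingles = [tuple(words)]
    else:                                                    # step 5: sliding window
        shingles = [tuple(words[i:i + window]) for i in range(len(words) - window + 1)]
    return {shingle_fingerprint(s) for s in shingles}        # step 12
```

Because stop words, punctuation, word order and simple inflections are removed before hashing, titles such as "The Bloom Filter in Networks" and "networks: bloom filter!" produce the same fingerprint set in this sketch.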

2) Compress the fingerprint set. To save bandwidth and storage space, a Bloom filter is used to store the fingerprint set. The coverage statistics of a resource are therefore represented as:

[Formula image not reproduced in source: Bloom-filter representation of the coverage statistics of a resource.]

In general, the fingerprint of a document should be generated from the entire content of the document.
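A minimal sketch of compressing a fingerprint set into a Bloom filter is given below. The class name `BloomFilter`, the helper `coverage_statistics`, the filter size, the number of hash functions and the use of SHA-256 are assumptions made for illustration; the text does not fix any of these parameters.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def coverage_statistics(fingerprints, num_bits=256, num_hashes=3):
    """Compress the fingerprint set of one resource for one query."""
    bf = BloomFilter(num_bits, num_hashes)
    for fp in fingerprints:
        bf.add(fp)
    return bf
```

A Bloom filter never reports a stored fingerprint as absent, but it may report an absent fingerprint as present; larger filters lower that false-positive rate at the cost of bandwidth and storage.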

Step 3. Manage the coverage statistics.

After a group of coverage statistics has been produced, the system distributes it in the P2P network according to the semantics of the query terms it is associated with. Coverage statistics for semantically similar queries are placed on the same resource. Correspondingly, given a query, the coverage statistics relevant to it are looked up in the semantic space using the semantic vector of the query. In this way coverage statistics are stored and retrieved efficiently, the storage overhead of the system is reduced, and its scalability is improved.

To reduce the storage overhead of the system and improve its scalability, the invention adopts a distribution strategy based on the semantics of the query keywords: latent semantic indexing (LSI) maps each query vector to its semantic vector, and the semantic vector is then mapped to an integer within the Chord ID range, which determines the resource on which the coverage statistics are placed. This process comprises three sub-processes: extracting coverage statistics from the result fingerprint sets, storing the coverage statistics, and retrieving the coverage statistics.

1) Extract coverage statistics from the result fingerprint sets. The algorithm flow is:

[Algorithm figure not reproduced in source: extraction of coverage statistics from the result fingerprint sets.]

The steps for building the latent semantic index and mapping to the semantic space are as follows:

(1) Analyze the document collection and construct the corresponding term-document matrix;

(2) Apply singular value decomposition (SVD) to the term-document matrix;

(3) Reduce the dimensionality of the decomposed matrices;

(4) Use the reduced matrices to build the latent semantic space.
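The four LSI steps can be illustrated with a small pure-Python sketch. It is a sketch under assumptions, not the method of the invention: instead of a library SVD routine it obtains the leading left singular vectors of the term-document matrix A as the leading eigenvectors of A·Aᵀ using power iteration with deflation, and it projects a query onto those vectors without scaling by the singular values. The function names `build_lsi`, `semantic_vector`, `top_eigenvectors` and `cosine` are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_eigenvectors(M, k, iters=300, seed=0):
    """Leading k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    rng = random.Random(seed)
    n = len(M)
    M = [row[:] for row in M]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = matvec(M, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        lam = sum(a * b for a, b in zip(v, matvec(M, v)))
        pairs.append((lam, v))
        for i in range(n):          # deflate: remove the direction just found
            for j in range(n):
                M[i][j] -= lam * v[i] * v[j]
    return pairs

def build_lsi(docs, k=2):
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Step 1: term-document matrix A (rows are terms, columns are documents).
    A = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w]][j] += 1.0
    # Step 2: left singular vectors of A are the eigenvectors of A * A^T.
    AAt = [[sum(a * b for a, b in zip(r1, r2)) for r2 in A] for r1 in A]
    # Steps 3-4: keep the k strongest directions as the latent semantic space.
    basis = [vec for _, vec in top_eigenvectors(AAt, k)]
    return index, basis

def semantic_vector(query, index, basis):
    """Project a query's term vector into the k-dimensional semantic space."""
    q = [0.0] * len(index)
    for w in query.split():
        if w in index:
            q[index[w]] += 1.0
    return [sum(u * x for u, x in zip(vec, q)) for vec in basis]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With a toy collection such as "bloom filter", "bloom filter hash", "chord ring" and "chord ring node", the queries "bloom hash" and "filter" share no words yet map to nearly parallel semantic vectors, which is the property that lets semantically similar queries share coverage statistics.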

2) Store the coverage statistics. The algorithm proceeds as follows:

(1) After resource A obtains the coverage statistics CV(Q) of a query Q, resource A uses latent semantic indexing to obtain the semantic vector VQ of the query.

(2) VQ is then mapped into the Chord ID space, and routing leads to resource B.

(3) Finally, CV(Q) is sent to its destination, resource B.
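The storage procedure can be sketched as below. The text does not say how the semantic vector VQ is turned into an integer in the Chord ID range, so this sketch assumes a sign-of-random-projection mapping (each bit is the sign of a dot product with a fixed random hyperplane), under which vectors separated by a small angle tend to agree on most bits. The 16-bit identifier space, the names `make_hyperplanes`, `chord_id`, `successor` and `store`, and the dictionary standing in for the ring are all hypothetical; real Chord routing uses finger tables rather than a global sorted list.

```python
import bisect
import random

M_BITS = 16  # identifier space has 2**M_BITS IDs (assumed size)

def make_hyperplanes(dim, bits=M_BITS, seed=42):
    """Fixed random hyperplanes shared by all nodes (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def chord_id(vector, hyperplanes):
    """Map a semantic vector to an integer key, one sign bit per hyperplane."""
    key = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key

def successor(node_ids, key):
    """First node ID >= key on the ring, wrapping around; node_ids is sorted."""
    i = bisect.bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]

def store(nodes, hyperplanes, query_vector, cv):
    """Send CV(Q) to the node responsible for the key of VQ; nodes maps ID -> dict."""
    key = chord_id(query_vector, hyperplanes)
    owner = successor(sorted(nodes), key)
    nodes[owner][key] = cv
    return owner
```

In this sketch resource A would call `store(nodes, planes, vq, cv_q)`, and the returned node ID plays the role of resource B.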

3) Retrieve the coverage statistics. After a resource (say A) issues a query Q, the query is converted to its semantic vector VQ, which is then mapped to a Chord ID that points to resource B. If resource B holds the coverage statistics CV(Q) corresponding to query Q, it sends CV(Q) to resource A. Otherwise, resource B looks for a query Q' similar to Q that satisfies

[Formula image not reproduced in source: similarity condition between Q and Q'.]

If such a query is found, the coverage statistics CV(Q') are returned. If no similar query is found either, a coverage-statistics lookup failure is returned, and resource A is notified that it needs to extract the coverage statistics of query Q after the results are returned.
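The lookup performed on resource B can be sketched as follows. Since the similarity condition between Q and Q' is not reproduced in the source, this sketch assumes cosine similarity between semantic vectors with a hypothetical threshold of 0.9; the function name `lookup_coverage` and the layout of the local store are likewise assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lookup_coverage(store, query, query_vector, threshold=0.9):
    """Return CV(Q), else CV(Q') of the most similar stored query, else None.

    store maps query text -> (semantic vector, coverage statistics).
    """
    if query in store:                      # exact hit: CV(Q) is available
        return store[query][1]
    best_cv, best_sim = None, threshold
    for other_vector, cv in store.values():  # fall back to a similar query Q'
        sim = cosine(query_vector, other_vector)
        if sim >= best_sim:
            best_cv, best_sim = cv, sim
    return best_cv                          # None signals a lookup failure
```

When the function returns `None`, resource A would extract the coverage statistics itself once the query results arrive and then store them for later queries.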

Step 4. Estimate the novelty of each resource.

Compare the Bloom filter of resource Pi, denoted Bf[i], with the Bloom filter of the selected set, denoted Bf[S], where S is the set of resources already selected and Bf[S] represents the document space already covered. The novelty of resource Pi is defined as

$$N[i] = \left|\{\, j : Bf[i]_j = 1 \wedge Bf[S]_j = 0 \,\}\right|,$$

that is, the number of bit positions that are set to 1 in Bf[i] and are 0 in Bf[S]. Similarly, the overlap between Bf[i] and Bf[S] is defined as

$$O[i] = \left|\{\, j : Bf[i]_j = 1 \wedge Bf[S]_j = 1 \,\}\right|,$$

that is, the number of bit positions that are set to 1 in both Bloom filters.
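The two bit counts can be computed directly with bitwise operations when each Bloom filter is held as an integer bit array. The sketch below assumes that representation; the function names `novel_docs` and `overlap_docs` are illustrative.

```python
def novel_docs(bf_i, bf_s):
    """Bits set to 1 in the candidate's filter and 0 in the covered filter."""
    return bin(bf_i & ~bf_s).count("1")

def overlap_docs(bf_i, bf_s):
    """Bits set to 1 in both the candidate's filter and the covered filter."""
    return bin(bf_i & bf_s).count("1")

# Example: the candidate sets bits 0, 2, 3, 5; the covered space sets bits 1, 2, 5.
candidate = 0b101101
covered = 0b100110
print(novel_docs(candidate, covered))    # 2 (bits 0 and 3)
print(overlap_docs(candidate, covered))  # 2 (bits 2 and 5)

# After a resource is selected, its filter is merged into the covered space.
covered |= candidate
```

Because Bloom filters can collide, these counts estimate rather than exactly measure the number of novel and overlapping documents.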

Step 5. Rank the resources by combining relevance and novelty. This step comprises three sub-processes:

(1) Use the CORI method to compute the relevance Relevance[i] of each resource to the query, and sort the resources by relevance from high to low to obtain the candidate resource list. The algorithm flow for computing the relevance Relevance[i] of all resources is:

[Algorithm figure not reproduced in source: computation of Relevance[i].]

Here s_Max is the maximum relevance score over all candidate resources. To obtain the normalized relevance score Relevance[i], the raw relevance score S[i] of each resource is divided by s_Max:

$$\mathrm{Relevance}[i] = \frac{S[i]}{s_{Max}}$$

(2) Compute the novelty Novelty[i] of each resource and reorder the candidate resources according to novelty. The computation of the resource novelty Novelty[i] is:

[Algorithm figure not reproduced in source: computation of Novelty[i].]

It calls two functions, NovelDocs() and OverlapDocs(), to compute, for each resource relative to the list of resources already selected, the number of novel documents N[i] and the number of overlapping documents O[i]; it then computes the ratio C[i] of N[i] to O[i] and finally normalizes C[i] to obtain the novelty Novelty[i].

The function NovelDocs() returns the number of bit positions that are 1 in the Bloom filter Bf[i] of resource i and 0 in the Bloom filter Bf[S] of the resources already selected. Its formula is:

$$N[i] = \left|\{\, j : Bf[i]_j = 1 \wedge Bf[S]_j = 0 \,\}\right|$$

The function OverlapDocs() returns the number of bit positions that are 1 in both Bf[i] and Bf[S]. Its formula is:

$$O[i] = \left|\{\, j : Bf[i]_j = 1 \wedge Bf[S]_j = 1 \,\}\right|$$

(3) Compute the optimal resource list. The procedure for combining relevance and novelty is:

The relevance score S[i] and the Bloom filter Bf[i] of each resource are the inputs of the algorithm. The algorithm first takes the resource with the highest relevance as the seed, then repeatedly selects the best of the remaining resources and appends it to the new resource list. The best resource is determined by a weighted combination of the relevance and novelty of each resource:

[Formula image not reproduced in source: weighted combination of relevance and novelty.]

where the weighting coefficient is a parameter in the interval [0, 1].
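The greedy selection can be sketched as follows. Several details are assumptions, because the corresponding formulas are not reproduced in the source: the combined score is taken to be the convex combination `weight * Relevance[i] + (1 - weight) * Novelty[i]`, the ratio C[i] is computed as N[i] / (O[i] + 1) to avoid division by zero, and C[i] is normalized by its maximum over the remaining candidates. Bloom filters are represented as integer bit arrays, and the name `select_resources` is hypothetical.

```python
def popcount(x):
    return bin(x).count("1")

def select_resources(scores, filters, weight=0.5, top_k=None):
    """Greedy ranking that trades relevance against novelty.

    scores[i]  -- raw relevance score S[i] of resource i (e.g. from CORI)
    filters[i] -- Bloom filter Bf[i] of resource i as an integer bit array
    weight     -- share given to relevance, in [0, 1] (assumed form)
    """
    n = len(scores)
    s_max = max(scores) or 1.0
    relevance = [s / s_max for s in scores]          # Relevance[i] = S[i] / s_Max

    seed = max(range(n), key=lambda i: relevance[i])  # most relevant resource first
    selected = [seed]
    covered = filters[seed]                           # document space covered so far
    remaining = set(range(n)) - {seed}

    while remaining and (top_k is None or len(selected) < top_k):
        ratio = {}
        for i in remaining:
            novel = popcount(filters[i] & ~covered)   # N[i]
            overlap = popcount(filters[i] & covered)  # O[i]
            ratio[i] = novel / (overlap + 1)          # C[i]; +1 is an assumption
        c_max = max(ratio.values()) or 1.0
        best = max(
            sorted(remaining),
            key=lambda i: weight * relevance[i] + (1 - weight) * ratio[i] / c_max,
        )
        selected.append(best)
        covered |= filters[best]
        remaining.remove(best)
    return selected

# Resource 1 mirrors resource 0; resource 2 is less relevant but entirely new.
scores = [0.9, 0.85, 0.4]
filters = [0b00001111, 0b00001111, 0b11110000]
print(select_resources(scores, filters, weight=0.5))  # [0, 2, 1]
print(select_resources(scores, filters, weight=1.0))  # [0, 1, 2]
```

With equal weighting the mirror resource drops behind the novel one, whereas a weight of 1.0 reduces the procedure to plain relevance ranking.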

Claims

1. A resource selection method in a non-cooperative environment, characterized in that resource relevance and resource overlap are both taken into account during resource selection, thereby improving query efficiency, the method being implemented by the following steps:

Step 1: first, use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a resource list ordered by resource relevance;

Step 2: obtain the fingerprint set of the result documents from the query results, specifically: suppose there is a group of resources <P1, P2, ..., Pi, ..., Pn> and a node issues a query Q; after the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length numeric string that represents the title of that result document;

Step 3: manage the coverage statistics, this process comprising three sub-processes: extracting coverage statistics from the fingerprint sets of the result documents, storing the coverage statistics, and retrieving the coverage statistics; the management comprises two kinds of operations: storage and retrieval; after a group of coverage statistics has been produced, the system distributes the coverage statistics, according to the semantics of the query they belong to, to the resources of the system for storage, so that the coverage statistics can be retrieved conveniently;

Step 4: compute the novelty of each resource, specifically: given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from this compute the novelty of each resource with respect to the query results;

Step 5: starting from the resource relevance computed in Step 1, adjust the resource list ordered by resource relevance by taking novelty into account, so as to maximize the number of novel results;

in Step 2, after the node receives the returned results, a fingerprint is extracted from the title of each result, that is, each result document is represented by a fixed-length numeric string, so that the results returned by each resource correspond to a fingerprint set; a Bloom filter is then used to further compress this fingerprint set, yielding the result fingerprint set of each resource Pi for query Q;

in the computation of resource novelty in Step 4, after the Bloom filters have formed the coverage statistics of query Q, the degree of overlap between the Bloom filters is compared to compute the overlap of the corresponding fingerprint sets, and finally the novelty of each resource is computed.

2. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 3, coverage statistics are extracted from the fingerprint sets obtained in Step 2 and are then distributed to the resources for storage; the distribution uses a strategy based on the semantics of the query keywords, so that the coverage statistics of semantically similar queries are grouped into the same class and stored on the same resource; correspondingly, given a query, the coverage statistics relevant to the query are retrieved using the semantic vector of the query, which locates the resource that stores the coverage statistics relevant to the query.

3. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 5, starting from the resource list ordered by resource relevance and using the novelty of each resource computed in Step 4, the relevance and novelty of each resource are combined by weighted computation to obtain the optimal resource list.