CN102609536A

CN102609536A - Resource selection method in non-cooperative environment

Info

Publication number: CN102609536A
Application number: CN2012100351954A
Authority: CN
Inventors: 任祖杰; 徐向华; 万健; 张纪林; 蒋从锋; 任永坚
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2012-02-16
Filing date: 2012-02-16
Publication date: 2012-07-25
Anticipated expiration: 2032-02-16
Also published as: CN102609536B

Abstract

The invention discloses a resource selection method for a non-cooperative environment, comprising the following steps: first, the relevance of each resource is computed with a relevance-based resource selection method and the resources are sorted, giving a resource list ordered by relevance; second, coverage statistics are extracted from each resource using fingerprint extraction and compressed with Bloom filters; third, the coverage statistics are stored and retrieved efficiently using a distribution strategy based on the semantics of the query keywords; fourth, the overlap of the corresponding fingerprint sets is obtained by comparing the Bloom filters, which gives the novelty of each resource; fifth, the candidate resources are reordered according to their novelty; and finally, relevance and novelty are combined by weighting to obtain an optimal resource list. The method takes both resource relevance and overlap into account during resource selection and thereby improves query efficiency.

Description

A resource selection method in a non-cooperative environment

Technical field

The present invention relates to a resource selection method in a non-cooperative environment, and more particularly to a resource selection method in a non-cooperative environment that takes both the relevance and the overlap of resources into account.

Background technology

Resource selection is an active research topic in distributed information retrieval. For a given query Q, a distributed search engine uses a resource selection method to determine the list of resources most relevant to the query, and then sends the query to the resources in that list. With a good resource selection method, each query needs only a small number of resources to take part in order to achieve results close to those obtained when all resources take part. The effectiveness of resource selection therefore directly determines the efficiency of query execution and the quality of the query results.

Most traditional resource selection methods focus on the relevance between resources and the query. These methods usually assume that the document sets of the resources do not overlap, or that the overlap is small enough to be ignored. However, in a P2P search engine operating in a non-cooperative environment, each resource maintains its document set independently, so a large number of identical or very similar documents inevitably exist across resources. For example, well-known digital libraries such as ACM and IEEE hold many similar papers, and news sites such as NetEase and Sina carry large numbers of similar news pages.

Given this problem, a resource selection method that ignores the overlap between resource document sets may send a query to two resources with very high overlap (for example, two mirror sites), which wastes network resources and lowers query efficiency. A resource selection method that accounts for both overlap and relevance is therefore needed.

Summary of the invention

To address the above problem, the invention discloses a resource selection method in a non-cooperative environment. When selecting resources, the method takes both resource overlap and relevance into account and maximizes the expected total number of novel results, which improves the effectiveness of resource selection and thus the efficiency of queries.

The technical solution adopted by the invention to solve its technical problem comprises the following steps:

A resource selection method in a non-cooperative environment takes both the relevance and the overlap of resources into account during resource selection so as to improve query efficiency. The method is implemented with the following steps:

Step 1: Using a relevance-based resource selection method, compute the relevance of each resource and sort the resources to obtain a resource list ranked by relevance.

Step 2: Obtain the fingerprint set of the result documents from the query results. Assume a group of resources <P1, P2, …, Pi, …, Pn> and assume that a node issues a query Q. After the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document.

Step 3: Manage the coverage statistics. This process contains three sub-processes: extracting coverage statistics from the result fingerprint sets, storing the coverage statistics, and retrieving the coverage statistics. Management comprises two kinds of operations: storage and retrieval. After a set of coverage statistics is produced, the system distributes it to the resources of the system for storage according to the semantics of the query it belongs to, which facilitates later retrieval of the coverage statistics.

Step 4: Compute the novelty of each resource. Given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results.

Step 5: Starting from the resource relevance computed in step 1, adjust the resource list using the novelty ranking so that the number of novel results is maximized.

Beneficial effects of the invention:

1. The invention extracts coverage statistics from query results. These coverage statistics can be used in subsequent queries to compute the overlap between resources, so that the expected total number of novel results is maximized during resource selection and the effectiveness of resource selection is improved.

2. The invention stores coverage statistics in a Chord network according to the semantic vector space of their queries, so that sets of semantically similar queries can share coverage statistics. This greatly reduces the storage space the system needs for coverage statistics, raises the hit rate of coverage statistics, and addresses the problem of synonymy (different words with the same meaning).

3. When overlap exists between resources, the invention reduces wasted query messages and effectively improves search efficiency compared with other resource selection methods.

Brief description of the drawings

Fig. 1 shows the steps by which the invention performs resource selection in a non-cooperative environment.

Embodiment

Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. The specific steps are shown in Fig. 1:

Step 1. Generate the initial resource list. Using a relevance-based resource selection method, compute the relevance of each resource and sort the resources to obtain a list ranked by relevance.

Step 2. Obtain the fingerprint set of the result documents from the query results. This comprises two sub-steps:

1) Extract the fingerprint set from the results. For each result document, a fingerprint extraction technique extracts a fixed-length string of digits that represents the title content of that document. Two titles with very similar content can be represented by the same fingerprint. For a given resource and query, the set of all fingerprints is the coverage statistics of that resource. To better handle fingerprint extraction from short texts, the invention uses an efficient, robust fingerprinting technique that requires no global statistics: Shingle-based Discrete Cosine Transform (S-DCT). To filter noise words, S-DCT removes stop words and punctuation marks; it then generates a set of shingles from the word sequence and converts each shingle into a fingerprint using the DCT. Specifically, the S-DCT method comprises the following steps:

(1) Obtain the title content of the result.

(2) Delete stop words and punctuation marks.

(3) Stem each word.

(4) Sort the remaining words in lexicographic order to generate a word sequence.

(5) Using a sliding window, generate a set of shingles from the word sequence.

(6) For each shingle, compute the hash values in the shingle.

(7) Shift all hash values vertically so that their mean is 0.

(8) Normalize all hash values using the maximum hash value.

(9) Apply the DCT to all normalized hash values.

(10) Quantize each DCT coefficient to a small number of bits.

(11) Concatenate all bits to create the fingerprint.

(12) The fingerprints of all shingles together represent this result.
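The twelve S-DCT steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word list, the crude `stem` helper, the MD5-based `word_hash`, the window size of 3, the 4 bits per coefficient, and the reading of step (6) as "one hash per word in the shingle" are all hypothetical choices made for the example.

```python
import hashlib
import math
import re

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def stem(word):
    """Crude suffix stripping that stands in for a real stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_hash(word):
    """Deterministic 32-bit hash of a word; the hash function is an assumption."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")

def dct2(values):
    """Unnormalized DCT-II of a list of floats."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def shingle_fingerprint(shingle, bits_per_coeff=4):
    hashes = [float(word_hash(w)) for w in shingle]           # step 6
    mean = sum(hashes) / len(hashes)
    shifted = [h - mean for h in hashes]                       # step 7: mean becomes 0
    peak = max(abs(h) for h in shifted) or 1.0
    normalized = [h / peak for h in shifted]                   # step 8: values in [-1, 1]
    coeffs = dct2(normalized)                                  # step 9
    levels = 2 ** bits_per_coeff
    bound = float(len(shingle))  # |coefficient| <= n for inputs in [-1, 1]
    fingerprint = 0
    for c in coeffs:
        q = min(levels - 1, int((c + bound) / (2 * bound) * levels))  # step 10
        fingerprint = (fingerprint << bits_per_coeff) | q              # step 11
    return fingerprint

def title_fingerprints(title, window=3):
    words = re.findall(r"[a-z0-9]+", title.lower())            # steps 1-2: drop punctuation
    words = [stem(w) for w in words if w not in STOP_WORDS]    # steps 2-3
    words.sort()                                                # step 4
    if not words:
        return set()
    if len(words) < window:
        shingles = [tuple(words)]
    else:
        shingles = [tuple(words[i:i + window])                  # step 5: sliding window
                    for i in range(len(words) - window + 1)]
    return {shingle_fingerprint(s) for s in shingles}           # step 12
```

Because the words are stemmed and sorted before shingling, two titles that differ only in word order, stop words, or punctuation produce the same fingerprint set in this sketch.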

2) Compress the fingerprint set. To save bandwidth and storage space, the fingerprint set is stored in a Bloom filter. The coverage statistics of a resource are therefore represented as:

In general, the fingerprint of a document should be generated from the entire content of the document.
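The compression in sub-step 2) can be illustrated with a small Bloom filter. The sketch below is an assumption-laden example: the filter size of 1024 bits, the three hash functions, and the salted SHA-1 scheme are hypothetical parameters, and the class name `BloomFilter` and helper `compress_fingerprints` do not come from the patent.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter whose bit array is held in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive one position per hash function by salting SHA-1 with an
        # index (an illustrative choice, not the patent's hash family).
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def compress_fingerprints(fingerprints, num_bits=1024, num_hashes=3):
    """Store a resource's fingerprint set for one query in a Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for fp in fingerprints:
        bf.add(fp)
    return bf
```

All resources would need to use the same filter size and hash functions so that their filters can later be compared bit by bit.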

Step 3. Manage the coverage statistics.

After a set of coverage statistics is produced, the system distributes it in the P2P network according to the semantics of the query terms it belongs to. Coverage statistics for semantically similar queries can be placed on the same resource. Correspondingly, looking up the coverage statistics for a particular keyword is a semantic lookup in the semantic space based on that keyword: given a query, the coverage statistics related to it are retrieved through the query's semantic vector. Coverage statistics can thus be stored and retrieved efficiently, while the storage overhead of the system is reduced and its scalability improved.

To reduce the storage overhead of the system and improve its scalability, the invention adopts a distribution strategy based on query keyword semantics. Each query vector is mapped to its semantic vector using latent semantic indexing (LSI); the semantic vector is then mapped to an integer within the Chord ID range, which determines the resource on which the coverage statistics should be placed. This process contains three sub-processes: extracting coverage statistics from the result fingerprint sets, storing the coverage statistics, and retrieving the coverage statistics.

1) Extract coverage statistics from the result fingerprint sets. The algorithm flow is:

The steps for building the latent semantic index and mapping into the semantic space are as follows:

(1) Analyze the document collection and build the term-document matrix corresponding to the document set;

(2) Perform singular value decomposition (SVD) on the term-document matrix;

(3) Reduce the dimensionality of the matrices obtained from the SVD;

(4) Build the latent semantic space from the reduced matrices.
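The four LSI steps above can be sketched in pure Python. To keep the example free of third-party dependencies, the truncated SVD is approximated here by power iteration with deflation on A·Aᵀ, which is a swapped-in technique and not necessarily how the patent computes the decomposition; a production system would more likely call a numerical library. The function names and the toy term-document matrix in the usage note are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def top_singular_pairs(term_doc, k, iterations=200):
    """Return the k largest (left singular vector, singular value) pairs of a
    term-document matrix (rows = terms, columns = documents)."""
    n_terms = len(term_doc)
    # Gram matrix A * A^T; its eigenvectors are the left singular vectors.
    gram = [[dot(term_doc[i], term_doc[j]) for j in range(n_terms)]
            for i in range(n_terms)]
    pairs = []
    for _component in range(k):
        v = [1.0 / math.sqrt(n_terms)] * n_terms
        for _step in range(iterations):
            w = matvec(gram, v)
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, matvec(gram, v))
        pairs.append((v, math.sqrt(max(eigenvalue, 0.0))))
        # Deflation: remove the component just found (dimensionality
        # reduction keeps only the k components collected in `pairs`).
        for i in range(n_terms):
            for j in range(n_terms):
                gram[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

def to_semantic_vector(query_vector, pairs):
    """Fold a term-space query vector into the k-dimensional semantic space."""
    return [dot(u, query_vector) / sigma if sigma > 0 else 0.0
            for u, sigma in pairs]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0
```

As a usage illustration, take the terms (car, automobile, engine, flower, petal) and three documents "car engine", "automobile engine", and "flower petal". With k = 2, the one-word queries "car" and "automobile" fold into the same direction of the semantic space even though they never co-occur, which is the synonymy behavior the description relies on.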

2) Storing the coverage statistics. The algorithm proceeds as follows:

(1) After resource A obtains the coverage statistics CV(Q) for query Q, resource A uses the latent semantic index to obtain the semantic vector VQ of the query.

(2) VQ is then mapped into the Chord ID space, and routing leads to resource B.

(3) Finally, CV(Q) is sent to its destination, resource B.
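The storage steps above can be sketched as a lookup on a sorted ring of node identifiers. The patent does not spell out how a semantic vector becomes a Chord ID, so the quantize-then-hash mapping below, the 16-bit identifier space, and the names `semantic_vector_to_chord_id` and `ChordRing` are assumptions for illustration only. Real Chord routing uses finger tables; a sorted list with binary search stands in for it here.

```python
import bisect
import hashlib

M_BITS = 16  # assumed size of the Chord identifier space: IDs in [0, 2**16)

def semantic_vector_to_chord_id(vector, step=0.25):
    """Map a semantic vector VQ to an integer Chord ID.

    Components are quantized to a grid so that nearby vectors usually share
    a key. This scheme is an assumption, not the patent's mapping.
    """
    cells = tuple(int(round(x / step)) for x in vector)
    digest = hashlib.sha1(repr(cells).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M_BITS)

class ChordRing:
    """Simplified ring: each key is stored on its successor node."""

    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)
        self.storage = {node_id: {} for node_id in self.node_ids}

    def successor(self, key):
        # First node whose ID is >= key, wrapping around the ring.
        index = bisect.bisect_left(self.node_ids, key)
        return self.node_ids[index % len(self.node_ids)]

    def store(self, query, semantic_vector, coverage):
        """Send CV(Q) to the resource responsible for VQ; return that node."""
        node = self.successor(semantic_vector_to_chord_id(semantic_vector))
        self.storage[node][query] = coverage
        return node
```

With this mapping, two queries whose semantic vectors fall into the same grid cell are routed to the same resource, which is what allows similar queries to share coverage statistics.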

3) Retrieving the coverage statistics. When a resource (say A) issues a query Q, the query is converted to its semantic vector VQ, which is then mapped to a Chord ID that points to resource B. If resource B holds the coverage statistics CV(Q) for query Q, it sends CV(Q) to resource A. If not, resource B checks whether it holds a query Q' that is similar to Q, i.e., one that meets the similarity condition. If such a query is found, B returns the coverage statistics CV(Q'); if no similar query is found, B returns a coverage-statistics lookup failure and notifies resource A that it needs to extract the coverage statistics for query Q after the results are returned.
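The lookup on resource B can be sketched as an exact match followed by a similarity fallback. The similarity condition is not reproduced in this text, so the cosine measure and the threshold of 0.9 used below are purely hypothetical stand-ins, as are the function names and the shape of the `store` dictionary.

```python
import math

SIMILARITY_THRESHOLD = 0.9  # hypothetical; the source's condition is not given here

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup_coverage(store, query, query_vector, threshold=SIMILARITY_THRESHOLD):
    """Look up coverage statistics on one resource.

    `store` maps query text -> (semantic_vector, coverage).
    Returns (status, coverage), where status is "exact", "similar" or "miss".
    """
    if query in store:
        return "exact", store[query][1]
    best_score, best_coverage, found = threshold, None, False
    for other_vector, coverage in store.values():
        score = cosine(query_vector, other_vector)
        if score >= best_score:
            best_score, best_coverage, found = score, coverage, True
    if found:
        return "similar", best_coverage
    # On a miss the requester must extract CV(Q) itself once results arrive.
    return "miss", None
```

A "miss" here corresponds to the failure notification in the text: resource A would then build the coverage statistics from the returned results and store them for later queries.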

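Step 4. Estimate the novelty of each resource.

Compare the Bloom filter $BF_i$ of resource $P_i$ with the Bloom filter $BF_S$, where $S$ is the set of resources already selected and $BF_S$ represents the document space already covered. The novelty of a resource $P_i$ is defined as:

$$\mathrm{novelty}(P_i, S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 0 \,\}\right|$$

i.e., the number of bit positions that are set to 1 in Bloom filter $BF_i$ and are 0 in Bloom filter $BF_S$. Similarly, the overlap of Bloom filter $BF_i$ and Bloom filter $BF_S$ is defined as:

$$\mathrm{overlap}(P_i, S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 1 \,\}\right|$$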
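The bit-level comparison in step 4 can be sketched with Python integers used as bit arrays. One assumption is made loudly here: the filter of the selected set S is formed as the bitwise OR of the selected resources' filters, which presumes all filters share the same size and hash functions. The function names are hypothetical.

```python
def union_filter(filters):
    """Combine the Bloom filters of the selected resources (assumed: bitwise OR)."""
    combined = 0
    for bits in filters:
        combined |= bits
    return combined

def novelty(bf_i, bf_s):
    """Count bit positions that are 1 in bf_i and 0 in bf_s."""
    return bin(bf_i & ~bf_s).count("1")

def overlap(bf_i, bf_s):
    """Count bit positions that are 1 in both bf_i and bf_s."""
    return bin(bf_i & bf_s).count("1")
```

For example, with selected filters `0b0011` and `0b0100` the covered space is `0b0111`; a candidate `0b1110` then has one novel bit and two overlapping bits.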

Step 5. Rank the resources by combining relevance and novelty. This comprises three sub-processes:

(1) Using the CORI method, compute the relevance of each resource to the query and sort the resources from high to low relevance to obtain a candidate resource list. The algorithm flow for computing the relevance relevance[i] of all resources is:

Here s_max is the maximum relevance score among all candidate resources. To obtain the normalized relevance score relevance[i], the raw relevance score s[i] of each resource is divided by s_max.
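The normalization just described can be sketched as follows. The raw scores are taken as given (the CORI scoring itself is not implemented here), and the function names and the handling of empty or all-zero inputs are assumptions.

```python
def normalize_relevance(raw_scores):
    """Divide each raw score s[i] by s_max to obtain relevance[i]."""
    if not raw_scores:
        return []
    s_max = max(raw_scores)
    if s_max <= 0:
        # Assumed fallback: no resource has a positive score.
        return [0.0 for _ in raw_scores]
    return [s / s_max for s in raw_scores]

def rank_by_relevance(raw_scores):
    """Return (resource indices sorted by descending relevance, relevance list)."""
    relevance = normalize_relevance(raw_scores)
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return order, relevance
```

After normalization the most relevant resource always has relevance 1, which puts relevance on a scale comparable to a novelty score normalized the same way.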

(2) Compute the novelty of each resource and reorder the candidate resources according to novelty. The resource novelty novelty[i] is computed as follows:

It calls the two functions novelDocs() and overlapDocs() to compute, for each resource relative to the list of already selected resources, the number of novel documents n[i] and the number of overlapping documents o[i] respectively; it then computes the ratio c[i] of n[i] to o[i], and finally normalizes c[i] to obtain the novelty novelty[i].

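Here, writing $BF_i$ for the Bloom filter of resource $i$ and $BF_S$ for the Bloom filter of the already selected resources, the function novelDocs() returns the number of bits that are 1 in $BF_i$ and 0 in $BF_S$. Specifically, its formula is:

$$\mathrm{novelDocs}(BF_i, BF_S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 0 \,\}\right|$$

The function overlapDocs() returns the total number of bit positions that are 1 in both $BF_i$ and $BF_S$; its formula is:

$$\mathrm{overlapDocs}(BF_i, BF_S) = \left|\{\, k : BF_i[k] = 1 \wedge BF_S[k] = 1 \,\}\right|$$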
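The computation of novelty[i] from n[i], o[i], and c[i] can be sketched as below. Two details are assumptions because the text does not fix them: a zero overlap count is replaced by 1 to avoid division by zero, and c[i] is normalized by dividing by its maximum, mirroring the relevance normalization. The Python names `novel_docs`, `overlap_docs`, and `novelty_scores` are illustrative counterparts of novelDocs() and overlapDocs().

```python
def novel_docs(bf_i, bf_s):
    """Bits that are 1 in bf_i and 0 in bf_s (counterpart of novelDocs())."""
    return bin(bf_i & ~bf_s).count("1")

def overlap_docs(bf_i, bf_s):
    """Bits that are 1 in both filters (counterpart of overlapDocs())."""
    return bin(bf_i & bf_s).count("1")

def novelty_scores(filters, bf_selected):
    """Return novelty[i] in [0, 1] for each candidate Bloom filter."""
    ratios = []
    for bf_i in filters:
        n_i = novel_docs(bf_i, bf_selected)
        o_i = overlap_docs(bf_i, bf_selected)
        # Assumption: treat a zero overlap as 1 so the ratio stays finite.
        ratios.append(n_i / max(o_i, 1))
    c_max = max(ratios, default=0.0)
    if c_max == 0:
        return [0.0 for _ in ratios]
    # Assumption: normalize c[i] by its maximum.
    return [c / c_max for c in ratios]
```

In this sketch a mirror of the already selected resources receives novelty 0, while a resource whose filter shares no bits with the covered space receives the highest score.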

(3) Compute the optimal resource list. The detailed procedure for combining relevance and novelty is:

The relevance score s[i] of each resource and its Bloom filter bf[i] are the inputs of the algorithm. The algorithm first takes the resource with the highest relevance as the seed, and then repeatedly selects the best resource from the remaining resources and adds it to the new resource list. The best resource is determined by a weighted combination of the relevance and the novelty of each resource:

where the weighting parameter takes a value in the interval [0, 1].
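The greedy procedure can be sketched as follows. The weighting formula is not reproduced in this text, so the linear combination `weight * relevance + (1 - weight) * novelty` used below is an assumption, as are the parameter name `weight`, the cut-off `k`, the bitwise-OR update of the covered space, and the zero-overlap handling.

```python
def popcount(x):
    return bin(x).count("1")

def select_resources(relevance, filters, k, weight=0.5):
    """Greedily pick up to k resources.

    relevance: normalized relevance per resource, in [0, 1].
    filters:   Bloom filter per resource, as an int bit array.
    weight:    assumed weight on relevance; (1 - weight) goes to novelty.
    """
    if not relevance or k <= 0:
        return []
    remaining = list(range(len(relevance)))
    seed = max(remaining, key=lambda i: relevance[i])   # highest relevance first
    selected = [seed]
    remaining.remove(seed)
    covered = filters[seed]
    while remaining and len(selected) < k:
        ratios = {}
        for i in remaining:
            n_i = popcount(filters[i] & ~covered)       # novel bits
            o_i = popcount(filters[i] & covered)        # overlapping bits
            ratios[i] = n_i / max(o_i, 1)               # assumed zero-overlap handling
        c_max = max(ratios.values()) or 1.0
        best = max(
            remaining,
            key=lambda i: weight * relevance[i] + (1 - weight) * ratios[i] / c_max,
        )
        selected.append(best)
        remaining.remove(best)
        covered |= filters[best]                        # assumed: OR into covered space
    return selected
```

With `weight=1.0` the sketch reduces to pure relevance ranking; lowering the weight lets a less relevant but non-overlapping resource displace a mirror of one already chosen.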

Claims

1. A resource selection method in a non-cooperative environment, characterized in that both the relevance and the overlap of resources are taken into account during resource selection so as to improve query efficiency, the method being implemented with the following steps:

Step 1: Using a relevance-based resource selection method, compute the relevance of each resource and sort the resources to obtain a resource list ranked by relevance;

Step 2: Obtain the fingerprint set of the result documents from the query results; assume a group of resources <P1, P2, …, Pi, …, Pn> and assume that a node issues a query Q; after the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document;

Step 3: Manage the coverage statistics; this process contains three sub-processes: extracting coverage statistics from the result fingerprint sets, storing the coverage statistics, and retrieving the coverage statistics; the management comprises two kinds of operations: storage and retrieval; after a set of coverage statistics is produced, the system distributes it to the resources of the system for storage according to the semantics of the query it belongs to, which facilitates retrieval of the coverage statistics;

Step 4: Compute the novelty of each resource; given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results;

2. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in step 2, after the node receives the returned results, a fingerprint is extracted from the title content of each result, i.e., each result document is represented by a fixed-length string of digits, so that the results returned by each resource correspond to one fingerprint set; the fingerprint set is then further compressed using a Bloom filter, yielding the result fingerprint set of each resource Pi for query Q.

3. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in step 3, coverage statistics are extracted from the fingerprint sets obtained in step 2 and then distributed to the resources for storage; the distribution uses a strategy based on query keyword semantics, so that the coverage statistics of semantically similar queries are grouped into one class and stored on the same resource; correspondingly, given a query, the coverage statistics related to it are retrieved through the query's semantic vector, so that the resource storing the related coverage statistics is found quickly, the storage overhead of the system is reduced, and the scalability of the system is improved.

4. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: when computing resource novelty in step 4, after the Bloom filters form the coverage statistics of query Q, the overlap of the corresponding fingerprint sets is computed by comparing the Bloom filters, and the novelty of each resource is finally obtained.

5. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in step 5, using the resource list ranked by relevance, the novelty of each resource is computed, and the relevance and novelty of each resource are weighted to obtain the optimal resource list.