CN102609536B - Resource selection method in non-cooperative environment - Google Patents
- Publication number
- CN102609536B (application CN 201210035195)
- Authority
- CN
- China
- Prior art keywords
- resource
- statistical information
- degree
- inquiry
- fingerprint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a resource selection method for a non-cooperative environment. The method comprises the following steps: first, the relevance of each resource is computed with a relevance-based resource selection method and the resources are ranked, yielding a resource list ordered by relevance; second, coverage statistics are extracted from each resource using a fingerprint extraction technique and compressed with Bloom filters; third, the coverage statistics are stored and retrieved efficiently using a distribution strategy based on the semantics of the query keywords; fourth, the Bloom filters are compared to determine the degree of overlap between the corresponding fingerprint sets, from which the novelty of each resource is obtained; fifth, the candidate resources are reordered according to their novelty; finally, relevance and novelty are combined by weighting to obtain an optimal resource list. The method takes both resource relevance and degree of overlap into account during resource selection and thereby improves query efficiency.
Description
Technical field
The present invention relates to a resource selection method in a non-cooperative environment, and in particular to a resource selection method in a non-cooperative environment that takes both resource relevance and degree of overlap into account.
Background art
Resource selection is an active research topic in distributed information retrieval. For a given query Q, a distributed search engine uses a resource selection method to determine a list of resources relevant to the query, and then forwards the query to the resources in that list. With a good resource selection method, only a small number of resources need to take part in each query to achieve results close to those obtained when all resources take part. The effectiveness of resource selection therefore directly determines the efficiency of query execution and the quality of the query results.
Most traditional resource selection methods focus on the relevance between a resource and the query. They usually assume that the document collections of the resources do not overlap, or that the overlap is small enough to ignore. In a P2P search engine operating in a non-cooperative environment, however, each resource maintains its document collection independently, so a considerable number of identical or nearly identical documents inevitably exist across resources. For example, well-known digital libraries such as ACM and IEEE hold many similar papers, and news sites such as NetEase and Sina carry a large number of similar news pages.
If a resource selection method ignores the overlap between document collections, a query may be forwarded to two resources with a very high degree of overlap (for example, two mirror sites), which wastes network resources and reduces query efficiency. A resource selection method that takes both overlap and relevance into account is therefore needed.
Summary of the invention
To address the above problem, the invention discloses a resource selection method for a non-cooperative environment. When selecting resources, the method takes both resource overlap and relevance into account and maximizes the expected total number of novel results, which improves the effectiveness of resource selection and hence the efficiency of queries.
The technical solution adopted by the invention to solve this technical problem is as follows:
A resource selection method in a non-cooperative environment takes both resource relevance and degree of overlap into account during resource selection, thereby improving query efficiency. The method comprises the following steps:
Step 1: Use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a resource list ordered by relevance.
Step 2: Obtain the fingerprint set of the result documents from the query results. Assume a group of resources <P1, P2, ..., Pi, ..., Pn> and a node that issues a query Q. After the node receives the returned results, a fingerprint extraction technique is applied to each result document to derive a fixed-length number that represents the title of that document.
Step 3: Manage the coverage statistics. This comprises three sub-processes: extracting coverage statistics from the fingerprint result set, storing coverage statistics, and retrieving coverage statistics. Management involves two kinds of operation: storage and retrieval. After a group of coverage statistics is produced, the system distributes it across the resources of the system according to the semantics of the query it belongs to, so that the coverage statistics are easy to retrieve.
Step 4: Compute the novelty of each resource. Given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results.
Step 5: Starting from the resource relevance computed in Step 1, adjust the resource ranking using novelty so that the number of novel results is maximized.
Beneficial effects of the invention:
1. The invention extracts coverage statistics from query results. In subsequent queries these statistics are used to compute the degree of overlap between resources, so that the expected total number of novel results is maximized during resource selection and the effectiveness of resource selection improves.
2. The invention stores coverage statistics in a Chord network according to the semantic vector space of their queries, so that sets of semantically similar queries can share coverage statistics. This greatly reduces the storage space the system needs for coverage statistics, increases the hit rate of coverage statistics, and addresses the problem of several words sharing one meaning (synonymy).
3. When resources overlap, the invention wastes fewer query messages than other resource selection methods and effectively improves query efficiency.
Description of drawings
Fig. 1 shows the steps of the resource selection method of the invention in a non-cooperative environment.
Embodiment
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. The concrete steps are shown in Fig. 1:
Step 1. Generate the initial resource list. Use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a list ordered by relevance.
Step 2. Obtain the fingerprint set of the result documents from the query results. This comprises two sub-steps:
1) Extract the fingerprint set from the results. For each result document, a fingerprint extraction technique derives a fixed-length number that represents the title of the document. Two titles with very similar content can be represented by the same fingerprint. For a given resource and query, the set of all fingerprints constitutes the coverage statistics of that resource. To better handle fingerprint extraction from short text, the invention adopts an efficient, robust fingerprinting technique that requires no global statistics: Shingle-based Discrete Cosine Transform (S-DCT). To filter out noise words, S-DCT deletes stop words and punctuation; it then generates a group of shingles from the word sequence and uses the DCT to convert the shingles into a fingerprint. Specifically, the S-DCT method comprises the following steps:
(1) Obtain the title of a result.
(2) Delete stop words and punctuation.
(3) Stem each word.
(4) Sort the remaining words lexicographically to produce a word sequence.
(5) Use a sliding window over the word sequence to generate a group of shingles.
(6) For each shingle, compute its hash value.
(7) Shift all hash values vertically so that their mean is 0.
(8) Normalize all hash values by the maximum hash value.
(9) Apply the DCT to all normalized hash values.
(10) Quantize each DCT coefficient to a small number of bits.
(11) Concatenate all the bits to create the fingerprint.
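The eleven S-DCT steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the suffix-stripping `stem` helper, the SHA-256 based shingle hash, the window size, and the quantization parameters (`num_coeffs`, `bits_per_coeff`) are all hypothetical choices, since the text does not specify them.

```python
import hashlib
import math
import re

# Assumed, deliberately tiny stop-word list; the source does not give one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to", "is"}
HASH_MAX = 2**32 - 1  # maximum value of the 32-bit shingle hash used below


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def shingle_hash(shingle):
    """Step (6): hash a shingle to a 32-bit integer."""
    digest = hashlib.sha256(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def dct(values):
    """Unnormalized DCT-II of a list of floats."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]


def sdct_fingerprint(title, window=2, num_coeffs=8, bits_per_coeff=4):
    # Steps (1)-(2): take the title, drop punctuation and stop words.
    words = re.findall(r"[a-z0-9]+", title.lower())
    # Steps (3)-(4): stem each word, then sort lexicographically.
    words = sorted(stem(w) for w in words if w not in STOP_WORDS)
    if not words:
        return 0
    # Step (5): sliding window over the word sequence.
    count = max(1, len(words) - window + 1)
    shingles = [tuple(words[i:i + window]) for i in range(count)]
    hashes = [shingle_hash(s) for s in shingles]
    # Steps (7)-(8): shift to zero mean, then normalize by the hash maximum.
    mean = sum(hashes) / len(hashes)
    normalized = [(h - mean) / HASH_MAX for h in hashes]
    # Step (9): DCT, keeping (or zero-padding to) a fixed number of coefficients.
    coeffs = dct(normalized)[:num_coeffs]
    coeffs += [0.0] * (num_coeffs - len(coeffs))
    # Steps (10)-(11): quantize each coefficient and concatenate the bits.
    levels = 2 ** bits_per_coeff
    fingerprint = 0
    for c in coeffs:
        scaled = (c / len(hashes) + 1) / 2  # map [-1, 1] to [0, 1]
        q = min(levels - 1, max(0, int(scaled * levels)))
        fingerprint = (fingerprint << bits_per_coeff) | q
    return fingerprint
```

Because the words are sorted before shingling, two titles that differ only in word order, stop words, or punctuation produce the same fingerprint in this sketch.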
2) Compress the fingerprint set. To save bandwidth and storage space, a Bloom filter is used to store the fingerprint set, so the coverage statistics of a resource are represented by the Bloom filter built over its fingerprint set.
In general, the fingerprint of a document should be generated from the full content of the document.
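As a rough illustration of the compression step, the sketch below stores a fingerprint set in a Bloom filter held in a single Python integer. The filter size, the number of hash functions, and the SHA-256 based position function are assumptions made for the example; the text does not state the parameters used.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; `bits` is a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # assumed size
        self.num_hashes = num_hashes  # assumed number of hash functions
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def coverage_statistics(fingerprints, num_bits=1024, num_hashes=3):
    """Build the Bloom filter that stands for one resource's fingerprint set."""
    bf = BloomFilter(num_bits, num_hashes)
    for fp in fingerprints:
        bf.add(fp)
    return bf
```

All resources must use the same `num_bits` and hash functions, otherwise their filters cannot be compared bit by bit in the later novelty computation.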
Step 3. Manage the coverage statistics.
After a group of coverage statistics is produced, the system distributes it in the P2P network according to the semantics of the query terms in the statistics. Coverage statistics for semantically similar queries can thus be placed on the same resource. Correspondingly, given a query, the coverage statistics related to it are retrieved through the semantic vector of the query. Coverage statistics can therefore be stored and retrieved efficiently, which reduces the storage overhead of the system and improves its scalability.
To reduce storage overhead and improve scalability, the invention adopts a distribution strategy based on the semantics of query keywords: latent semantic indexing (LSI) maps each query vector to its semantic vector, and the semantic vector is then mapped to an integer within the Chord ID range, which determines the resource on which the coverage statistics are placed. This comprises three sub-processes: extracting coverage statistics from the fingerprint result set, storing coverage statistics, and retrieving coverage statistics.
1) Extract coverage statistics from the fingerprint result set. The latent semantic index is built, and queries are mapped to the semantic space, as follows:
(1) Analyze the document collection and build the corresponding term-document matrix.
(2) Apply singular value decomposition (SVD) to the term-document matrix.
(3) Reduce the dimensionality of the matrices obtained from the SVD.
(4) Use the reduced matrices to build the latent semantic space.
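For orientation, the four steps above correspond to the standard latent semantic indexing formulation, written out below. The symbols are assumptions introduced here, since the text gives no formula: $A$ is the term-document matrix, $k$ is the number of retained dimensions, $q$ is a query's term vector, and $\hat{q}$ is its semantic vector (the vector the text later calls VQ).

```latex
% Steps (2)-(3): truncated singular value decomposition of the term-document matrix
A = U \Sigma V^{T} \;\approx\; A_k = U_k \Sigma_k V_k^{T}

% Step (4): fold a query term vector q into the k-dimensional latent semantic space
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Queries that use different but related terms land close together in this $k$-dimensional space, which is what allows semantically similar queries to share coverage statistics.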
2) Store the coverage statistics. The algorithm proceeds as follows:
(1) After resource A obtains the coverage statistics CV(Q) of a query Q, resource A uses latent semantic indexing to obtain the semantic vector VQ of the query.
(2) VQ is then mapped to the Chord ID space, and routing leads to resource B.
(3) Finally, CV(Q) is sent to its destination, resource B.
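The storage steps can be sketched as below. The text does not say how a semantic vector is turned into a Chord ID, so `semantic_vector_to_chord_id` is a hypothetical mapping chosen for illustration: it quantizes each component (assumed to lie in [-1, 1]) and concatenates the bits, so that nearby vectors share leading bits. The 16-bit ID space and the in-memory `nodes` dictionary are likewise assumptions; only the successor rule (a key belongs to the first node whose ID is greater than or equal to the key, wrapping around the ring) is standard Chord behaviour.

```python
import bisect

CHORD_BITS = 16  # assumed size of the Chord identifier space


def semantic_vector_to_chord_id(vector, bits=CHORD_BITS):
    """Hypothetical mapping from a semantic vector to an integer Chord ID."""
    bits_per_dim = bits // len(vector)  # requires len(vector) <= bits
    levels = 2 ** bits_per_dim
    chord_id = 0
    for x in vector:
        x = max(-1.0, min(1.0, x))  # clamp to the assumed range
        q = min(levels - 1, int((x + 1) / 2 * levels))
        chord_id = (chord_id << bits_per_dim) | q
    return chord_id


def successor(sorted_node_ids, key):
    """First node ID >= key, wrapping around the ring."""
    idx = bisect.bisect_left(sorted_node_ids, key)
    return sorted_node_ids[idx % len(sorted_node_ids)]


def store_coverage(nodes, query, vq, cv):
    """Send CV(Q) to the resource responsible for the ID derived from VQ.

    `nodes` maps a node ID to that node's local store (a dict from query
    to a (semantic vector, coverage statistics) pair).
    """
    target = successor(sorted(nodes), semantic_vector_to_chord_id(vq))
    nodes[target][query] = (vq, cv)
    return target
```

A real deployment would route the message through Chord finger tables rather than look the successor up in a local list; the list is used here only to keep the example self-contained.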
3) Retrieve the coverage statistics. When a resource (say A) issues a query Q, the query is converted to its semantic vector VQ, which is then mapped to a Chord ID that points to resource B. If resource B holds the coverage statistics CV(Q) for query Q, it sends CV(Q) to resource A. Otherwise, resource B checks whether it holds a query Q' that is sufficiently similar to Q. If such a query is found, resource B returns the coverage statistics CV(Q'). If no similar query is found, resource B reports that the coverage statistics lookup failed and notifies resource A that it needs to extract the coverage statistics of query Q once the results have been returned.
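The lookup logic at resource B can be sketched as follows. The similarity condition between Q and Q' is not recoverable from the text, so cosine similarity between semantic vectors with a fixed `threshold` is an assumption made purely for illustration, as is the layout of `store`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_coverage(store, query, vq, threshold=0.9):
    """Look up coverage statistics at the resource responsible for `vq`.

    `store` maps a query string to a (semantic vector, coverage statistics)
    pair. Returns CV(Q) on an exact hit, CV(Q') for the most similar stored
    query above the assumed threshold, or None to signal failure.
    """
    if query in store:
        return store[query][1]
    best_cv, best_sim = None, threshold
    for other_vector, other_cv in store.values():
        sim = cosine(vq, other_vector)
        if sim >= best_sim:
            best_cv, best_sim = other_cv, sim
    return best_cv
```

A `None` result corresponds to the failure case in the text: the requesting resource runs the query without coverage statistics and extracts them itself once the results arrive.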
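Step 4. Estimate the novelty of each resource.
Compare the Bloom filter $Bf[i]$ of resource $P_i$ with the Bloom filter $Bf[S]$, where $S$ is the set of resources already selected and $Bf[S]$ represents the document space already covered. The novelty of resource $P_i$ is defined as
$$\mathrm{Novelty}(P_i \mid S) = \left|\{\, k : Bf[i]_k = 1 \wedge Bf[S]_k = 0 \,\}\right|$$
that is, the number of bit positions that are set to 1 in $Bf[i]$ and are 0 in $Bf[S]$. Similarly, the degree of overlap between $Bf[i]$ and $Bf[S]$ is defined as the number of bit positions that are 1 in both:
$$\mathrm{Overlap}(P_i, S) = \left|\{\, k : Bf[i]_k = 1 \wedge Bf[S]_k = 1 \,\}\right|$$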
Step 5. Rank the resources by combining relevance and novelty. This comprises three sub-processes:
(1) Use the CORI method to compute the relevance of each resource to the query, and sort the resources by relevance from high to low to obtain the candidate resource list. The normalized relevance Relevance[i] of each resource is obtained by dividing its raw relevance score S[i] by $s_{max}$, the maximum relevance score over all candidate resources:
$$Relevance[i] = \frac{S[i]}{s_{max}}$$
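The normalization by the maximum score is a one-liner; the sketch below adds a guard for the case where every raw score is zero, which is an assumption since the text does not cover it. The raw scores are taken as given, so the CORI scoring itself is not reproduced here.

```python
def normalize_relevance(raw_scores):
    """Divide each raw relevance score S[i] by the maximum score s_max."""
    s_max = max(raw_scores)
    if s_max == 0:
        # Assumed fallback: no resource is relevant, so all scores stay 0.
        return [0.0 for _ in raw_scores]
    return [s / s_max for s in raw_scores]
```

After normalization the most relevant resource has a score of 1.0, which puts relevance on the same scale as the normalized novelty used in the final weighting.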
(2) Compute the novelty of each resource and readjust the order of the candidate resources according to novelty. The novelty Novelty[i] of a resource is computed by calling two functions, NovelDocs() and OverlapDocs(), which respectively compute, for each resource, the number of novel documents N[i] and the number of overlapping documents O[i] relative to the list of resources already selected. The ratio C[i] of N[i] to O[i] is then computed, $C[i] = N[i] / O[i]$, and C[i] is finally normalized to obtain the novelty Novelty[i].
Here, the function NovelDocs() returns the number of bit positions that are 1 in the Bloom filter $Bf[i]$ of the resource and 0 in the Bloom filter $Bf[S]$ of the resources already selected:
$$N[i] = \left|\{\, k : Bf[i]_k = 1 \wedge Bf[S]_k = 0 \,\}\right|$$
The function OverlapDocs() returns the number of bit positions that are 1 in both $Bf[i]$ and $Bf[S]$:
$$O[i] = \left|\{\, k : Bf[i]_k = 1 \wedge Bf[S]_k = 1 \,\}\right|$$
(3) Compute the optimal resource list. Relevance and novelty are combined as follows. The relevance score S[i] and the Bloom filter Bf[i] of each resource are the inputs of the algorithm. The algorithm first takes the resource with the highest relevance as the seed, and then repeatedly selects the best resource from the remaining resources and adds it to the new resource list. The best resource is the one with the highest score obtained by weighting the relevance and the novelty of each resource, where the weight is a parameter in the interval [0,1].
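The greedy procedure can be sketched as follows. Several details are assumptions rather than facts from the text: the combined score is taken to be the convex combination `lam * Relevance[i] + (1 - lam) * Novelty[i]`, the ratio of novel to overlapping bits uses `+ 1` in the denominator to avoid division by zero, the ratio is normalized by its maximum over the remaining resources, and the raw scores are assumed to contain at least one positive value. Bloom filters are represented as Python integers of equal configuration.

```python
def select_resources(raw_scores, filters, lam=0.5, k=None):
    """Greedy ranking that weighs relevance against novelty.

    raw_scores[i] is the raw relevance score S[i]; filters[i] is the Bloom
    filter Bf[i] as an int. `lam` is the weight in [0, 1] (assumed form).
    Returns resource indices in selection order.
    """
    n = len(raw_scores)
    k = n if k is None else k
    s_max = max(raw_scores)  # assumed > 0
    relevance = [s / s_max for s in raw_scores]

    # Seed: the resource with the highest relevance.
    seed = max(range(n), key=lambda i: relevance[i])
    selected = [seed]
    covered = filters[seed]
    remaining = set(range(n)) - {seed}

    while remaining and len(selected) < k:
        ratios = {}
        for i in remaining:
            novel = bin(filters[i] & ~covered).count("1")   # N[i]
            overlap = bin(filters[i] & covered).count("1")  # O[i]
            ratios[i] = novel / (overlap + 1)               # smoothed C[i]
        c_max = max(ratios.values()) or 1.0  # avoid dividing by zero
        best = max(
            sorted(remaining),
            key=lambda i: lam * relevance[i] + (1 - lam) * ratios[i] / c_max,
        )
        selected.append(best)
        covered |= filters[best]  # the chosen resource now counts as covered
        remaining.remove(best)
    return selected
```

With `lam = 1.0` the result is the plain relevance ranking; lowering `lam` pushes resources whose filters largely duplicate the already covered space, such as mirror sites, further down the list.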
Claims (3)
1. A resource selection method in a non-cooperative environment, characterized in that resource relevance and degree of overlap are both taken into account during resource selection, thereby improving query efficiency, the method comprising the following steps:
Step 1: use a relevance-based resource selection method to compute the relevance of each resource and rank the resources, obtaining a resource list ordered by relevance;
Step 2: obtain the fingerprint set of the result documents from the query results, specifically: assume a group of resources <P1, P2, ..., Pi, ..., Pn> and a node that issues a query Q; after the node receives the returned results, a fingerprint extraction technique is applied to each result document to derive a fixed-length number that represents the title of that document;
Step 3: manage the coverage statistics, which comprises three sub-processes: extracting coverage statistics from the fingerprint set of the result documents, storing coverage statistics, and retrieving coverage statistics; the management involves two kinds of operation: storage and retrieval; after a group of coverage statistics is produced, the system distributes the coverage statistics across the resources of the system according to the semantics of the query they belong to, so that the coverage statistics are easy to retrieve;
Step 4: compute the novelty of each resource, specifically: given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results;
Step 5: starting from the resource relevance computed in Step 1, use novelty to adjust the resource list ordered by relevance, so that the number of novel results is maximized;
In Step 2, after the node receives the returned results, a fingerprint is extracted from the title of each result, that is, each result document is represented by a fixed-length number, so that the results returned by each resource correspond to a fingerprint set; a Bloom filter is then used to further compress this fingerprint set, yielding the result fingerprint set of each resource Pi for query Q;
In Step 4, when computing resource novelty, once the Bloom filters have formed the coverage statistics of query Q, the degree of overlap between the corresponding fingerprint sets is computed by comparing the Bloom filters, and the novelty of each resource is finally computed.
2. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 3, coverage statistics are extracted from the fingerprint set obtained in Step 2 and then distributed to the resources for storage; the distribution uses a strategy based on the semantics of the query keywords, so that the coverage statistics of semantically similar queries are grouped into the same class and stored on the same resource; correspondingly, given a query, the coverage statistics related to it are retrieved through the semantic vector of the query, which locates the resource that stores them.
3. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 5, using the resource list ordered by relevance and the novelty of each resource computed in Step 4, the relevance and the novelty of each resource are combined by weighting to obtain the optimal resource list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201210035195 CN102609536B (en) | 2012-02-16 | 2012-02-16 | Resource selection method in non-cooperative environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201210035195 CN102609536B (en) | 2012-02-16 | 2012-02-16 | Resource selection method in non-cooperative environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102609536A CN102609536A (en) | 2012-07-25 |
CN102609536B (en) | 2013-09-18 |
Family
ID=46526908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201210035195 Expired - Fee Related CN102609536B (en) | 2012-02-16 | 2012-02-16 | Resource selection method in non-cooperative environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102609536B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103164698B (en) * | 2013-03-29 | 2016-01-27 | 华为技术有限公司 | Text fingerprints library generating method and device, text fingerprints matching process and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809695B2 (en) * | 2004-08-23 | 2010-10-05 | Thomson Reuters Global Resources | Information retrieval systems with duplicate document detection and presentation functions |
CN101283357A (en) * | 2005-10-11 | 2008-10-08 | 泰普有限公司 | Search using changes in prevalence of content items on the web |
CN101535945A (en) * | 2006-04-25 | 2009-09-16 | 英孚威尔公司 | Full text query and search systems and method of use |
US8396873B2 (en) * | 2010-03-10 | 2013-03-12 | Emc Corporation | Index searching using a bloom filter |
Also Published As
Publication number | Publication date |
---|---|
CN102609536A (en) | 2012-07-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100541495C (en) | A kind of searching method of individual searching engine | |
US7644069B2 (en) | Search ranking method for file system and related search engine | |
Ma et al. | Efficiently finding web services using a clustering semantic approach | |
KR102080362B1 (en) | Query expansion | |
CN104376406A (en) | Enterprise innovation resource management and analysis system and method based on big data | |
CN107391502B (en) | Time interval data query method and device and index construction method and device | |
CN101192235A (en) | Method, system and equipment for delivering advertisement based on user feature | |
CN104412266A (en) | Method and apparatus for multidimensional data storage and file system with a dynamic ordered tree structure | |
CN102591942A (en) | Method and device for automatic application recommendation | |
CN102043863B (en) | Method for Web service clustering | |
CN104375992A (en) | Address matching method and device | |
CN103518187A (en) | Method and system for information modeling and applications thereof | |
CN101271476A (en) | Relevant feedback retrieval method based on clustering in network image search | |
JP2016540332A (en) | Visual-semantic composite network and method for forming the network | |
Martín et al. | Using semi-structured data for assessing research paper similarity | |
Elshater et al. | godiscovery: Web service discovery made efficient | |
CN108959580A (en) | A kind of optimization method and system of label data | |
CN104915388B (en) | It is a kind of that method is recommended based on spectral clustering and the book labels of mass-rent technology | |
CN105404677A (en) | Tree structure based retrieval method | |
CN103020141A (en) | Method and equipment for providing searching results | |
CN101840438B (en) | Retrieval system oriented to meta keywords of source document | |
CN102609536B (en) | Resource selection method in non-cooperative environment | |
KR101592670B1 (en) | Apparatus for searching data using index and method for using the apparatus | |
CN105426490A (en) | Tree structure based indexing method | |
CN109783508A (en) | Data query method, apparatus, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130918 |