CN102609536A - Resource selection method in non-cooperative environment - Google Patents

Resource selection method in non-cooperative environment

Info

Publication number
CN102609536A
CN102609536A CN2012100351954A CN201210035195A CN102609536A CN 102609536 A CN102609536 A CN 102609536A CN 2012100351954 A CN2012100351954 A CN 2012100351954A CN 201210035195 A CN201210035195 A CN 201210035195A CN 102609536 A CN102609536 A CN 102609536A
Authority
CN
China
Prior art keywords
resource
statistical information
degree
inquiry
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100351954A
Other languages
Chinese (zh)
Other versions
CN102609536B (en)
Inventor
任祖杰
徐向华
万健
张纪林
蒋从锋
任永坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN 201210035195
Publication of CN102609536A
Application granted
Publication of CN102609536B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a resource selection method in a non-cooperative environment, which includes the following steps: first, computing the relevance of each resource with a relevance-based resource selection method and sorting by relevance to obtain a relevance-ranked resource list; second, extracting coverage statistics from each resource by means of a fingerprint extraction technique, and compressing them with Bloom filters; third, storing and retrieving the coverage statistics efficiently by means of a distribution strategy based on the semantics of the query keywords; fourth, comparing the Bloom filters to estimate the overlap of the corresponding fingerprint sets and thereby the novelty of each resource; fifth, reordering the candidate resources according to their novelty; and finally, computing a weighted combination of relevance and novelty to obtain an optimal resource list. The method takes both resource relevance and resource overlap into account during resource selection, and thereby improves query efficiency.

Description

A resource selection method in a non-cooperative environment
Technical Field
The present invention relates to a resource selection method in a non-cooperative environment, and more particularly to a resource selection method in a non-cooperative environment that takes both resource relevance and resource overlap into account.
Background
Resource selection is a popular research topic in distributed information retrieval. For a given query Q, a distributed search engine uses a resource selection method to determine the list of resources most relevant to the query, and then sends the query to the resources in that list. A good resource selection method lets each query reach results close to those obtained by querying all resources while involving only a small number of resources. The effectiveness of resource selection therefore directly determines the efficiency of query execution and the quality of the query results.
Most traditional resource selection methods focus on the relevance between resources and the query. They often assume that the document sets of different resources do not overlap, or that the overlap is small enough to be ignored. However, in a P2P search engine in a non-cooperative environment, each resource maintains its document set independently, so a considerable number of identical or very similar documents inevitably exist across resources. For example, well-known digital libraries such as ACM and IEEE hold many similar papers, and news websites such as NetEase and Sina also contain large numbers of similar news pages.
Given this problem, if a resource selection method does not consider the overlap between resource document sets, a query may be sent to two resources with a very high degree of overlap (for example, two mirror sites), which wastes network resources and reduces query efficiency. It is therefore necessary to study a resource selection method that takes both resource overlap and relevance into account.
Summary of the Invention
In view of the above problem, the invention discloses a resource selection method in a non-cooperative environment. When selecting resources, the method considers resource overlap and relevance at the same time and maximizes the expected total number of novel results, which improves the effectiveness of resource selection and thus the efficiency of queries.
The technical solution adopted by the invention to solve this technical problem comprises the following steps:
A resource selection method in a non-cooperative environment takes both resource relevance and resource overlap into account during resource selection so as to improve query efficiency. The method is implemented with the following steps:
Step 1: Using a relevance-based resource selection method, compute the relevance of each resource and sort, obtaining a resource list ranked by relevance.
Step 2: Obtain the fingerprint set of the result documents from the query results. Assume a group of resources <P1, P2 … Pi … Pn> and assume that a node issues a query Q. After the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document.
Step 3: Manage the coverage statistics. This process comprises three sub-processes: extracting coverage statistics from the result fingerprint set, storing coverage statistics, and retrieving coverage statistics. Management involves two kinds of operations: storage and retrieval. After a set of coverage statistics is produced, the system distributes it to the resources of the system for storage according to the semantics of the query in the coverage statistics, which facilitates later retrieval.
Step 4: Compute the novelty of each resource. Given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results.
Step 5: Based on the resource relevance computed in Step 1, adjust the ranked resource list in combination with novelty so that the number of novel results is maximized.
Beneficial effects of the invention:
1. The invention extracts coverage statistics from query results. In subsequent queries these coverage statistics are used to compute the degree of overlap between resources, so that the expected total number of novel results is maximized during resource selection, which improves the effectiveness of resource selection.
2. The invention stores coverage statistics in a Chord network according to the semantic vector space of their queries, so that sets of semantically similar queries can share coverage statistics. This greatly reduces the storage space the system needs for coverage statistics, increases the hit rate of coverage statistics, and addresses the problem of synonymy (several words sharing one meaning).
3. When resources overlap, the invention reduces wasted query messages and effectively improves search efficiency compared with other resource selection methods.
Brief Description of the Drawings
Fig. 1 shows the steps by which the invention performs resource selection in a non-cooperative environment.
Detailed Description
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawing. The specific steps are shown in Fig. 1:
Step 1. Generate the initial resource list. Using a relevance-based resource selection method, compute the relevance of each resource and sort, obtaining a list ranked by relevance.
Step 2. Obtain the fingerprint set of the result documents from the query results. This comprises two sub-steps:
1) Extract the fingerprint set from the results. For each result document, a fingerprint extraction technique extracts a fixed-length string of digits that represents the title content of that result document. Two titles with very similar content can be represented by the same fingerprint. For a given resource and query, the set of all fingerprints is the coverage statistics of that resource. To better handle fingerprint extraction from short text, the invention uses an efficient and robust fingerprinting technique that needs no global statistics: the Shingle-based Discrete Cosine Transform (S-DCT). To filter out noise words, S-DCT deletes stop words and punctuation; it then generates a set of shingles from the word sequence and converts each shingle into a fingerprint using the DCT. Specifically, the S-DCT method comprises the following steps:
(1) Obtain the title content of the result document [Figure DEST_PATH_IMAGE002].
(2) Delete stop words and punctuation marks.
(3) Apply stemming to each word.
(4) Sort the remaining words in lexicographic order to generate a word sequence.
(5) Using a sliding window, generate a set of shingles from the word sequence.
(6) For each shingle, compute the hash values in the shingle.
(7) Shift all hash values vertically so that their mean falls at 0.
(8) Normalize all hash values by the maximum hash value.
(9) Apply the DCT to all normalized hash values.
(10) Quantize each DCT coefficient to a small number of bits.
(11) Concatenate all bits to create the fingerprint.
(12) The fingerprints of all shingles together represent the result document [Figure DEST_PATH_IMAGE002].
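The twelve S-DCT steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the naive `stem` helper, the window size of 3, the MD5-based word hash, the reading of step (6) as one hash per word, and the 2-bit quantization are all hypothetical choices that the source does not specify.

```python
import hashlib
import math
import re

# Illustrative stop-word list only; the source does not give one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to"}

def stem(word):
    # Hypothetical, deliberately naive stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_hash(word):
    # Stable 32-bit hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")

def dct2(values):
    # Unnormalised DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 0.5) * k).
    n = len(values)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(values))
        for k in range(n)
    ]

def shingle_fingerprint(shingle, bits_per_coeff=2):
    hashes = [float(word_hash(w)) for w in shingle]      # step 6 (assumed: per word)
    mean = sum(hashes) / len(hashes)
    shifted = [h - mean for h in hashes]                 # step 7: mean becomes 0
    peak = max(abs(v) for v in shifted) or 1.0           # step 8: assumed max-abs scaling
    normalised = [v / peak for v in shifted]
    coeffs = dct2(normalised)                            # step 9
    levels = 1 << bits_per_coeff
    bound = float(len(shingle))                          # |X_k| <= N when |x_n| <= 1
    fingerprint = 0
    for c in coeffs:
        q = int((c + bound) / (2 * bound) * levels)      # step 10: quantise
        q = min(max(q, 0), levels - 1)
        fingerprint = (fingerprint << bits_per_coeff) | q  # step 11: merge bits
    return fingerprint

def title_fingerprints(title, window=3):
    words = re.findall(r"[a-z0-9]+", title.lower())      # step 2: drops punctuation
    words = [stem(w) for w in words if w not in STOP_WORDS]  # steps 2-3
    words.sort()                                         # step 4
    if not words:
        return set()
    if len(words) < window:
        shingles = [tuple(words)]
    else:                                                # step 5: sliding window
        shingles = [tuple(words[i:i + window]) for i in range(len(words) - window + 1)]
    return {shingle_fingerprint(s) for s in shingles}    # step 12
```

Because stop words, punctuation, case, and word order are removed before hashing, titles such as "The Resource Selection Methods!" and "resource selection method" yield the same fingerprint set in this sketch. The tiny 6-bit fingerprints produced by these defaults are only for readability; a practical setting would use more bits per coefficient or longer shingles.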
2) Compress the fingerprint set. To save bandwidth and storage space, the fingerprint set is stored in a Bloom filter. The coverage statistics of a resource are therefore represented as:
[Figure DEST_PATH_IMAGE004]
In general, the fingerprint of a document should be generated from the full content of the document.
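Compressing a fingerprint set into a Bloom filter can be sketched as below. The filter size, the number of hash functions, and the SHA-256-based position function are assumptions for illustration; the source does not state which parameters or hash functions are used.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def compress_fingerprints(fingerprints, num_bits=1024, num_hashes=3):
    # Build the coverage statistics of one resource for one query.
    bf = BloomFilter(num_bits, num_hashes)
    for fp in fingerprints:
        bf.add(fp)
    return bf
```

For example, `compress_fingerprints({11, 42, 7})` returns a filter in which all three fingerprints test as present, while the stored state is a fixed 1024-bit value regardless of how many fingerprints are added.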
Step 3. Manage the coverage statistics.
After a set of coverage statistics is produced, the system distributes it across the P2P network according to the semantics of the query terms in the statistics. The coverage statistics of semantically similar queries can be placed on the same resource. Correspondingly, the coverage statistics for a particular keyword are looked up by a semantic query in the semantic space for that keyword: given a query, the coverage statistics related to it are retrieved through the query's semantic vector. Coverage statistics are thus stored and retrieved efficiently, while the storage overhead of the system is reduced and its scalability is improved.
To reduce the storage overhead of the system and improve its scalability, the invention uses a distribution strategy based on the semantics of the query keywords. Each query vector is mapped to its semantic vector using latent semantic indexing (LSI), and the semantic vector is then mapped to an integer value within the Chord ID range, which determines the resource on which the coverage statistics should be placed. This process comprises three sub-processes: extracting coverage statistics from the result fingerprint set, storing coverage statistics, and retrieving coverage statistics.
1) Extract coverage statistics from the result fingerprint set. The algorithm flow is:
[Figure DEST_PATH_IMAGE006]
The steps for building the latent semantic index and mapping into the semantic space are as follows:
(1) Analyze the document collection and build the term-document matrix corresponding to the document set;
(2) Perform singular value decomposition (SVD) on the term-document matrix;
(3) Reduce the dimensionality of the matrices obtained from the SVD;
(4) Build the latent semantic space from the reduced matrices.
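The four steps above correspond to the standard formulation of latent semantic indexing, which can be written as follows. The symbols are assumptions introduced here for illustration and do not appear in the source: X is the m-by-n term-document matrix, k is the number of retained dimensions, U_k, Σ_k and V_k are the truncated SVD factors, q is the m-dimensional term vector of a query, and q̂ is its k-dimensional semantic vector.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\mathsf{T}},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q
```

Under this formulation, two queries that use different but co-occurring terms tend to receive nearby vectors q̂, which is the property the distribution strategy relies on when it places the coverage statistics of semantically similar queries together.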
2) Storing coverage statistics. The algorithm proceeds as follows:
(1) After resource A obtains the coverage statistics CV(Q) for query Q, resource A uses latent semantic indexing to obtain the semantic vector VQ of the query.
(2) VQ is then mapped into the Chord ID space, and the route points to resource B.
(3) Finally, CV(Q) is sent to its destination, resource B.
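The storage process can be sketched as below. The source does not say how a semantic vector is turned into a Chord ID, so the sign-bit quantization in `semantic_vector_to_chord_id` is a hypothetical choice, as are the 8-bit identifier space and the in-memory `stores` dictionary that stands in for the network. The `successor` function follows the usual Chord rule that a key is held by the first node whose ID is greater than or equal to the key, wrapping around the ring.

```python
import bisect

ID_BITS = 8  # hypothetical identifier space: Chord IDs in [0, 2**ID_BITS)

def semantic_vector_to_chord_id(vector, id_bits=ID_BITS):
    # Assumed mapping: one bit per leading component, 1 if positive else 0.
    # Vectors whose leading components share signs land on the same ID.
    chord_id = 0
    for value in vector[:id_bits]:
        chord_id = (chord_id << 1) | (1 if value > 0 else 0)
    return chord_id

def successor(node_ids, key, id_bits=ID_BITS):
    # First node ID >= key on the ring, wrapping to the smallest ID.
    ring = sorted(node_ids)
    index = bisect.bisect_left(ring, key % (1 << id_bits))
    return ring[index % len(ring)]

def store_coverage(stores, node_ids, query, vq, cv):
    # Resource A computes VQ, maps it to the ring, and sends CV(Q) to resource B.
    target = successor(node_ids, semantic_vector_to_chord_id(vq))
    stores.setdefault(target, {})[query] = (vq, cv)
    return target
```

With nodes at IDs 10, 100 and 200, a semantic vector `[0.5, -0.1, 0.3]` maps to ID 5 in this sketch and is therefore stored on node 10. A real Chord deployment would resolve the successor through finger-table routing rather than a sorted list.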
3) Retrieving coverage statistics. When a resource (say A) initiates a query Q, the query is converted into its semantic vector VQ, which is then mapped to a Chord ID that points to resource B. If resource B holds the coverage statistics CV(Q) for query Q, it sends CV(Q) to resource A. If not, resource B looks for a query Q' similar to Q that satisfies [Figure DEST_PATH_IMAGE008]. If one is found, the coverage statistics CV(Q') are returned. If no similar query is found either, a coverage statistics lookup failure is returned, and resource A is notified that it needs to extract the coverage statistics of query Q after the results come back.
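The lookup performed on resource B can be sketched as below. The similarity condition is given only as a formula image in the source, so cosine similarity against a fixed `threshold` is an assumption, and the `store` layout (query text mapped to a pair of semantic vector and coverage statistics) is hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lookup_coverage(store, query, vq, threshold=0.9):
    """store maps query text -> (semantic vector, coverage statistics)."""
    if query in store:                       # exact hit: return CV(Q)
        return store[query][1]
    best_cv, best_sim = None, threshold
    for other_vq, cv in store.values():      # fall back to the most similar Q'
        sim = cosine(vq, other_vq)
        if sim >= best_sim:
            best_cv, best_sim = cv, sim
    return best_cv                           # None signals a lookup failure
```

A `None` result corresponds to the failure case in the text: the requesting resource then extracts the coverage statistics itself once the query results return.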
Step 4. Estimate the novelty of each resource.
Compare the Bloom filter [Figure DEST_PATH_IMAGE012] of a resource [Figure DEST_PATH_IMAGE010] with the Bloom filter [Figure DEST_PATH_IMAGE014], where S is the set of resources already selected. The Bloom filter [Figure DEST_PATH_IMAGE016] represents the document space already covered. The novelty of the resource [Figure DEST_PATH_IMAGE010] is defined as:
[Figure DEST_PATH_IMAGE018]
that is, the number of bit positions that are set to 1 in the resource's Bloom filter [Figure DEST_PATH_IMAGE012] and are 0 in the Bloom filter of the covered document space [Figure DEST_PATH_IMAGE016]. Similarly, the overlap between the Bloom filter [Figure DEST_PATH_IMAGE016] and the Bloom filter [Figure DEST_PATH_IMAGE012] is defined as:
[Figure DEST_PATH_IMAGE020]
Step 5. Rank the resources by combining relevance and novelty. This comprises three sub-processes:
(1) Using the CORI method, compute the relevance of each resource to the query and sort from high to low by relevance to obtain a candidate resource list. The algorithm for computing the relevance relevance[i] of all resources is:
[Figure DEST_PATH_IMAGE024]
where s max is the maximum relevance score among all candidate resources. To obtain the normalized relevance score relevance[i], the raw relevance score s[i] of each resource is divided by s max.
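The normalization described above can be sketched as follows. The raw scores are taken as given (for example, produced by CORI); the dictionary layout and the function names are illustrative assumptions, not part of the source.

```python
def normalise_relevance(raw_scores):
    """raw_scores maps resource name -> raw relevance score s[i]."""
    if not raw_scores:
        return {}
    s_max = max(raw_scores.values())
    if s_max <= 0:
        # Assumed handling: no resource is relevant, so all scores become 0.
        return {name: 0.0 for name in raw_scores}
    # relevance[i] = s[i] / s_max, so the best resource scores exactly 1.0.
    return {name: s / s_max for name, s in raw_scores.items()}

def rank_by_relevance(raw_scores):
    # Candidate resource list, sorted from highest to lowest relevance.
    relevance = normalise_relevance(raw_scores)
    return sorted(relevance, key=relevance.get, reverse=True)
```

Dividing by the maximum puts relevance on the same [0, 1] scale as the normalized novelty, which is what allows the two to be combined by weighting in the final sub-process.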
(2) Compute the novelty of each resource and reorder the candidate resources according to novelty. The novelty novelty[i] of a resource is computed as follows:
[Figure DEST_PATH_IMAGE028]
The procedure calls two functions, novelDocs() and overlapDocs(), which compute for each resource, relative to the list of resources already selected, the number of novel documents n[i] and the number of overlapping documents o[i], respectively. It then computes the ratio c[i] of n[i] to o[i] and finally normalizes c[i] to obtain the novelty novelty[i].
The function novelDocs() returns the number of bit positions that are 1 in the Bloom filter [Figure DEST_PATH_IMAGE012] and 0 in the Bloom filter [Figure DEST_PATH_IMAGE016]. Specifically, it is computed as:
[Figure DEST_PATH_IMAGE030]
The function overlapDocs() returns the total number of bit positions that are 1 in both the Bloom filter [Figure DEST_PATH_IMAGE012] and the Bloom filter [Figure DEST_PATH_IMAGE016]. It is computed as:
[Figure DEST_PATH_IMAGE032]
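The two bit-counting functions and the novelty computation can be sketched as below, with each Bloom filter held as a Python int. The functions `novel_docs` and `overlap_docs` follow the verbal definitions of novelDocs() and overlapDocs() in the text. Two details are assumptions because the source gives them only as images: the denominator is clamped to 1 when there is no overlap, and c[i] is normalized by dividing by its maximum.

```python
def novel_docs(bf_i, bf_s):
    # Bits that are 1 in the resource's filter and 0 in the covered-space filter.
    return bin(bf_i & ~bf_s).count("1")

def overlap_docs(bf_i, bf_s):
    # Bits that are 1 in both filters.
    return bin(bf_i & bf_s).count("1")

def novelty_scores(candidate_filters, bf_s):
    """candidate_filters maps resource name -> Bloom filter bits (int)."""
    ratios = {}
    for name, bf_i in candidate_filters.items():
        n = novel_docs(bf_i, bf_s)
        o = overlap_docs(bf_i, bf_s)
        ratios[name] = n / max(o, 1)         # c[i]; clamp is an assumption
    c_max = max(ratios.values(), default=0.0)
    # Assumed normalisation: divide by the largest ratio so novelty is in [0, 1].
    return {name: (c / c_max if c_max else 0.0) for name, c in ratios.items()}
```

These counts are bit counts, not exact document counts: each fingerprint sets several bits and different fingerprints can collide, so the values approximate the number of novel and overlapping documents.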
(3) Compute the optimal resource list. Relevance and novelty are combined as follows:
The relevance score s[i] of each resource and its Bloom filter bf[i] serve as the input of the algorithm. The algorithm first takes the resource with the highest relevance as the seed, and then repeatedly selects the best resource from the remaining resources and adds it to the new resource list. The best resource is determined by a weighted combination of the relevance and the novelty of each resource:
[Figure DEST_PATH_IMAGE036]
where [Figure DEST_PATH_IMAGE038] is a parameter in the interval [0,1].
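The greedy selection can be sketched as below. The weighting formula is available only as an image in the source, so the linear combination `weight * relevance + (1 - weight) * novelty`, the choice to put `weight` on relevance, the list length `k`, and the name-based tie-breaking are all assumptions made for illustration.

```python
def select_resources(relevance, filters, k, weight=0.5):
    """relevance: name -> normalised score; filters: name -> Bloom filter bits (int)."""
    remaining = set(relevance)
    if not remaining or k <= 0:
        return []
    # Seed with the most relevant resource (ties broken by name order).
    seed = max(sorted(remaining), key=lambda r: relevance[r])
    selected, covered = [seed], filters[seed]
    remaining.remove(seed)
    while remaining and len(selected) < k:
        # Ratio of novel to overlapping bits against everything covered so far.
        ratios = {}
        for r in remaining:
            n = bin(filters[r] & ~covered).count("1")
            o = bin(filters[r] & covered).count("1")
            ratios[r] = n / max(o, 1)
        c_max = max(ratios.values())

        def score(r):
            novelty = ratios[r] / c_max if c_max else 0.0
            return weight * relevance[r] + (1 - weight) * novelty

        best = max(sorted(remaining), key=score)
        selected.append(best)
        covered |= filters[best]             # extend the covered document space
        remaining.remove(best)
    return selected
```

With resources A and B acting as mirrors (identical filters) and a less relevant but disjoint resource C, `weight=0.5` selects A and then C, whereas `weight=1.0` reduces to pure relevance ranking and selects A and then B. This illustrates how the parameter trades relevance against avoiding redundant queries.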

Claims (5)

1. A resource selection method in a non-cooperative environment, characterized in that resource relevance and resource overlap are both taken into account during resource selection so as to improve query efficiency, the method being implemented with the following steps:
Step 1: Using a relevance-based resource selection method, compute the relevance of each resource and sort, obtaining a resource list ranked by relevance;
Step 2: Obtain the fingerprint set of the result documents from the query results; assume a group of resources <P1, P2 … Pi … Pn> and assume that a node issues a query Q; after the node receives the returned results, a fingerprint extraction technique is applied to each result document to extract a fixed-length string of digits that represents the title content of that result document;
Step 3: Manage the coverage statistics; this process comprises three sub-processes: extracting coverage statistics from the result fingerprint set, storing coverage statistics, and retrieving coverage statistics; the management involves two kinds of operations: storage and retrieval; after a set of coverage statistics is produced, the system distributes it to the resources of the system for storage according to the semantics of the query in the coverage statistics, which facilitates retrieval of the coverage statistics;
Step 4: Compute the novelty of each resource; given a group of resources and their coverage statistics, compute the number of novel results each resource contains, and from that the novelty of each resource with respect to the query results;
Step 5: Based on the resource relevance computed in Step 1, adjust the ranked resource list in combination with novelty so that the number of novel results is maximized.
2. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 2, after the node receives the returned results, a fingerprint is extracted from the title content of each result, that is, each result document is represented by a fixed-length string of digits, so that the results returned by each resource correspond to one fingerprint set; the fingerprint set is then further compressed with a Bloom filter, yielding the result fingerprint set of each resource Pi for query Q.
3. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 3, coverage statistics are extracted from the fingerprint set obtained in Step 2 and then distributed to the resources for storage; the distribution uses a strategy based on the semantics of the query keywords, under which the coverage statistics of semantically similar queries are grouped into one class and stored on the same resource; correspondingly, given a query, the coverage statistics related to it are retrieved through the query's semantic vector, so that the resource storing the related coverage statistics is found quickly, the storage overhead of the system is reduced, and the scalability of the system is improved.
4. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: when resource novelty is computed in Step 4, after the Bloom filters form the coverage statistics of query Q, the degree of overlap between the Bloom filters is compared to compute the overlap of the corresponding fingerprint sets, from which the novelty of each resource is finally obtained.
5. The resource selection method in a non-cooperative environment according to claim 1, characterized in that: in Step 5, using the resource list ranked by relevance, the novelty of each resource is computed, and the relevance and novelty of each resource are weighted to obtain the optimal resource list.
CN 201210035195 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment Expired - Fee Related CN102609536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210035195 CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210035195 CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Publications (2)

Publication Number Publication Date
CN102609536A (en) 2012-07-25
CN102609536B (en) 2013-09-18

Family

ID=46526908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210035195 Expired - Fee Related CN102609536B (en) 2012-02-16 2012-02-16 Resource selection method in non-cooperative environment

Country Status (1)

Country Link
CN (1) CN102609536B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101076800A (en) * 2004-08-23 2007-11-21 汤姆森环球资源公司 Repetitive file detecting and displaying function
CN101283357A (en) * 2005-10-11 2008-10-08 泰普有限公司 Search using changes in prevalence of content items on the web
CN101535945A (en) * 2006-04-25 2009-09-16 英孚威尔公司 Full text query and search systems and method of use
US20110225191A1 (en) * 2010-03-10 2011-09-15 Data Domain, Inc. Index searching using a bloom filter

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
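Mao Xuguang (毛许光): "Research on Web Page Duplicate Detection Algorithms", China Master's Theses Full-text Database, Information Science and Technology *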

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164698A (en) * 2013-03-29 2013-06-19 华为技术有限公司 Method and device of generating fingerprint database and method and device of fingerprint matching of text to be tested
CN103164698B (en) * 2013-03-29 2016-01-27 华为技术有限公司 Text fingerprints library generating method and device, text fingerprints matching process and device

Also Published As

Publication number Publication date
CN102609536B (en) 2013-09-18

Similar Documents

Publication Publication Date Title
Huang et al. Embedding-based retrieval in facebook search
Ma et al. Efficiently finding web services using a clustering semantic approach
US7644069B2 (en) Search ranking method for file system and related search engine
CN107391502B (en) Time interval data query method and device and index construction method and device
US20120130997A1 (en) Hybrid-distribution model for search engine indexes
JP6216467B2 (en) Visual-semantic composite network and method for forming the network
CN101127043A (en) Lightweight individualized search engine and its searching method
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
CN103577416A (en) Query expansion method and system
CN101334773A (en) Method for filtrating search engine searching result
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
CN104484392B (en) Query sentence of database generation method and device
US20140032568A1 (en) System and Method for Indexing Streams Containing Unstructured Text Data
CN113434778B (en) Recommendation method based on regularization framework and attention mechanism
Martín et al. Using semi-structured data for assessing research paper similarity
CN104915388B (en) It is a kind of that method is recommended based on spectral clustering and the book labels of mass-rent technology
Dourado et al. Bag of textual graphs (BoTG): A general graph‐based text representation model
CN103150336A (en) Sky line online calculation method based on user clustering
CN101840438A (en) Retrieval system oriented to meta keywords of source document
CN102609536A (en) Resource selection method in non-cooperative environment
CN103559269A (en) Knowledge recommending method for mobile news subscription
Noughabi et al. Predicting Students' Behavioral Patterns in University Networks for Efficient Bandwidth Allocation: A Hybrid Data Mining Method (Application Paper)
Podnar et al. Beyond term indexing: A P2P framework for web information retrieval
CN101763441B (en) Technology organizing search results in active directory mode
Ji et al. Vocabulary hierarchy optimization for effective and transferable retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130918
