CN103064846A

CN103064846A - Retrieval device and retrieval method

Info

Publication number: CN103064846A
Application number: CN2011103195652A
Authority: CN
Inventors: 吴尉林; 许欢庆; 史献忠; 郭永福; 陈沛
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY CO., LTD.
Priority date: 2011-10-20
Filing date: 2011-10-20
Publication date: 2013-04-24
Anticipated expiration: 2031-10-20
Also published as: CN103064846B

Abstract

The invention provides a retrieval device and a retrieval method. The retrieval device is used in an information retrieval and search engine system and comprises: a minimum hit window acquisition module, which obtains the minimum hit window in a document of the plurality of keywords used in a query; a global proximity calculation module, which calculates the extended hit-window length of the minimum hit window from the window's hit length and the number of inversion pairs of the keywords within it, and takes this value as the global proximity of the keywords; a position correlation calculation module, which calculates the position correlation of the keywords in the document according to the global proximity; and a result generation module, which sorts the documents according to the position correlation and generates the retrieval results. The device and method improve the global proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

Description

Retrieval device and retrieval method

Technical field

The present invention relates to the field of information retrieval, and in particular to a retrieval device and a retrieval method.

Background technology

With the progress of computer technology, and Internet technology in particular, electronic information (e-books, web pages and so on) is growing explosively. Faced with massive, scattered and unordered digital information, people urgently need methods and tools for quickly finding and locating the information they need. Information retrieval and search engine systems arose to meet this need. A typical search engine system comprises a download subsystem (collecting and acquiring information), a preprocessing and indexing subsystem (processing and organizing information) and a retrieval subsystem (providing query services to users). The retrieval subsystem accepts the query entered by the user and returns a list of retrieval results ordered by some ranking method.

The ranking of retrieval results is performed by the relevance ranking module, which is the core of a search engine. This module relies on a retrieval model to score documents. The scoring factors usually considered include the tf-idf of the query terms (their frequency of occurrence in the document and their inverse document frequency), the importance of the web page (such as PageRank), and the position correlation of the query terms in the document (a score based on the positions and order in which the query keywords appear in the document). Position correlation is one of the key factors in search engine quality, because it largely reflects the semantic relevance between query and document. For example, suppose two documents both contain all keywords of a multi-word query; in document 1 the matched keywords are adjacent, while in document 2 they are scattered across two different sentences. Document 1 should clearly rank higher than document 2.

Methods of calculating position correlation fall roughly into two classes:

1) Build a hybrid index: besides the keywords in the web page, also index keyword N-tuples (n-grams) or phrases, and calculate the position correlation score according to how the keyword N-tuples of the query match;

2) Build only a keyword index, but record the positions at which every keyword occurs in the document, and then calculate the position correlation according to some proximity measure.

The first class of methods does not need to record keyword positions, so its space overhead is relatively small. Because of the limited performance of computers, early search engines mainly took this approach (usually indexing keyword 2-tuples). Its drawback is that keyword N-tuples often reflect only local information. For example, when only 2-tuples are indexed, the query "student of Peking University" yields the keyword 2-tuples "Peking University" and "university student". A web page may contain both "Peking University" and "university student" yet place them far apart, meaning that its content is not directly about a "student of Peking University". If the score is computed directly from the 2-tuple hits, this page still receives a high position correlation score. Moreover, the method is effective only when a keyword N-tuple is hit. For example, it cannot distinguish the following cases: documents 1 and 2 both contain the keywords "A" and "B" of the query "AB", but not adjacently; in web page 1 one word separates "A" and "B", whereas in web page 2 they are 100 words apart. In addition, the index table expands (the number of index entries grows greatly), and index maintenance and the retrieval process become complicated.

One existing vocabulary-based computer indexing and retrieval method proposes a variant of the 2-tuple index adapted to the characteristics of Chinese. For example, the text fragment "Shanghai local conditions and customs" in a document is segmented as "Shanghai/local conditions and customs", and "# sea wind" (called a "hidden keyword" in that patent) is also added to the index; a document is given extra weight at retrieval time if a hidden keyword is hit. Because only the last character of the former keyword and the first character of the latter keyword of each adjacent pair are indexed as a 2-tuple, the method reduces the size of the index vocabulary to some extent, but it avoids neither the locality of N-tuples nor the complexity of index maintenance and retrieval.

Another existing scheme, a method of judging the position correlation of a group of query keywords or characters in a web page, is also a variant of the 2-tuple index. Rather than indexing all 2-tuples, it records, for each keyword in a document, the preceding and following keywords that co-occur with it most frequently (forward and backward tables); at retrieval time a document is given extra weight if the words before and after a query keyword appear in that keyword's forward or backward table. The overall space overhead is small, but only partial information is recorded, so the method applies narrowly and is effective only for some queries. Looking up the forward and backward tables at retrieval time is also inefficient.

The second class of methods needs to record the positions of all keywords in the web page, so its space overhead is larger, as is the time overhead of calculating position correlation. Its advantages are that the index structure, index maintenance and retrieval process are all fairly simple, and that it is more flexible and can support different position correlation models. Commonly used proximity measures fall into two classes:

Global proximity measures, which consider the proximity of all keywords in the query:

The main global proximity measure is the minimum hit-window length, that is, the length of the smallest window in the document that contains all the query keywords. Its advantage is that it reflects the proximity of the query as a whole within the document. It works well for short queries (two or three words) but is less suitable for long queries, because the longer the query, the less likely it is that all keywords fall within a small window. One existing scheme extends the definition of a hit window by relaxing the requirement that all keywords must appear: any window containing one or more keywords counts as a hit window. Term frequencies are weighted according to the hit-window length and the number of keywords the window contains, and all weighted term frequencies are finally accumulated with the BM25 formula.
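The minimum hit-window length described above can be sketched with a sliding window over the document's tokens. This is an illustrative sketch, not the patent's implementation: the function names `min_hit_window` and `hit_window_length` are hypothetical, the document is assumed to be already segmented into a token list, and the window length is assumed to be the number of tokens the window spans.

```python
def min_hit_window(doc_tokens, keywords):
    """Return (start, end) token indices of the shortest window that
    contains every keyword at least once, or None if no such window exists."""
    needed = set(keywords)
    if not needed:
        return None
    counts = {}    # occurrences of each needed keyword inside the current window
    covered = 0    # number of distinct needed keywords inside the current window
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:
                covered += 1
        # Shrink from the left while the window still covers all keywords.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            ltok = doc_tokens[left]
            if ltok in needed:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    covered -= 1
            left += 1
    return best


def hit_window_length(window):
    """Length of a window, assumed here to be the number of tokens it spans."""
    start, end = window
    return end - start + 1
```

For a long query the returned window tends to be large or absent, which is the weakness of this measure noted in the text.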

Local proximity measures, which consider the proximity of keyword pairs in the query:

A representative local proximity measure is the word-pair distance score accumulation method: it measures the distance of every "neighboring" keyword pair in the document (no other query keyword may occur between the two keywords), converts each distance into a term frequency, and finally accumulates the converted term frequencies with the BM25 formula as the position correlation score (see Y. Rasolofo and J. Savoy, "Term proximity scoring for keyword-based retrieval systems", in Proceedings of the 25th European Conference on IR Research (ECIR 2003), pp. 207-218, 2003). Its advantages are high computational efficiency and that it takes the scores of multiple word-pair distances into account; its main drawback is locality, since it considers only the distances between neighboring keywords in the document.

In summary, the first class of methods (building an N-tuple index) has relatively small space and time complexity, but index maintenance and retrieval are complicated. Most importantly, keyword N-tuples often reflect only local information, so they improve retrieval quality only to a limited degree. The second class (recording keyword positions and calculating position correlation with some proximity measure) has relatively large space and time complexity but greater potential to improve retrieval quality. Now that computer performance has improved substantially, the space and time demands of the second class can be met, and it is gradually becoming mainstream. Current methods of the second class still have various shortcomings, however. The hit-window length method usually considers only how tightly the keywords cluster; it ignores whether the order of the matched keywords in the window agrees with their order in the query, and it performs poorly on long queries. The word-pair distance method reflects only local information.

Therefore, a new position correlation scheme (an improved hit-window length method) is needed that addresses the shortcomings of the two current mainstream position correlation methods (the hit-window length method and the word-pair distance method), further improving retrieval quality while maintaining high retrieval efficiency.

Summary of the invention

The technical problem to be solved by the invention is to propose a new position correlation scheme (an improved hit-window length method) that addresses the shortcomings of the two current mainstream position correlation methods (the hit-window length method and the word-pair distance method), further improving retrieval quality while maintaining high retrieval efficiency.

In view of this, the invention provides a retrieval device for use in an information retrieval and search engine system, comprising: a minimum hit window acquisition module, which obtains the minimum hit window in a document of the plurality of keywords used in a query; a global proximity calculation module, which calculates the extended hit-window length of the minimum hit window from the hit-window length of the minimum hit window and the number of inversion pairs of the keywords within it, and takes this value as the global proximity of the keywords; a position correlation calculation module, which calculates the position correlation of the keywords in the document according to the global proximity; and a result generation module, which sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the global proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, the global proximity calculation module calculates the extended hit-window length of the minimum hit window by the formula ExpSpanLen(Q, D) = OriSpanLen + ε · InvNum, where D denotes the document, Q denotes the plurality of keywords, OriSpanLen denotes the hit-window length of a given hit window, InvNum denotes the number of inversion pairs in that hit window, ε denotes a preset value, and ExpSpanLen(Q, D) denotes the extended hit-window length of that hit window. This technical solution defines the extended hit-window length in a reasonable way, which helps achieve accurate and efficient retrieval.
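The formula ExpSpanLen(Q, D) = OriSpanLen + ε · InvNum can be illustrated with a short sketch. The names `count_inversions` and `exp_span_len` are hypothetical, and so is the counting rule: here an inversion pair is assumed to be any two keyword occurrences in the window whose document order is the reverse of their query order. The patent's own counting procedure (Fig. 8) is not reproduced in this text, and the default value of `epsilon` is arbitrary.

```python
def count_inversions(window_tokens, query):
    """Count pairs of keyword occurrences in the window that appear in the
    opposite order to the query (assumed definition of an inversion pair)."""
    # Rank of each query term; for a repeated term the last index is kept.
    rank = {term: i for i, term in enumerate(query)}
    seq = [rank[t] for t in window_tokens if t in rank]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def exp_span_len(ori_span_len, inv_num, epsilon=1.0):
    """ExpSpanLen = OriSpanLen + epsilon * InvNum (epsilon is a preset value)."""
    return ori_span_len + epsilon * inv_num
```

With the query A, B, C, a window reading "C x B A" has length 4 and three inversion pairs, so with ε = 0.5 its extended length is 5.5, while a window reading "A B C" keeps its original length of 3. A window whose keywords follow the query order is therefore treated as "shorter" than an equally long window with scrambled keywords.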

The invention also provides a retrieval method for use in an information retrieval and search engine system, comprising: step 202, in which the minimum hit window acquisition module obtains the minimum hit window in a document of the plurality of keywords used in a query; step 204, in which the global proximity calculation module calculates the extended hit-window length of the minimum hit window from the hit-window length of the minimum hit window and the number of inversion pairs of the keywords within it, and takes this value as the global proximity of the keywords; step 206, in which the position correlation calculation module calculates the position correlation of the keywords in the document according to the global proximity; and step 208, in which the result generation module sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the global proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, in step 204 the global proximity calculation module calculates the extended hit-window length of the minimum hit window by the formula ExpSpanLen(Q, D) = OriSpanLen + ε · InvNum, where D denotes the document, Q denotes the plurality of keywords, OriSpanLen denotes the hit-window length of a given hit window, InvNum denotes the number of inversion pairs in that hit window, ε denotes a preset value, and ExpSpanLen(Q, D) denotes the extended hit-window length of that hit window. This technical solution defines the extended hit-window length in a reasonable way, which helps achieve accurate and efficient retrieval.

The invention also provides a retrieval device for use in an information retrieval and search engine system, comprising: a minimum distance calculation module, which calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query; a local proximity calculation module, which calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the keywords in the document and takes it as the local proximity of the keywords; a position correlation calculation module, which calculates the position correlation of the keywords in the document according to the local proximity; and a result generation module, which sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the local proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, the local proximity calculation module calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \sqrt[\frac{2}{n(n - 1)}]{\prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D)},

where D denotes the document, Q denotes the plurality of keywords, t_{1}, t_{2} denote a keyword pair, the keyword pair being a pair of neighboring words, MinDist(t_{1}, t_{2}; D) denotes the minimum distance between t_{1} and t_{2} in D, and GeoMeanMinDist(Q, D) denotes the geometric mean minimum distance. This technical solution defines the geometric mean minimum distance in a reasonable way, which helps achieve accurate and efficient retrieval and, for neighboring word pairs, also significantly improves retrieval efficiency.
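As a rough illustration of the geometric mean minimum distance, the sketch below assumes the intended reading is the ordinary geometric mean, that is, the product of the pairwise minimum distances raised to the power 2/(n(n - 1)), taken over all unordered pairs of the n query keywords that occur in the document. The function names and the input format (a dictionary from keyword to its list of token positions) are hypothetical, and plain absolute position differences are used as distances.

```python
import math
from itertools import combinations


def min_dist(positions_a, positions_b):
    """Smallest absolute position difference between any two occurrences."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def geo_mean_min_dist(positions):
    """Geometric mean of the pairwise minimum distances.

    positions maps each matched query keyword to the list of its token
    positions in the document. Returns None when fewer than two keywords
    are matched, since no pair exists.
    """
    pairs = list(combinations(sorted(positions), 2))
    if not pairs:
        return None
    # Summing logarithms avoids overflow from multiplying many distances.
    # Distinct keywords occupy distinct positions, so every distance is >= 1.
    log_sum = sum(math.log(min_dist(positions[a], positions[b])) for a, b in pairs)
    return math.exp(log_sum / len(pairs))
```

With three keywords there are three pairs, so the result is the cube root of the product of the three minimum distances. A geometric mean is less dominated by one very distant pair than an arithmetic mean would be.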

In the above technical solution, preferably, the minimum distance calculation module selects the minimum distance of a keyword pair from among the pair's several distances, where, when the keyword pair occurs in reverse order, the distance of the pair is calculated by the formula Dist(t_{i}, t_{j}) = Dist(t_{j}, t_{i}) + ρ, where i < j, ρ denotes a preset value, and Dist(t_{i}, t_{j}) denotes the distance of the keyword pair. This technical solution puts word-pair distances in forward and reverse order on a common scale for the minimum distance calculation, ensuring that the minimum distance is selected accurately.
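One way to read the reverse-order rule is sketched below: for two keywords where t_i precedes t_j in the query, an occurrence pair in which t_j comes first in the document is charged its positional gap plus the penalty ρ. This reading, the function names and the default `rho` are assumptions made for illustration.

```python
def pair_distance(pos_i, pos_j, rho=1.0):
    """Distance of one occurrence pair, where t_i precedes t_j in the query.

    pos_i and pos_j are the document positions of t_i and t_j. A pair in
    query order costs its positional gap; a reversed pair costs gap + rho.
    """
    gap = abs(pos_j - pos_i)
    return gap if pos_i < pos_j else gap + rho


def min_pair_distance(positions_i, positions_j, rho=1.0):
    """Minimum over all occurrence pairs of the order-aware distance."""
    return min(pair_distance(a, b, rho) for a in positions_i for b in positions_j)
```

Because forward and reversed occurrences are put on the same scale before the minimum is taken, a reversed pair two tokens apart with ρ = 3 (cost 5) loses to a forward pair four tokens apart (cost 4).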

The invention also provides a retrieval method for use in an information retrieval and search engine system, comprising: step 402, in which the minimum distance calculation module calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query; step 404, in which the local proximity calculation module calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the keywords in the document and takes it as the local proximity of the keywords; step 406, in which the position correlation calculation module calculates the position correlation of the keywords in the document according to the local proximity; and step 408, in which the result generation module sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the local proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, in step 404 the local proximity calculation module calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \sqrt[\frac{2}{n(n - 1)}]{\prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D)}.

In the above technical solution, preferably, in step 402 the minimum distance calculation module selects the minimum distance of a keyword pair from among the pair's several distances, where, when the keyword pair occurs in reverse order, the distance of the pair is calculated by the formula Dist(t_{i}, t_{j}) = Dist(t_{j}, t_{i}) + ρ, where i < j, ρ denotes a preset value, and Dist(t_{i}, t_{j}) denotes the distance of the keyword pair. This technical solution puts word-pair distances in forward and reverse order on a common scale for the minimum distance calculation, ensuring that the minimum distance is selected accurately.

The invention also provides a retrieval device for use in an information retrieval and search engine system, comprising: a position correlation calculation module, which calculates the position correlation in a document of the plurality of keywords used in a query according to the keywords' global proximity and local proximity in the document and a preset transformation function; and a result generation module, which sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves both the global and the local proximity measure; a reasonable position correlation can be calculated from the improved measures, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, the transformation function comprises: π(Q, D) = c + \log(λ + e^{-δ(Q, D)}) and/or [second formula missing from the source text],

where D denotes the document, Q denotes the plurality of keywords, δ(Q, D) denotes the global proximity or the local proximity, c and λ denote preset values, and π(Q, D) denotes the position correlation. This technical solution converts proximity into position correlation effectively, which helps achieve highly accurate retrieval.
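The transformation π(Q, D) = c + log(λ + e^{-δ(Q, D)}) maps a proximity value, where smaller means closer, to a score, where larger means more relevant. A minimal sketch follows; the function name and the default values of `c` and `lam` are arbitrary assumptions, and `lam` stands for λ because `lambda` is a reserved word in Python.

```python
import math


def position_correlation(delta, c=0.0, lam=1.0):
    """pi = c + log(lam + exp(-delta)).

    delta is a global or local proximity value. The score decreases as
    delta grows and approaches c + log(lam) for very large delta, so lam
    must be positive for the logarithm to stay defined.
    """
    return c + math.log(lam + math.exp(-delta))
```

With these defaults the score falls from log 2 at δ = 0 towards 0 as the keywords move apart, so tightly clustered keywords earn a bounded bonus rather than an unbounded one.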

In the above technical solution, preferably, the position correlation calculation module calculates the position correlation by the following formula:

π(Q, D) = α · {c + \log(λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}})} + (1 - α) · {c + \log(λ + e^{-GeoMeanMinDist(Q, D)})},

where MinExpSpanLen(Q, D) denotes the global proximity and GeoMeanMinDist(Q, D) denotes the local proximity. This technical solution combines the global and local proximity in a reasonable way, ensuring a sound calculation of position correlation, which helps achieve accurate and efficient retrieval.
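The combined formula can be sketched as a weighted sum of the two transformed proximities. The text does not define α or n explicitly; the sketch assumes α is a mixing weight between 0 and 1 and n is the number of query keywords. The function name and all default values are hypothetical.

```python
import math


def combined_position_correlation(min_exp_span_len, geo_mean_min_dist, n,
                                  alpha=0.5, c=0.0, lam=1.0):
    """Weighted combination of the global and local terms.

    min_exp_span_len:  global proximity (minimum extended hit-window length)
    geo_mean_min_dist: local proximity (geometric mean minimum distance)
    n:                 assumed to be the number of query keywords
    alpha:             assumed mixing weight in [0, 1]
    """
    global_term = c + math.log(lam + math.exp(-min_exp_span_len / n))
    local_term = c + math.log(lam + math.exp(-geo_mean_min_dist))
    return alpha * global_term + (1 - alpha) * local_term
```

Dividing the window length by n normalizes the global term by query length, so a long query is not penalized merely for needing a longer window.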

The invention also provides a retrieval method for use in an information retrieval and search engine system, comprising: step 602, in which the position correlation calculation module calculates the position correlation in a document of the plurality of keywords used in a query according to the keywords' global proximity and local proximity in the document and a preset transformation function; and step 604, in which the result generation module sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves both the global and the local proximity measure; a reasonable position correlation can be calculated from the improved measures, enabling more accurate and efficient retrieval.

In the above technical solution, preferably, in step 602 the position correlation calculation module calculates the position correlation by the following formula:

π(Q, D) = α · {c + \log(λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}})} + (1 - α) · {c + \log(λ + e^{-GeoMeanMinDist(Q, D)})}.

The above technical solutions provide a retrieval device and a retrieval method that address the shortcomings of the two current mainstream position correlation methods (the hit-window length method and the word-pair distance method) with a new position correlation scheme (an improved hit-window length method), further improving retrieval quality while maintaining high retrieval efficiency.

Description of drawings

Fig. 1 is a block diagram of a retrieval device according to an embodiment of the invention;

Fig. 2 is a flowchart of a retrieval method according to an embodiment of the invention;

Fig. 3 is a block diagram of a retrieval device according to an embodiment of the invention;

Fig. 4 is a flowchart of a retrieval method according to an embodiment of the invention;

Fig. 5 is a block diagram of a retrieval device according to an embodiment of the invention;

Fig. 6 is a flowchart of a retrieval method according to an embodiment of the invention;

Fig. 7 is a flowchart of a retrieval method according to an embodiment of the invention;

Fig. 8 is a flowchart of calculating the number of inversion pairs in a hit window in a retrieval method according to an embodiment of the invention.

Embodiment

To make the above objects, features and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.

Many specific details are set forth in the following description to provide a thorough understanding of the invention. The invention can, however, also be implemented in ways other than those described here, and it is therefore not limited to the specific embodiments disclosed below.

Fig. 1 is a block diagram of a retrieval device according to an embodiment of the invention.

As shown in Fig. 1, the invention provides a retrieval device 100 for use in an information retrieval and search engine system, comprising: a minimum hit window acquisition module 102, which obtains the minimum hit window in a document of the plurality of keywords used in a query; a global proximity calculation module 104, which calculates the extended hit-window length of the minimum hit window from the hit-window length of the minimum hit window and the number of inversion pairs of the keywords within it, and takes this value as the global proximity of the keywords; a position correlation calculation module 106, which calculates the position correlation of the keywords in the document according to the global proximity; and a result generation module 108, which sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the global proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, the global proximity calculation module 104 calculates the extended hit-window length of the minimum hit window by the formula ExpSpanLen(Q, D) = OriSpanLen + ε · InvNum, where D denotes the document, Q denotes the plurality of keywords, OriSpanLen denotes the hit-window length of a given hit window, InvNum denotes the number of inversion pairs in that hit window, ε denotes a preset value, and ExpSpanLen(Q, D) denotes the extended hit-window length of that hit window. This technical solution defines the extended hit-window length in a reasonable way, which helps achieve accurate and efficient retrieval.

Fig. 2 is a flowchart of a retrieval method according to an embodiment of the invention.

As shown in Fig. 2, the invention also provides a retrieval method for use in an information retrieval and search engine system, comprising: step 202, in which the minimum hit window acquisition module obtains the minimum hit window in a document of the plurality of keywords used in a query; step 204, in which the global proximity calculation module calculates the extended hit-window length of the minimum hit window from the hit-window length of the minimum hit window and the number of inversion pairs of the keywords within it, and takes this value as the global proximity of the keywords; step 206, in which the position correlation calculation module calculates the position correlation of the keywords in the document according to the global proximity; and step 208, in which the result generation module sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the global proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, in step 204 the global proximity calculation module calculates the extended hit-window length of the minimum hit window by the formula ExpSpanLen(Q, D) = OriSpanLen + ε · InvNum, where D denotes the document, Q denotes the plurality of keywords, OriSpanLen denotes the hit-window length of a given hit window, InvNum denotes the number of inversion pairs in that hit window, ε denotes a preset value, and ExpSpanLen(Q, D) denotes the extended hit-window length of that hit window. This technical solution defines the extended hit-window length in a reasonable way, which helps achieve accurate and efficient retrieval.

Fig. 3 is a block diagram of a retrieval device according to an embodiment of the invention.

As shown in Fig. 3, the invention also provides a retrieval device 300 for use in an information retrieval and search engine system, comprising: a minimum distance calculation module 302, which calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query; a local proximity calculation module 304, which calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the keywords in the document and takes it as the local proximity of the keywords; a position correlation calculation module 306, which calculates the position correlation of the keywords in the document according to the local proximity; and a result generation module 308, which sorts the documents according to the position correlation and generates the retrieval results. This technical solution improves the local proximity measure; a reasonable position correlation can be calculated from the improved measure, enabling more accurate and efficient retrieval.

In the above technical solution, the local proximity calculation module 304 calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \sqrt[\frac{2}{n(n - 1)}]{\prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D)}.

In the above technical solution, the minimum distance calculation module 302 selects the minimum distance of a keyword pair from among the pair's several distances, where, when the keyword pair occurs in reverse order, the distance of the pair is calculated by the formula Dist(t_{i}, t_{j}) = Dist(t_{j}, t_{i}) + ρ, where i < j, ρ denotes a preset value, and Dist(t_{i}, t_{j}) denotes the distance of the keyword pair. This technical solution puts word-pair distances in forward and reverse order on a common scale for the minimum distance calculation, ensuring that the minimum distance is selected accurately.

Fig. 4 is a flowchart of a retrieval method according to an embodiment of the present invention.

As shown in Fig. 4, the present invention further provides a retrieval method for use in an information retrieval and search engine system, comprising: step 402, in which a minimum distance calculation module calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query; step 404, in which a local adjacency calculation module calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the plurality of keywords in the document as the local adjacency of the plurality of keywords; step 406, in which a position correlation calculation module calculates the position correlation of the plurality of keywords in the document according to the local adjacency; and step 408, in which a result generation module sorts the documents according to the position correlation and generates a retrieval result. This technical solution improves the local adjacency; a reasonable position correlation can be calculated from the improved local adjacency, so that retrieval is more accurate and efficient.

In the above technical solution, in step 404, the local adjacency calculation module calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \left( \prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D) \right)^{\frac{2}{n(n-1)}}.

In the above technical solution, in step 402, the minimum distance calculation module selects the minimum distance of a keyword pair from the plurality of distances of that keyword pair, wherein, when the keyword pair is an inversion pair, the distance of the keyword pair is calculated according to the following formula: Dist(t_i, t_j) = Dist(t_j, t_i) + ρ, where i < j, ρ represents a preset value, and Dist(t_i, t_j) represents the distance of the keyword pair. This technical solution unifies the word-pair distances of forward-order and reverse-order pairs so that all of them can be used in calculating the minimum distance, which ensures that the minimum distance is selected accurately.

Fig. 5 is a block diagram of a retrieval device according to an embodiment of the present invention.

As shown in Fig. 5, the present invention further provides a retrieval device 500 for use in an information retrieval and search engine system, comprising: a position correlation calculation module 502, which calculates the position correlation in a document of the plurality of keywords used in a query according to the global adjacency and the local adjacency of the plurality of keywords in the document and a preset transfer function; and a result generation module 504, which sorts the documents according to the position correlation and generates a retrieval result. This technical solution improves both the global adjacency and the local adjacency; a reasonable position correlation can be calculated from the improved global adjacency and local adjacency, so that retrieval is more accurate and efficient.

In the above technical solution, the transfer function comprises: π(Q, D) = c + \log(λ + e^{-δ(Q, D)}) and/or π(Q, D) = \frac{1}{δ(Q, D)^{x}} (x \geq 1).

In the above technical solution, the position correlation calculation module 502 calculates the position correlation by the following formula:

π(Q, D) = α \cdot \left[ c + \log\left( λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}} \right) \right] + (1 - α) \cdot \left[ c + \log\left( λ + e^{-GeoMeanMinDist(Q, D)} \right) \right].

Fig. 6 is a flowchart of a retrieval method according to an embodiment of the present invention.

As shown in Fig. 6, the present invention further provides a retrieval method for use in an information retrieval and search engine system, comprising: step 602, in which a position correlation calculation module calculates the position correlation in a document of the plurality of keywords used in a query according to the global adjacency and the local adjacency of the plurality of keywords in the document and a preset transfer function; and step 604, in which a result generation module sorts the documents according to the position correlation and generates a retrieval result. This technical solution improves both the global adjacency and the local adjacency; a reasonable position correlation can be calculated from the improved global adjacency and local adjacency, so that retrieval is more accurate and efficient.

In the above technical solution, in step 602, the position correlation calculation module calculates the position correlation by the following formula:

π(Q, D) = α \cdot \left[ c + \log\left( λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}} \right) \right] + (1 - α) \cdot \left[ c + \log\left( λ + e^{-GeoMeanMinDist(Q, D)} \right) \right].

The principle of the technical solution of the embodiment of the present invention is described in detail below.

The position correlation method of this embodiment combines global adjacency and local adjacency.

Global adjacency is based on the length of the query hit window in the document and on the number of inversion pairs. A query hit window is defined as a document fragment containing all keywords of the query. For each query hit window, the corresponding window length and the number of inversion pairs relative to the query are calculated. For example, given the document d = {t1, t4, t3, t5, t2, t1, t3, t4} and the query q = {t1, t2, t4}, the first query hit window is [t1, t4, t3, t5, t2]; its window length is 5 and its number of inversion pairs is 1 (only <t4, t2> is an inversion pair). Here, the global adjacency (that is, the extended query hit window length) is defined as a weighted sum of the original window length and the number of inversion pairs:

ExpSpanLen(Q, D) = OriSpanLen + ε \cdot InvNum \qquad (1)

where OriSpanLen is the original window length (5 in the example), InvNum is the number of inversion pairs (1 in the example), and ε is the penalty factor for inversion pairs. After the complete document has been scanned, the minimum extended hit window length is obtained. In this way the position correlation takes into account not only the proximity of the keywords but also their word order, and can better judge the semantic relevance of query and document. For example, let the query string be "student of Peking University" (segmented as "Beijing/university/student"), let the hit window occurring in document 1 be "Beijing/university/of/student", and let the hit window occurring in document 2 be "university/of/Beijing/student". If only the hit window length is considered, the position correlation of the two documents is identical (both minimum hit window lengths are 4). Once the number of inversion pairs is considered, however, the position correlation of document 1 is higher than that of document 2 (the former has 0 inversion pairs, the latter has 1).
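For illustration only, formula (1) can be sketched as follows. The function names, the value ε = 1.0, the token "of" standing for the particle in the segmented text, and the simplification that each query keyword occurs once in the window are all assumptions made for this sketch; the inversion count here uses a simple quadratic loop rather than the merge-sort method described later.

```python
def count_inversions(order):
    """Count pairs (i, j) with i < j and order[i] > order[j] (quadratic; fine for short queries)."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )


def exp_span_len(window, query, epsilon=1.0):
    """Formula (1): original window length plus epsilon times the number of inversion pairs.

    epsilon=1.0 is an assumed penalty factor; the source leaves its value open.
    """
    rank = {term: idx for idx, term in enumerate(query)}
    # Query-order ranks of the keywords, in the order they appear in the window.
    order = [rank[tok] for tok in window if tok in rank]
    return len(window) + epsilon * count_inversions(order)


query = ["Beijing", "university", "student"]
doc1_window = ["Beijing", "university", "of", "student"]
doc2_window = ["university", "of", "Beijing", "student"]
print(exp_span_len(doc1_window, query))  # no inversion pairs
print(exp_span_len(doc2_window, query))  # one inversion pair, so a larger value
```

With any positive ε, the window of document 2 receives a larger extended length than that of document 1, which is the ordering the text describes.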

Local adjacency is based on the geometric mean minimum distance, in the document, of the keyword pairs of the query. While scanning for query hit windows, the distance of each pair of "adjacent" keywords in the document (that is, no other query keyword occurs between the two keywords) can be recorded. In the above example, when the first query hit window is found, dist(t1, t4) = 1 and dist(t4, t2) = 3 can be calculated. After the scan completes, the minimum distance of every keyword pair is obtained. Local adjacency is defined as the geometric mean of the minimum word-pair distances:

GeoMeanMinDist(Q, D) = \left( \prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D) \right)^{\frac{2}{n(n-1)}} \qquad (2)

where n is the number of keywords in the query, and MinDist(t_1, t_2; D) is the minimum distance of the word pair (t_1, t_2) in document D.
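As a minimal sketch of formula (2), assuming the minimum distances have already been collected into a dictionary with one entry per unordered keyword pair (the dictionary layout, the function name, and the sample distances 1, 3 and 9 are illustrative assumptions):

```python
import math


def geo_mean_min_dist(min_dist, n):
    """Formula (2): geometric mean of the minimum distances over the n(n-1)/2 keyword pairs."""
    if n < 2:
        raise ValueError("at least two keywords are needed to form a pair")
    num_pairs = n * (n - 1) // 2
    if len(min_dist) != num_pairs:
        raise ValueError("expected one minimum distance per unordered keyword pair")
    # Raising the product to 2 / (n(n-1)) is the same as taking the num_pairs-th root.
    return math.prod(min_dist.values()) ** (1.0 / num_pairs)


# Hypothetical minimum distances for a three-keyword query.
sample = {("t1", "t2"): 9, ("t1", "t4"): 1, ("t2", "t4"): 3}
print(geo_mean_min_dist(sample, 3))
```

A smaller value means the keyword pairs lie closer together in the document.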

Note that word order can likewise be taken into account when calculating word-pair distances. In practice, only the distances of forward word pairs are recorded, which in the above example are (t1, t2), (t2, t4) and (t1, t4); if an inversion pair occurs in the document, its distance is converted into the distance of the corresponding forward word pair:

dist(t_i, t_j) = dist(t_j, t_i) + ρ \qquad (3)

where i < j and ρ is a penalty factor. In addition, if a word pair never occurs "adjacently" in the document, the corresponding word-pair distance is set to a default maximum value (usually proportional to the query length n). The geometric mean minimum distance complements the global adjacency well and can further improve the effectiveness of the position correlation. Consider again the query string "student of Peking University" (segmented as "Beijing/university/student"), where the hit window occurring in document 3 is "Beijing/university/of/outstanding/student" and the hit window occurring in document 4 is "Beijing/science and engineering/university/of/student". If only the extended hit window length is considered, the position correlation of the two documents is identical (both minimum hit window lengths are 5, and both numbers of inversion pairs are 0). However, their geometric mean minimum distances differ: the former is ∛(1 × 3 × 8) ≈ 2.88, which is less than the latter's ∛(2 × 2 × 8) ≈ 3.17 (assuming that 8 is the maximum word-pair distance; because only adjacent word-pair distances are calculated, the distance between "Beijing" and "student" is calculated in neither document and is set to the maximum value 8). It should be noted that, for reasons of efficiency, only the distances between adjacent keywords (no other query word occurs between them) are counted; in application scenarios where efficiency requirements are less strict, the distances of all keyword pairs in the document may also be counted.
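The adjacent-pair bookkeeping and the inversion penalty of formula (3) can be sketched as follows for the two example windows. The function name, ρ = 1.0, the token "of" standing for the particle in the segmented text, and the assumption that query keywords are distinct are illustrative choices; the default maximum of 8 is the value assumed in the example above.

```python
from itertools import combinations


def min_adjacent_distances(doc, query, rho=1.0, max_dist=8.0):
    """Minimum distance per forward keyword pair, counting only adjacent query keywords."""
    rank = {term: idx for idx, term in enumerate(query)}
    # Pairs that never occur adjacently keep the default maximum distance.
    min_dist = {pair: max_dist for pair in combinations(range(len(query)), 2)}
    prev = None  # (position, rank) of the previous query keyword occurrence
    for pos, tok in enumerate(doc):
        if tok not in rank:
            continue
        cur = rank[tok]
        if prev is not None and prev[1] != cur:
            dist = pos - prev[0]
            if prev[1] > cur:
                dist += rho  # formula (3): an inversion pair is penalised by rho
            pair = (min(prev[1], cur), max(prev[1], cur))
            min_dist[pair] = min(min_dist[pair], dist)
        prev = (pos, cur)
    return min_dist


query = ["Beijing", "university", "student"]
doc3 = ["Beijing", "university", "of", "outstanding", "student"]
doc4 = ["Beijing", "science and engineering", "university", "of", "student"]

for doc in (doc3, doc4):
    dists = min_adjacent_distances(doc, query)
    product = 1.0
    for value in dists.values():
        product *= value
    print(dists, product ** (1.0 / len(dists)))
```

For document 3 the recorded distances are 1, 3 and the default 8; for document 4 they are 2, 2 and the default 8, so document 3 has the smaller geometric mean.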

The raw adjacency value still needs to be converted into a position correlation score. Without loss of generality, suppose the adjacency is δ(Q, D) and the final position correlation score is π(Q, D) = f(δ(Q, D)), where f is the transfer function. For position correlation, the transfer function should satisfy the following properties: the smaller the adjacency, the larger the corresponding position correlation score; and as the adjacency increases, the position correlation score changes less and less. Commonly used transfer functions include:

π(Q, D) = c + \log(λ + e^{-δ(Q, D)})

or

π(Q, D) = \frac{1}{δ(Q, D)^{x}} \quad (x \geq 1)

The first transfer function is adopted in the following. In this transfer function, λ is an adjustment factor that controls how fast the position correlation score decreases as the adjacency increases, and c is a normalization factor that ensures that the value of π(Q, D) lies between 0 and 1. Finally, combining the two kinds of adjacency, the position correlation score in the present invention is defined as:

π(Q, D) = α \cdot \left[ c + \log\left( λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}} \right) \right] + (1 - α) \cdot \left[ c + \log\left( λ + e^{-GeoMeanMinDist(Q, D)} \right) \right] \qquad (4)

where α is the weighting factor of the two kinds of adjacency. The value of this position correlation score still lies between 0 and 1, and the score can be combined with other weights, such as the term frequency score and PageRank, to form the overall relevance score.
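A minimal sketch of the first transfer function and of formula (4) follows. The source does not give values for λ, c or α; the choice λ = 1/(e − 1) with c = −log λ is an assumption made here because it maps adjacency 0 to a score of 1 and lets the score approach 0 as the adjacency grows, and α = 0.5 is likewise only illustrative.

```python
import math

# Assumed constants: chosen so that transfer(0) == 1 and transfer(delta) -> 0 as delta grows.
LAMBDA = 1.0 / (math.e - 1.0)
C = -math.log(LAMBDA)


def transfer(delta, lam=LAMBDA, c=C):
    """First transfer function: pi = c + log(lambda + exp(-delta))."""
    return c + math.log(lam + math.exp(-delta))


def position_score(min_exp_span_len, geo_mean_min_dist, n, alpha=0.5):
    """Formula (4): weighted sum of the global and the local adjacency scores."""
    global_part = transfer(min_exp_span_len / n)
    local_part = transfer(geo_mean_min_dist)
    return alpha * global_part + (1.0 - alpha) * local_part


# Documents 3 and 4 from the example: same extended window length, different local adjacency.
score_doc3 = position_score(5, 24 ** (1 / 3), n=3)
score_doc4 = position_score(5, 32 ** (1 / 3), n=3)
print(score_doc3, score_doc4)
```

Because the transfer function is decreasing in the adjacency, document 3, with the smaller geometric mean minimum distance, receives the higher score under these assumed parameters.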

Fig. 7 is a flowchart of a retrieval method according to an embodiment of the present invention.

As shown in Fig. 7, the specific flow is as follows:

Step 702, obtain the position lists of the keywords in the current document and initialize the minimum hit window length (MinSpanLen is set to the default value MAX_SPAN).

Step 704, read the hit position of the next keyword in order.

Step 706, update the keyword latest-position table. To calculate the hit window length efficiently, the latest position of each query keyword that has occurred (that is, the position at which the keyword last occurred in the document during the scan) must be recorded.

Step 708, update the word-pair minimum distance matrix: calculate the distance dist between the previous keyword t_i and the current keyword t_j; if it is smaller than the corresponding entry minDist[i, j] in the minimum distance matrix, set minDist[i, j] to dist.

Step 710, judge whether a hit window has been formed, that is, whether the window contains all query keywords; if so, proceed to step 712; otherwise, jump to step 718.

Step 712, judge whether the left boundary of the hit window has slid (that is, whether the leftmost keyword of the previous hit window is the current keyword). If it has slid, enter step 714; otherwise, enter step 718.

Step 714, calculate the length CurSpanLen of the current hit window (obtained by subtracting the minimum position from the maximum position in the recorded keyword latest-position table).

Step 716, if the current window length is less than the minimum window length MinSpanLen, update MinSpanLen.

Step 718, judge whether the keyword positions have all been scanned; if all keyword position lists have been scanned, proceed to step 720; otherwise, return to step 704.

Step 720, call the sub-flow that calculates the number of inversion pairs of a hit window to obtain the number of inversion pairs InvNum of the minimum hit window.

Step 722, calculate the extended hit window length by formula (1); using the minimum distance matrix, calculate the geometric mean of the minimum word-pair distances by formula (2); finally, calculate the final position correlation score by formula (4).
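The window-scanning part of the flow (steps 702 to 718) can be sketched as follows. This is a simplified illustration: it recomputes the left boundary at every keyword hit instead of only when the boundary slides, it omits the distance-matrix update of step 708, and it counts the window length inclusively (maximum position minus minimum position plus one) so as to match the length of 5 given for the first window in the earlier example. The function name and return format are assumptions.

```python
def min_hit_window(doc, query):
    """Return (length, start, end) of the shortest window containing every query keyword.

    Returns None when the document does not contain all query keywords.
    """
    needed = set(query)
    latest = {}   # latest position seen so far for each query keyword (step 706)
    best = None   # plays the role of MinSpanLen, initially "unset" instead of MAX_SPAN
    for pos, tok in enumerate(doc):
        if tok not in needed:
            continue
        latest[tok] = pos
        if len(latest) < len(needed):
            continue  # no hit window formed yet (step 710)
        start = min(latest.values())
        length = pos - start + 1  # inclusive length of the current hit window (step 714)
        if best is None or length < best[0]:
            best = (length, start, pos)  # step 716
    return best


doc = ["t1", "t4", "t3", "t5", "t2", "t1", "t3", "t4"]
print(min_hit_window(doc, ["t1", "t2", "t4"]))
```

For the example document the first hit window spans positions 0 to 4 (length 5), and the scan later finds the shorter window spanning positions 4 to 7.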

As shown in Fig. 8, the number of inversion pairs is calculated recursively using a method based on merge sort.

The specific flow comprises:

Step 802, take the midpoint m of the keyword subscript array a[0 ... r], and set InvNum = 0;

Step 804, recursively calculate the number of inversion pairs InvNum1 of a[0 ... m], InvNum += InvNum1;

Step 806, recursively calculate the number of inversion pairs InvNum2 of a[m+1 ... r], InvNum += InvNum2;

Step 808, merge and compare a[0 ... m] and a[m+1 ... r], calculate the number of cross inversion pairs MergeInvNum, InvNum += MergeInvNum.
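Steps 802 to 808 correspond to the standard merge-sort inversion count, which can be sketched as follows. The function name and the choice to return the sorted copy together with the count are assumptions of this sketch; the input is the list of query-order subscripts of the keywords as they appear in the hit window.

```python
def count_inversions(a):
    """Return (sorted copy of a, number of inversion pairs) using merge sort."""
    if len(a) <= 1:
        return list(a), 0
    m = len(a) // 2                                   # step 802: split at the midpoint
    left, inv_left = count_inversions(a[:m])          # step 804: left half
    right, inv_right = count_inversions(a[m:])        # step 806: right half
    merged, inv_merge = [], 0
    i = j = 0
    while i < len(left) and j < len(right):           # step 808: merge and count
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            # Every element still waiting in the left half exceeds right[j].
            inv_merge += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv_left + inv_right + inv_merge


# Subscripts of t1, t4, t2 in the query (t1, t2, t4), taken in window order.
print(count_inversions([0, 2, 1])[1])
```

For the first hit window of the earlier example, the subscript sequence is [0, 2, 1], which contains the single inversion pair <t4, t2>.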

The time complexity of this method is O(n log n), where n is the number of keywords in the query. The time complexity of the scanning algorithm for the minimum hit window is O(k), where k is the number of query word occurrences contained in the document, so the total time complexity is O(k + n log n). Because queries are usually short (statistics show that the average query length is between 2 and 3), this time complexity is essentially equivalent to O(k), that is, almost linear. It should be pointed out that this algorithm first scans all keyword positions to obtain the minimum hit window and only then calculates the number of inversion pairs for that minimum hit window, so the minimum extended hit window obtained in this way is not necessarily optimal. Obtaining the optimal minimum extended hit window requires, for the best algorithm, a time complexity of O(nk), which greatly reduces efficiency (particularly for long query strings); adopting the approximate algorithm is therefore a good compromise.

The present invention thus provides a retrieval device and a retrieval method whose position correlation method combines global adjacency and local adjacency and the advantages of both. The key points and advantages of the present invention are:

(1) It takes into account not only the overall and the local hit situation of the query in the document but also the degree to which the keywords in the hit window agree with the query in word order, so that the position correlation better reflects the semantic relevance of document and query, further improving retrieval quality and the user experience.

(2) Although more information is considered, the present invention still has high computational efficiency, which guarantees the query response time.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Claims

1. A retrieval device, characterized by comprising:

a minimum hit window acquisition module, which acquires the minimum hit window in a document of a plurality of keywords used in a query;

a global adjacency calculation module, which calculates the extended hit window length of the minimum hit window, as the global adjacency of the plurality of keywords, according to the hit window length of the minimum hit window and the number of inversion pairs of the plurality of keywords in the minimum hit window;

a position correlation calculation module, which calculates the position correlation of the plurality of keywords in the document according to the global adjacency; and

a result generation module, which sorts the documents according to the position correlation and generates a retrieval result.

2. The retrieval device according to claim 1, characterized in that the global adjacency calculation module calculates the extended hit window length of the minimum hit window by the following formula:

ExpSpanLen(Q, D) = OriSpanLen + ε \cdot InvNum, where D represents the document, Q represents the plurality of keywords, OriSpanLen represents the hit window length of a predetermined hit window, InvNum represents the number of inversion pairs of the predetermined hit window, ε represents a preset value, and ExpSpanLen(Q, D) represents the extended hit window length of the predetermined hit window.

3. A retrieval method, characterized by comprising:

step 202, in which a minimum hit window acquisition module acquires the minimum hit window in a document of a plurality of keywords used in a query;

step 204, in which a global adjacency calculation module calculates the extended hit window length of the minimum hit window, as the global adjacency of the plurality of keywords, according to the hit window length of the minimum hit window and the number of inversion pairs of the plurality of keywords in the minimum hit window;

step 206, in which a position correlation calculation module calculates the position correlation of the plurality of keywords in the document according to the global adjacency; and

step 208, in which a result generation module sorts the documents according to the position correlation and generates a retrieval result.

4. The retrieval method according to claim 3, characterized in that, in step 204, the global adjacency calculation module calculates the extended hit window length of the minimum hit window by the following formula: ExpSpanLen(Q, D) = OriSpanLen + ε \cdot InvNum.

5. A retrieval device, characterized by comprising:

a minimum distance calculation module, which calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query;

a local adjacency calculation module, which calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the plurality of keywords in the document as the local adjacency of the plurality of keywords;

a position correlation calculation module, which calculates the position correlation of the plurality of keywords in the document according to the local adjacency;

6. The retrieval device according to claim 5, characterized in that the local adjacency calculation module calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \left( \prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D) \right)^{\frac{2}{n(n-1)}},

where D represents the document, Q represents the plurality of keywords, t_1, t_2 represent a keyword pair, the keyword pair being an adjacent word pair, MinDist(t_1, t_2; D) represents the minimum distance of t_1, t_2 in D, and GeoMeanMinDist(Q, D) represents the geometric mean minimum distance.

7. The retrieval device according to claim 5 or 6, characterized in that the minimum distance calculation module selects the minimum distance of a keyword pair from the plurality of distances of that keyword pair, wherein, when the keyword pair is an inversion pair, the distance of the keyword pair is calculated according to the following formula:

Dist(t_i, t_j) = Dist(t_j, t_i) + ρ, where i < j, ρ represents a preset value, and Dist(t_i, t_j) represents the distance of the keyword pair.

8. A retrieval method, characterized by comprising:

step 402, in which a minimum distance calculation module calculates the minimum distance in a document of each keyword pair among the plurality of keywords used in a query;

step 404, in which a local adjacency calculation module calculates, from the minimum distances of the keyword pairs, the geometric mean minimum distance of the plurality of keywords in the document as the local adjacency of the plurality of keywords;

step 406, in which a position correlation calculation module calculates the position correlation of the plurality of keywords in the document according to the local adjacency; and

step 408, in which a result generation module sorts the documents according to the position correlation and generates a retrieval result.

9. The retrieval method according to claim 8, characterized in that, in step 404, the local adjacency calculation module calculates the geometric mean minimum distance according to the following formula:

GeoMeanMinDist(Q, D) = \left( \prod_{t_{1}, t_{2} \in Q \cap D,\ t_{1} \neq t_{2}} MinDist(t_{1}, t_{2}; D) \right)^{\frac{2}{n(n-1)}}.

10. The retrieval method according to claim 8 or 9, characterized in that, in step 402, the minimum distance calculation module selects the minimum distance of a keyword pair from the plurality of distances of that keyword pair, wherein, when the keyword pair is an inversion pair, the distance of the keyword pair is calculated according to the following formula: Dist(t_i, t_j) = Dist(t_j, t_i) + ρ, where i < j, ρ represents a preset value, and Dist(t_i, t_j) represents the distance of the keyword pair.

11. A retrieval device, characterized by comprising:

a position correlation calculation module, which calculates the position correlation in a document of the plurality of keywords used in a query according to the global adjacency and the local adjacency of the plurality of keywords in the document and a preset transfer function;

12. The retrieval device according to claim 11, characterized in that the transfer function comprises:

π(Q, D) = c + \log(λ + e^{-δ(Q, D)}) and/or π(Q, D) = \frac{1}{δ(Q, D)^{x}} (x \geq 1), where D represents the document, Q represents the plurality of keywords, δ(Q, D) represents the global adjacency or the local adjacency, c represents a preset value, λ represents a preset value, and π(Q, D) represents the position correlation.

13. The retrieval device according to claim 12, characterized in that the position correlation calculation module calculates the position correlation by the following formula:

π(Q, D) = α \cdot \left[ c + \log\left( λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}} \right) \right] + (1 - α) \cdot \left[ c + \log\left( λ + e^{-GeoMeanMinDist(Q, D)} \right) \right],

where MinExpSpanLen(Q, D) represents the global adjacency and GeoMeanMinDist(Q, D) represents the local adjacency.

14. A retrieval method, characterized by comprising:

step 602, in which a position correlation calculation module calculates the position correlation in a document of the plurality of keywords used in a query according to the global adjacency and the local adjacency of the plurality of keywords in the document and a preset transfer function; and

step 604, in which a result generation module sorts the documents according to the position correlation and generates a retrieval result.

15. The retrieval method according to claim 14, characterized in that the transfer function comprises:

π(Q, D) = c + \log(λ + e^{-δ(Q, D)}) and/or π(Q, D) = \frac{1}{δ(Q, D)^{x}} (x \geq 1),

where D represents the document, Q represents the plurality of keywords, δ(Q, D) represents the global adjacency or the local adjacency, c represents a preset value, λ represents a preset value, and π(Q, D) represents the position correlation.

16. The retrieval method according to claim 14, characterized in that, in step 602, the position correlation calculation module calculates the position correlation by the following formula:

π(Q, D) = α \cdot \left[ c + \log\left( λ + e^{-\frac{MinExpSpanLen(Q, D)}{n}} \right) \right] + (1 - α) \cdot \left[ c + \log\left( λ + e^{-GeoMeanMinDist(Q, D)} \right) \right].