WO2012151781A1

WO2012151781A1 - Inverted index intersection method

Info

Publication number: WO2012151781A1
Application number: PCT/CN2011/076841
Authority: WO
Inventors: 刘晓光; 王刚; 敖耐勇; 吴迪; 张帆
Original assignee: 南开大学
Priority date: 2011-05-09
Filing date: 2011-08-01
Publication date: 2012-11-15
Also published as: CN102136011A

Abstract

An inverted index intersection method. The method includes: in pre-processing, for each inverted list, constructing a two-dimensional scatter plot with the index of each docID as the abscissa and its value as the ordinate; generating a linear regression line by the least squares method, so that the sum of squares of the vertical deviations from all points in the plot to the line is minimized; deriving a left safe search distance and a right safe search distance; and saving the derived linear regression information. During inverted index intersection, the safe search range of the docID to be found in an inverted list is determined from the saved linear regression information for that list, and an existing search method is then applied within that range. The inverted index intersection method of the present invention can narrow the search range, reduce the search time, shorten the response time of the search engine, and improve the user experience.

Description

Inverted index intersection method

[Technical Field]

The invention belongs to the technical field of inverted indexes, and particularly relates to an inverted index intersection method.

[Background Art]

The most widely used data structure in search engines is the inverted index, which consists of a dictionary and inverted lists. The dictionary maps each keyword one-to-one to an inverted list, and each inverted list consists of a series of basic units called postings. Each posting holds information such as the document identifier (called docID) of a web page containing the corresponding keyword, together with the frequency and positions of the keyword in that page. In the present invention, we assume that each inverted list consists only of a series of docIDs.

Referring to Figure 1, the processing flow of an existing search engine is shown. The specific steps are as follows:

Step S101: Acquire a user query request. The search engine continuously receives user query requests and segments each query to obtain the corresponding keywords.

Step S102: Intersect the inverted lists corresponding to the query request. The inverted lists corresponding to the keywords of the query are found through the dictionary of the inverted index and are then intersected.

Step S103: Return the result of the intersection to the user in a certain manner.

Binary search, interpolation search, and skip-list-based search are the search methods most commonly used in step S102. Step S102 accounts for most of the time of the entire process and is therefore the main target of our optimization.
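As a point of reference, the conventional form of step S102 with binary search can be sketched as follows. This is only an illustrative baseline, not code from the patent: the function names `contains` and `intersect_baseline` are hypothetical, and each inverted list is assumed to be a sorted Python list of docIDs.

```python
from bisect import bisect_left


def contains(sorted_ids, doc_id):
    """Binary search over the whole list: True if doc_id occurs in sorted_ids."""
    pos = bisect_left(sorted_ids, doc_id)
    return pos < len(sorted_ids) and sorted_ids[pos] == doc_id


def intersect_baseline(lists):
    """Intersect sorted docID lists by probing every longer list
    with each docID of the shortest list (hypothetical helper)."""
    ordered = sorted(lists, key=len)      # shortest list first
    shortest, rest = ordered[0], ordered[1:]
    return [d for d in shortest if all(contains(other, d) for other in rest)]


if __name__ == "__main__":
    print(intersect_baseline([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 12, 15, 20]]))
```

Each probe here searches the full length of the target list, which is the cost the present invention aims to reduce.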

[Summary of the Invention]

The object of the present invention is to provide a novel inverted index intersection method based on linear regression, in view of the shortcomings of existing inverted index intersection methods.

The inverted index intersection method provided by the present invention includes:

First, offline preprocessing:

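For each inverted list $l(t)$, construct a two-dimensional scatter plot with the index $i$ of each docID as the abscissa and its value $x_i$ as the ordinate, where $i = 1, 2, \ldots, |l(t)|$, $|l(t)|$ denotes the number of docIDs contained in $l(t)$ with $|l(t)| \ge 2$, and $x_i$ is a non-negative integer. Based on the least squares method, generate the linear regression line $f_t(i) = \alpha_t + \beta_t i$ such that the sum of squares of the vertical deviations from all points in the plot to the line, $\sum_i (x_i - f_t(i))^2$, is the smallest. Find the left safe search distance $L_t = \max_i \left( f_t^{-1}(x_i) - i \right)$ and the right safe search distance $R_t = \max_i \left( i - f_t^{-1}(x_i) \right)$, and save the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$, and $R_t$ (step S201).

Second, inverted index intersection, with the following specific steps:

(1) For a query containing the keywords $t_1, t_2, \ldots, t_k$, where $k$ is a positive integer and $k \ge 2$, and whose corresponding inverted lists $l(t_1), l(t_2), \ldots, l(t_k)$ are arranged so that the numbers of docIDs they contain are in non-descending order, initialize the docID index $i = 1$, the keyword index $j = 2$, and the result set $A = \emptyset$, where $i = 1, 2, \ldots, |l(t_1)|$ and $2 \le j \le k$ (step S401);

(2) According to the linear regression information of $l(t_j)$ saved by the offline preprocessing of the first step, determine the safe search range of the $i$-th element $y_i$ of $l(t_1)$ in $l(t_j)$: $\left[ \max\left(1,\ f_{t_j}^{-1}(y_i) - L_{t_j}\right),\ \min\left(|l(t_j)|,\ f_{t_j}^{-1}(y_i) + R_{t_j}\right) \right]$ (step S402);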

(3) Using an existing search method, determine whether $y_i$ is found within the safe search range determined in step S402 (step S403);

(4) If the result of step S403 is yes, check whether $j < k$ holds (step S404);

(5) If the result of step S404 is yes, increment $j$ by 1 and return to step S402 (step S405);

(6) If the result of step S404 is no, save $y_i$ to the set $A$ and perform step S407 (step S406);

(7) If the result of step S403 is no, perform step S407;

(8) Check whether $i < |l(t_1)|$ holds (step S407);

(9) If the result of step S407 is yes, increment $i$ by 1, set $j = 2$, and return to step S402 (step S408);

(10) If the result of step S407 is no, end the search and take $A$ as the final result set (step S409).

Advantages of the present invention:

The inverted index intersection method based on linear regression can narrow the search range, reduce the search time, and improve the user experience.
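The control flow of steps S401 to S409 in this summary can be illustrated with a minimal Python sketch. It is written under several assumptions that are not part of the patent text: the names `preprocess`, `found_in_safe_range`, and `intersect` are hypothetical, docIDs within a list are assumed strictly increasing, binary search (`bisect_left`) stands in for "an existing search method", the range bounds are rounded to integers with ceiling and floor, and a small tolerance `EPS` is added to guard against floating-point rounding.

```python
import math
from bisect import bisect_left

EPS = 1e-6  # assumed tolerance against floating-point rounding (not in the patent)


def preprocess(doc_ids):
    """Offline step S201: least-squares line x = alpha + beta * i over the
    points (i, x_i), i = 1..n, plus the safe search distances L and R."""
    n = len(doc_ids)
    mean_i, mean_x = (n + 1) / 2, sum(doc_ids) / n
    sxx = sum((i - mean_i) ** 2 for i in range(1, n + 1))
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(doc_ids, 1))
    beta = sxy / sxx
    alpha = mean_x - beta * mean_i
    # offsets[i-1] = f^{-1}(x_i) - i, the horizontal distance to the line
    offsets = [(x - alpha) / beta - i for i, x in enumerate(doc_ids, 1)]
    return alpha, beta, max(offsets), -min(offsets)


def found_in_safe_range(doc_ids, info, y):
    """Steps S402-S403: binary-search y only inside its safe search range."""
    alpha, beta, left, right = info
    p = (y - alpha) / beta                        # predicted 1-based position
    lo = max(1, math.ceil(p - left - EPS))
    hi = min(len(doc_ids), math.floor(p + right + EPS))
    if lo > hi:
        return False
    pos = bisect_left(doc_ids, y, lo - 1, hi)     # 0-based half-open [lo-1, hi)
    return pos < hi and doc_ids[pos] == y


def intersect(lists):
    """Steps S401-S409 for k >= 2 lists, each holding at least two docIDs."""
    if len(lists) < 2 or any(len(l) < 2 for l in lists):
        raise ValueError("need k >= 2 lists with at least two docIDs each")
    ordered = sorted(lists, key=len)              # non-descending length
    infos = [None] + [preprocess(l) for l in ordered[1:]]  # normally saved offline
    first, k, result = ordered[0], len(ordered), []
    i, j = 1, 2                                   # S401
    while True:
        y = first[i - 1]
        if found_in_safe_range(ordered[j - 1], infos[j - 1], y):  # S402, S403
            if j < k:                             # S404
                j += 1                            # S405
                continue
            result.append(y)                      # S406
        if i < len(first):                        # S407
            i, j = i + 1, 2                       # S408
        else:
            return result                         # S409


if __name__ == "__main__":
    print(intersect([[1, 3, 5, 7, 9, 11], [3, 4, 5, 9, 10], [0, 3, 9, 12, 15, 20, 22]]))
```

In a real system the regression information would be computed once offline and stored with each list; it is recomputed inside `intersect` here only to keep the sketch self-contained.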

[Description of the Drawings]

Figure 1 is a flowchart of search engine processing.

Figure 2 shows an embodiment of the preprocessing method of the inverted index intersection method of the present invention.

Figure 3 is a schematic diagram of the principle of the inverted index intersection method of the present invention.

Figure 4 is a flowchart of an embodiment of the inverted index intersection method of the present invention.

Figure 5 shows the average goodness of fit and average shrinkage rate on different inverted index data sets.

Figure 6 shows the response times of binary search and of the inverted index intersection method of the present invention on the GOV data set.

[Detailed Description]

The present invention will be further described in detail below with reference to the drawings and specific embodiments.

Example 1

Referring to FIG. 2, an embodiment of the preprocessing method of the inverted index intersection method of the present invention is shown. The specific step is as follows:

Step S201: For each inverted list $l(t)$, construct a two-dimensional scatter plot with the index $i$ of each docID as the abscissa and its value $x_i$ as the ordinate, where $i = 1, 2, \ldots, |l(t)|$ ($|l(t)|$ denotes the number of docIDs contained in $l(t)$, and $|l(t)| \ge 2$) and $x_i$ is a non-negative integer. Based on the least squares method, generate the linear regression line $f_t(i) = \alpha_t + \beta_t i$ such that the sum of squares of the vertical deviations from all points in the plot to the line, $\sum_i (x_i - f_t(i))^2$, is the smallest. Find the left safe search distance $L_t = \max_i \left( f_t^{-1}(x_i) - i \right)$ and the right safe search distance $R_t = \max_i \left( i - f_t^{-1}(x_i) \right)$, and save the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$, and $R_t$.

Define

$$R^2 = \frac{\sum_i \left( f_t(i) - \bar{x} \right)^2}{\sum_i \left( x_i - \bar{x} \right)^2},$$

where $\bar{x}$ is the mean of the values $x_i$. $R^2$ is called the goodness of fit; obviously $0 \le R^2 \le 1$. It is the square of the sample correlation coefficient between the dependent variable of the regression (the docID value) and the independent variable (the index). Since the correlation coefficient measures the degree of linear correlation between two quantities, the closer $R^2$ is to 1, the better the regression equation fits the data and the more linear the test data are.

Referring to FIG. 3, the basic principle of the inverted index intersection method based on this preprocessing method is shown. Given the inverted list $l(t)$ and its linear regression line $x = f_t(i) = \alpha_t + \beta_t i$, draw two lines parallel to the regression line, one at horizontal distance $L_t$ to its left and one at horizontal distance $R_t$ to its right. Obviously, all points $(i, x_i)$ of $l(t)$ lie between the two parallel lines. That is, to search for a docID $y$ in $l(t)$, it suffices to search, starting from $f_t^{-1}(y)$, no more than $L_t$ to the left and no more than $R_t$ to the right in order to determine whether $y$ is in $l(t)$. Considering the left and right boundaries $1$ and $|l(t)|$ of $l(t)$ itself, we get the final search range $\left[ \max\left(1,\ f_t^{-1}(y) - L_t\right),\ \min\left(|l(t)|,\ f_t^{-1}(y) + R_t\right) \right]$.
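The preprocessing of Example 1 and the resulting safe search range can be illustrated with a short Python sketch. The names `fit_list` and `safe_range` are hypothetical, docIDs are assumed strictly increasing (so the slope is positive and the line can be inverted), the bounds are rounded to integer positions with ceiling and floor, and the tolerance `EPS` is an added assumption to absorb floating-point rounding; none of these details are specified in the text above.

```python
import math

EPS = 1e-6  # assumed tolerance against floating-point rounding


def fit_list(doc_ids):
    """Step S201 for one list: fit x = alpha + beta * i by least squares over
    the points (i, x_i), i = 1..n, and derive the safe distances L and R."""
    n = len(doc_ids)
    if n < 2:
        raise ValueError("the method assumes at least two docIDs per list")
    mean_i = (n + 1) / 2
    mean_x = sum(doc_ids) / n
    sxx = sum((i - mean_i) ** 2 for i in range(1, n + 1))
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(doc_ids, 1))
    beta = sxy / sxx
    if beta <= 0:
        raise ValueError("docIDs are assumed to be strictly increasing")
    alpha = mean_x - beta * mean_i
    # f^{-1}(x_i): the position the regression line predicts for each docID
    predicted = [(x - alpha) / beta for x in doc_ids]
    left = max(p - i for i, p in enumerate(predicted, 1))    # L_t
    right = max(i - p for i, p in enumerate(predicted, 1))   # R_t
    return alpha, beta, left, right


def safe_range(y, n, alpha, beta, left, right):
    """1-based inclusive range of positions that can hold docID y in a list
    of length n; the range is empty when lo > hi."""
    p = (y - alpha) / beta
    lo = max(1, math.ceil(p - left - EPS))
    hi = min(n, math.floor(p + right + EPS))
    return lo, hi


if __name__ == "__main__":
    ids = [2, 5, 9, 10, 17, 21, 30, 31]
    params = fit_list(ids)
    for pos, doc_id in enumerate(ids, 1):
        print(pos, doc_id, safe_range(doc_id, len(ids), *params))
```

By construction, every docID that is actually in the list lies inside its own range, so restricting the search to that range cannot miss a match.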

Example 2

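Referring to FIG. 4, a flowchart of an embodiment of the inverted index intersection method of the present invention is shown. The specific steps are as follows:

(1) For a query containing the keywords $t_1, t_2, \ldots, t_k$, where $k$ is a positive integer and $k \ge 2$, and whose corresponding inverted lists $l(t_1), l(t_2), \ldots, l(t_k)$ are arranged so that the numbers of docIDs they contain are in non-descending order, initialize the docID index $i = 1$, the keyword index $j = 2$, and the result set $A = \emptyset$, where $i = 1, 2, \ldots, |l(t_1)|$ and $2 \le j \le k$ (step S401);

(2) According to the linear regression information of $l(t_j)$ saved by the offline preprocessing of step S201, determine the safe search range of the $i$-th element $y_i$ of $l(t_1)$ in $l(t_j)$: $\left[ \max\left(1,\ f_{t_j}^{-1}(y_i) - L_{t_j}\right),\ \min\left(|l(t_j)|,\ f_{t_j}^{-1}(y_i) + R_{t_j}\right) \right]$ (step S402);

(3) Using an existing search method, determine whether $y_i$ is found within the safe search range determined in step S402 (step S403);

(4) If the result of step S403 is yes, check whether $j < k$ holds (step S404);

(5) If the result of step S404 is yes, increment $j$ by 1 and return to step S402 (step S405);

(6) If the result of step S404 is no, save $y_i$ to the set $A$ and perform step S407 (step S406);

(7) If the result of step S403 is no, perform step S407;

(8) Check whether $i < |l(t_1)|$ holds (step S407);

(9) If the result of step S407 is yes, increment $i$ by 1, set $j = 2$, and return to step S402 (step S408);

(10) If the result of step S407 is no, end the search and take $A$ as the final result set (step S409).

For an existing search method, the size of the search range in the inverted list $l(t)$ is $|l(t)|$; for the linear-regression-based inverted index intersection method of the present invention, the size of the search range in $l(t)$ is $(L_t + R_t)$. We define $cr = (L_t + R_t) / |l(t)|$ as the shrinkage rate of the inverted list. The smaller $cr$ is, the smaller the safe search range and the faster the linear-regression-based inverted index intersection method of the present invention.

Referring to FIG. 5, the average goodness of fit $R^2$ and the average shrinkage rate $cr$ of inverted lists in different length ranges are shown for various inverted index data sets. The inverted index data sets used are as follows: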

(1) GOV and GOV2 represent data sets crawled from the .gov domain in 2002 and 2004, respectively, and BD represents a data set obtained from Baidu;

(2) GOVPR indicates a data set obtained by rearranging the GOV data set according to PageRank;

(3) GOVR, GOV2R, and BDR represent data sets obtained by randomly rearranging GOV, GOV2, and BD using the Fisher-Yates method, respectively.
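For illustration, the two quantities reported for these data sets, the goodness of fit and the shrinkage rate, could be computed for a single inverted list roughly as follows. The function name `fit_metrics` is hypothetical and the list is assumed to hold at least two strictly increasing docIDs; the sketch uses the regression-sum-of-squares form of the goodness of fit and the ratio (L + R) / n for the shrinkage rate.

```python
def fit_metrics(doc_ids):
    """Return (r2, cr) for one inverted list: the goodness of fit of the
    least-squares line over (i, x_i) and the shrinkage rate (L + R) / n."""
    n = len(doc_ids)
    mean_i = (n + 1) / 2
    mean_x = sum(doc_ids) / n
    sxx = sum((i - mean_i) ** 2 for i in range(1, n + 1))
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(doc_ids, 1))
    beta = sxy / sxx
    alpha = mean_x - beta * mean_i

    # Goodness of fit: explained sum of squares over total sum of squares.
    ss_reg = sum((alpha + beta * i - mean_x) ** 2 for i in range(1, n + 1))
    ss_tot = sum((x - mean_x) ** 2 for x in doc_ids)
    r2 = ss_reg / ss_tot

    # Shrinkage rate: width of the safe search band relative to the list length.
    offsets = [(x - alpha) / beta - i for i, x in enumerate(doc_ids, 1)]
    left, right = max(offsets), -min(offsets)
    cr = (left + right) / n
    return r2, cr


if __name__ == "__main__":
    print(fit_metrics([10, 20, 30, 40, 50]))      # evenly spaced docIDs
    print(fit_metrics([1, 2, 4, 8, 16, 32, 64]))  # strongly non-linear docIDs
```

An evenly spaced list gives a goodness of fit close to 1 and a shrinkage rate close to 0, while a strongly non-linear list scores worse on both.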

It can be seen that good goodness of fit and shrinkage rates are obtained on a variety of different data sets. That is to say, the linear-regression-based inverted index intersection method of the present invention works well on various data sets.

Referring to FIG. 6, the response times of the traditional binary search method and of the linear-regression-based inverted index intersection method of the present invention are shown for inverted index intersection on the GOV data set on the NVIDIA GTX480 platform. It can be seen that the inverted index intersection method of the present invention has a small response time when the computation threshold is large.

The inverted index intersection method of the present invention has been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. The content of this description should not be construed as limiting the present invention.

Claims

1. An inverted index intersection method, characterized in that:

First, offline preprocessing:

For each inverted list $l(t)$, construct a two-dimensional scatter plot with the index $i$ of each docID as the abscissa and its value $x_i$ as the ordinate, where $i = 1, 2, \ldots, |l(t)|$, $|l(t)|$ denotes the number of docIDs contained in $l(t)$ with $|l(t)| \ge 2$, and $x_i$ is a non-negative integer; based on the least squares method, generate the linear regression line $f_t(i) = \alpha_t + \beta_t i$ such that the sum of squares of the vertical deviations from all points in the plot to the line, $\sum_i (x_i - f_t(i))^2$, is the smallest; find the left safe search distance $L_t = \max_i \left( f_t^{-1}(x_i) - i \right)$ and the right safe search distance $R_t = \max_i \left( i - f_t^{-1}(x_i) \right)$; and save the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$, and $R_t$;

Second, inverted index intersection, with the following specific steps:

2.1. For a query containing the keywords $t_1, t_2, \ldots, t_k$, where $k$ is a positive integer and $k \ge 2$, and whose corresponding inverted lists $l(t_1), l(t_2), \ldots, l(t_k)$ are arranged so that the numbers of docIDs they contain are in non-descending order, initialize the docID index $i = 1$, the keyword index $j = 2$, and the result set $A = \emptyset$, where $i = 1, 2, \ldots, |l(t_1)|$ and $2 \le j \le k$;

2.2. According to the linear regression information of $l(t_j)$ saved by the offline preprocessing of the first step, determine the safe search range of the $i$-th element $y_i$ of $l(t_1)$ in $l(t_j)$: $\left[ \max\left(1,\ f_{t_j}^{-1}(y_i) - L_{t_j}\right),\ \min\left(|l(t_j)|,\ f_{t_j}^{-1}(y_i) + R_{t_j}\right) \right]$;

2.3. Using an existing search method, determine whether $y_i$ is found within the safe search range determined in step 2.2;

2.4. If the result of step 2.3 is yes, check whether $j < k$ holds;

2.5. If the result of step 2.4 is yes, increment $j$ by 1 and return to step 2.2;

2.6. If the result of step 2.4 is no, save $y_i$ to the set $A$ and perform step 2.8;

2.7. If the result of step 2.3 is no, proceed to step 2.8;

2.8. Check whether $i < |l(t_1)|$ holds;

2.9. If the result of step 2.8 is yes, increment $i$ by 1, set $j = 2$, and return to step 2.2;

2.10. If the result of step 2.8 is no, end the search and take $A$ as the final result set.