CN102136011A - Reverse index intersection method - Google Patents
Reverse index intersection method
- Publication number
- CN102136011A
- Authority
- CN
- China
- Prior art keywords
- docid
- result
- linear regression
- index
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/328—Management therefor
- G06F16/319—Inverted lists
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an inverted index intersection method. In a preprocessing stage, a two-dimensional scatter diagram is drawn for each inverted list, with the index of each docID as the horizontal coordinate and the value of the docID as the vertical coordinate; a linear regression line is fitted by the least squares method so that the sum of squared vertical deviations between all points in the diagram and the line is minimized; a left safe search distance and a right safe search distance are computed; and the resulting linear regression information is stored. During inverted index intersection, the safe search range of the docID being looked up in an inverted list is determined from the stored linear regression information of that list, and an existing search method is applied within that range. The method narrows the search range and shortens the search time, which shortens the response time of the search engine and improves the user experience.
Description
Technical field
The invention belongs to the technical field of inverted indexes, and in particular relates to an inverted index intersection method.
Background
The most popular data structure in search engines is the inverted index, which consists of two parts: a dictionary and inverted lists. The dictionary establishes a one-to-one correspondence between keywords and inverted lists, and each inverted list consists of a series of basic units called postings. Each posting consists of information such as the document identifier (called the docID) of a web page that contains the corresponding keyword, the term frequency, and the positions. In the present invention, we assume that each inverted list consists only of a series of docIDs.
Referring to Fig. 1, which shows the processing flow of an existing search engine, the specific steps are as follows:
Step S101: obtain the user query request. The search engine continuously receives user query requests, then segments each query into words to obtain the corresponding keywords.
Step S102: intersect the inverted lists corresponding to the query request. The inverted lists of the query keywords are located through the dictionary of the inverted index and are then intersected.
Step S103: return the intersection result to the user in some manner.
Binary search, interpolation search, and skip-list-based search are the search methods most commonly used in step S102. Step S102 consumes a large share of the time in the whole processing flow and is the main target of our optimization.
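As a point of comparison for the search methods named above, the following minimal Python sketch intersects two sorted docID lists by binary search over the whole of the longer list. The function name `intersect_binary` and the use of the standard `bisect` module are illustrative assumptions and do not come from the patent text.

```python
from bisect import bisect_left

def intersect_binary(short_list, long_list):
    """Return the docIDs present in both sorted lists (baseline sketch)."""
    result = []
    for doc_id in short_list:
        # Binary search over the entire longer list for every lookup.
        k = bisect_left(long_list, doc_id)
        if k < len(long_list) and long_list[k] == doc_id:
            result.append(doc_id)
    return result
```

Each lookup here searches the full list; the method described in this patent aims to shrink that range before the search starts.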
Summary of the invention
The object of the invention is to address the high time cost of existing inverted index intersection methods by providing a novel inverted index intersection method based on linear regression.
The inverted index intersection method provided by the invention comprises:
1. Offline preprocessing:
For each inverted list, plot a two-dimensional scatter diagram with the index of each docID as the abscissa and the value of the docID as the ordinate, where the index ranges over the positions in the list (the list length being the number of docIDs the list contains) and every docID is a non-negative integer. Fit a linear regression line by the least squares method, so that the sum of squared vertical deviations from all points in the diagram to the line is minimized. Compute the left safe search distance and the right safe search distance, and store the resulting linear regression information, namely the parameters of the fitted line and the two safe search distances (step S201);
2. Inverted index intersection, with the following specific steps:
(1) For a query containing a positive number of keywords, whose corresponding inverted lists are ordered so that the numbers of docIDs they contain are non-decreasing, initialize the docID index, the keyword index, and the result set (step S401);
(2) Using the linear regression information stored by the offline preprocessing of part 1, determine the safe search range, in the inverted list being searched, of the i-th element being looked up (step S402);
(3) Using some existing search method, determine whether that element is present within the safe search range determined in step S402 (step S403);
(6) If the result of step S404 is no, save the element to the result set and execute step S407 (step S406);
(7) If the result of step S403 is no, execute step S407;
(10) If the result of step S407 is no, end the search and take the result set as the final result set (step S409).
Advantages and beneficial effects of the invention:
The inverted index intersection method based on linear regression narrows the search range, reduces search time, and improves the user experience.
Description of drawings
Fig. 1 is the processing flow chart of a search engine.
Fig. 2 illustrates an example of the preprocessing method of the inverted index intersection method of the invention.
Fig. 3 is a schematic diagram of the principle of the inverted index intersection method of the invention.
Fig. 4 is a flow chart of an embodiment of the inverted index intersection method of the invention.
Fig. 5 shows the average goodness of fit and the average shrinkage ratio on different inverted index data sets.
Fig. 6 shows the response times of binary search and of the inverted index intersection method of the invention on the GOV data set.
Embodiment
To make the above objects, features, and advantages of the invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
Referring to Fig. 2, which illustrates an example of the preprocessing method of the inverted index intersection method of the invention, the specific steps are as follows:
Step S201: for each inverted list, plot a two-dimensional scatter diagram with the index of each docID as the abscissa and the value of the docID as the ordinate, where the index ranges over the positions in the list (the list length being the number of docIDs the list contains) and every docID is a non-negative integer. Fit a linear regression line by the least squares method, so that the sum of squared vertical deviations from all points in the diagram to the line is minimized. Compute the left safe search distance and the right safe search distance, and store the resulting linear regression information, namely the parameters of the fitted line and the two safe search distances.
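The preprocessing of step S201 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patent's own formulation: the original formulas were lost from this text, so the names `fit_line`, `predict_index`, and `safe_distances` are hypothetical, docIDs are assumed to be strictly increasing within a list, and the two distances are rounded up to whole positions so that a later lookup can use integer bounds.

```python
import math

def fit_line(docids):
    """Least-squares line value = intercept + slope * index over (i, docids[i])."""
    n = len(docids)
    if n == 0:
        raise ValueError("empty inverted list")
    if n == 1:
        # A single point fixes no slope; assume slope 1 through that point.
        return 1.0, float(docids[0])
    mean_i = (n - 1) / 2.0
    mean_v = sum(docids) / n
    sxx = sum((i - mean_i) ** 2 for i in range(n))
    sxy = sum((i - mean_i) * (v - mean_v) for i, v in enumerate(docids))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_i
    return slope, intercept

def predict_index(value, slope, intercept):
    """Position at which the regression line takes the given docID value."""
    return (value - intercept) / slope

def safe_distances(docids, slope, intercept):
    """Largest horizontal gaps, left and right, between points and the line."""
    left = right = 0
    for i, v in enumerate(docids):
        p = predict_index(v, slope, intercept)
        left = max(left, math.ceil(p - i))    # true index lies left of p
        right = max(right, math.ceil(i - p))  # true index lies right of p
    return left, right
```

With these values stored, every docID in the list sits at an index between `floor(p) - left` and `ceil(p) + right`, where `p` is the predicted position of that docID.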
Define the goodness of fit of the regression; it is the square of the sample correlation coefficient between the dependent variable Y (the docID value) and the independent variable I (the docID index). Because the correlation coefficient measures the degree of linear dependence between two quantities, the closer the goodness of fit is to 1, the better the regression equation fits the data and the more linear the test data are.
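As an illustration of this measure, the following Python sketch computes the squared sample correlation coefficient between docID index and docID value. The function name `goodness_of_fit` is hypothetical, and returning 1.0 for degenerate inputs (fewer than two docIDs, or all values equal) is an assumption made here for convenience.

```python
def goodness_of_fit(docids):
    """Squared sample correlation between index i and value docids[i]."""
    n = len(docids)
    if n < 2:
        return 1.0  # assumption: a single point is treated as a perfect fit
    mean_i = (n - 1) / 2.0
    mean_v = sum(docids) / n
    sxx = sum((i - mean_i) ** 2 for i in range(n))
    syy = sum((v - mean_v) ** 2 for v in docids)
    sxy = sum((i - mean_i) * (v - mean_v) for i, v in enumerate(docids))
    if syy == 0:
        return 1.0  # assumption: constant values lie exactly on a line
    return (sxy * sxy) / (sxx * syy)
```

An evenly spaced list such as `[0, 10, 20, 30]` gives a value of 1, while a list with a large jump such as `[1, 2, 3, 100]` gives a noticeably smaller value.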
Fig. 3 shows the basic principle of the inverted index intersection method based on the above preprocessing. Given an inverted list and its linear regression line, draw two lines parallel to the regression line: one shifted horizontally to the left by the left safe search distance and one shifted horizontally to the right by the right safe search distance. All points of the list then lie between the two parallel lines. In other words, when searching the list for a docID y, it suffices to examine the positions that lie no more than the left safe search distance to the left, and no more than the right safe search distance to the right, of the position that the regression line predicts for y in order to decide whether y is in the list. Taking into account the list's own left and right boundaries, namely 0 and the end of the list, the final search range is this window clipped to those boundaries.
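The clipped window described above can be sketched in Python as follows. This is an illustrative sketch, not the patent's formula, which was lost from this text: the names `safe_range` and `contains` are hypothetical, the stored information is assumed to be a slope, an intercept, and two whole-number safe search distances, and binary search from the standard `bisect` module stands in for "some existing search method".

```python
import math
from bisect import bisect_left

def safe_range(y, n, slope, intercept, left, right):
    """Inclusive index window [lo, hi] in a list of length n that can hold y."""
    p = (y - intercept) / slope           # position predicted by the line
    lo = max(0, math.floor(p) - left)     # clip to the left boundary 0
    hi = min(n - 1, math.ceil(p) + right) # clip to the last index
    return lo, hi

def contains(docids, y, slope, intercept, left, right):
    """Binary search for y restricted to its safe search range."""
    lo, hi = safe_range(y, len(docids), slope, intercept, left, right)
    if lo > hi:
        return False  # the window is empty, so y cannot be in the list
    k = bisect_left(docids, y, lo, hi + 1)
    return k <= hi and docids[k] == y
```

For the evenly spaced list `[0, 10, 20, 30]`, whose least-squares line has slope 10 and intercept 0 with both distances equal to 0, a lookup of 20 inspects only index 2.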
Referring to Fig. 4, which shows a flow chart of an embodiment of the inverted index intersection method of the invention, the specific steps are as follows:
(1) For a query containing a positive number of keywords, whose corresponding inverted lists are ordered so that the numbers of docIDs they contain are non-decreasing, initialize the docID index, the keyword index, and the result set (step S401);
(2) Using the linear regression information already stored by the offline preprocessing of step S201, determine the safe search range, in the inverted list being searched, of the i-th element being looked up (step S402);
(3) Using some existing search method, determine whether that element is present within the safe search range determined in step S402 (step S403);
(4) If the result of step S403 is yes, check whether any keywords remain to be searched (step S404);
(6) If the result of step S404 is no, save the element to the result set and execute step S407 (step S406);
(7) If the result of step S403 is no, execute step S407;
(10) If the result of step S407 is no, end the search and take the result set as the final result set (step S409).
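The overall flow can be sketched end to end in Python. Steps (5), (8), and (9) do not appear in this text, so the loop control below is an assumption: each docID of the shortest list is looked up in every other list, and it is kept only if every lookup succeeds. The names `preprocess`, `found_in`, and `intersect` are hypothetical, docIDs are assumed strictly increasing within each list, and the regression information is computed inline here although the patent computes it offline.

```python
import math
from bisect import bisect_left

def preprocess(docids):
    """Return (slope, intercept, left, right) for one non-empty sorted list."""
    n = len(docids)
    if n == 1:
        slope, intercept = 1.0, float(docids[0])  # assumed fallback
    else:
        mean_i = (n - 1) / 2.0
        mean_v = sum(docids) / n
        sxx = sum((i - mean_i) ** 2 for i in range(n))
        sxy = sum((i - mean_i) * (v - mean_v) for i, v in enumerate(docids))
        slope = sxy / sxx
        intercept = mean_v - slope * mean_i
    left = right = 0
    for i, v in enumerate(docids):
        p = (v - intercept) / slope
        left = max(left, math.ceil(p - i))
        right = max(right, math.ceil(i - p))
    return slope, intercept, left, right

def found_in(docids, info, y):
    """Search for y only inside its safe search range."""
    slope, intercept, left, right = info
    p = (y - intercept) / slope
    lo = max(0, math.floor(p) - left)
    hi = min(len(docids) - 1, math.ceil(p) + right)
    if lo > hi:
        return False
    k = bisect_left(docids, y, lo, hi + 1)
    return k <= hi and docids[k] == y

def intersect(lists):
    """Intersect sorted docID lists, shortest list first."""
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    lists = sorted(lists, key=len)             # non-decreasing lengths
    infos = [preprocess(lst) for lst in lists] # normally stored offline
    result = []
    for y in lists[0]:
        if all(found_in(lists[j], infos[j], y) for j in range(1, len(lists))):
            result.append(y)
    return result
```

For example, `intersect([[1, 3, 5, 7, 9, 11], [3, 4, 5, 6, 7], [0, 3, 7, 8, 20, 21, 22]])` walks the five-element list and keeps the docIDs found in both of the others.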
With existing search methods, the search range in an inverted list is the whole list. With the linear-regression-based inverted index intersection method of the invention, the search range in an inverted list is limited to the safe search range. We define the shrinkage ratio of an inverted list as the ratio of its safe search range to the length of the list. The smaller the shrinkage ratio, the smaller the safe search range, and the faster the linear-regression-based inverted index intersection method of the invention.
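A small Python sketch of this ratio follows. The exact formula was lost from this text, so taking the safe search range to be the sum of the left and right safe search distances is an assumption, as is the function name `shrinkage_ratio`.

```python
def shrinkage_ratio(list_length, left, right):
    """Assumed definition: (left + right) safe distance over list length."""
    if list_length <= 0:
        raise ValueError("list_length must be positive")
    return (left + right) / list_length
```

Under this assumed definition, a list of 1000 docIDs with distances 12 and 8 has a ratio of 0.02, so a lookup inspects roughly 2% of the list.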
Fig. 5 shows the average goodness of fit and the average shrinkage ratio of inverted lists in different length ranges on various inverted index data sets. The data sets used are as follows:
(1) GOV and GOV2 denote data sets crawled from the .gov domain in 2002 and 2004, respectively, and BD denotes a data set obtained from Baidu;
(2) GOVPR denotes the data set obtained by reordering the GOV data set according to PageRank;
(3) GOVR, GOV2R, and BDR denote the data sets obtained by randomly shuffling GOV, GOV2, and BD, respectively, with the Fisher-Yates method.
All of these data sets show very good goodness of fit and shrinkage ratios. That is, the linear-regression-based inverted index intersection method of the invention performs well on all of them.
Fig. 6 shows the response times of inverted index intersection on the GOV data set, on an NVIDIA GTX480 platform, using the traditional binary search method and the linear-regression-based inverted index intersection method of the invention. As can be seen, when the computation threshold is large, the inverted index intersection method of the invention has the shorter response time.
The inverted index intersection method of the invention has been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the invention.
Claims (1)
1. An inverted index intersection method, characterized in that it comprises:
1. Offline preprocessing:
For each inverted list, plot a two-dimensional scatter diagram with the index of each docID as the abscissa and the value of the docID as the ordinate, where the index ranges over the positions in the list (the list length being the number of docIDs the list contains) and every docID is a non-negative integer. Fit a linear regression line by the least squares method, so that the sum of squared vertical deviations from all points in the diagram to the line is minimized. Compute the left safe search distance and the right safe search distance, and store the resulting linear regression information, namely the parameters of the fitted line and the two safe search distances;
2. Inverted index intersection, with the following specific steps:
2.1. For a query containing a positive number of keywords, whose corresponding inverted lists are ordered so that the numbers of docIDs they contain are non-decreasing, initialize the docID index, the keyword index, and the result set;
2.2. Using the linear regression information stored by the offline preprocessing of step 1, determine the safe search range, in the inverted list being searched, of the element at the current docID index;
2.3. Using some existing search method, determine whether that element is present within the safe search range determined in step 2.2;
2.6. If the result of step 2.4 is no, save the element to the result set and execute step 2.8;
2.7. If the result of step 2.3 is no, execute step 2.8;
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101181617A CN102136011A (en) | 2011-05-09 | 2011-05-09 | Reverse index intersection method |
PCT/CN2011/076841 WO2012151781A1 (en) | 2011-05-09 | 2011-08-01 | Inverted index intersection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101181617A CN102136011A (en) | 2011-05-09 | 2011-05-09 | Reverse index intersection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102136011A true CN102136011A (en) | 2011-07-27 |
Family
ID=44295797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011101181617A Pending CN102136011A (en) | 2011-05-09 | 2011-05-09 | Reverse index intersection method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102136011A (en) |
WO (1) | WO2012151781A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012151781A1 (en) * | 2011-05-09 | 2012-11-15 | 南开大学 | Inverted index intersection method |
WO2016173366A1 (en) * | 2015-04-28 | 2016-11-03 | 腾讯科技(深圳)有限公司 | Intersection algorithm-based searching method, searching system and storage medium |
CN110083679A (en) * | 2019-03-18 | 2019-08-02 | 北京三快在线科技有限公司 | Processing method, device, electronic equipment and the storage medium of searching request |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1909515A (en) * | 2006-08-07 | 2007-02-07 | 华为技术有限公司 | Method and device for realizing elastic sectionalization ring guiding protective inverting |
CN101242430A (en) * | 2008-02-22 | 2008-08-13 | 华中科技大学 | Fixed data pre-access method in peer network order system |
CN101930473A (en) * | 2010-09-14 | 2010-12-29 | 何吴迪 | Method for constructing cloud computing window search system with executable structure |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7039625B2 (en) * | 2002-11-22 | 2006-05-02 | International Business Machines Corporation | International information search and delivery system providing search results personalized to a particular natural language |
CN1292371C (en) * | 2003-04-11 | 2006-12-27 | 国际商业机器公司 | Inverted index storage method, inverted index mechanism and on-line updating method |
US7496568B2 (en) * | 2006-11-30 | 2009-02-24 | International Business Machines Corporation | Efficient multifaceted search in information retrieval systems |
CN102023985B (en) * | 2009-09-17 | 2014-05-28 | 日电(中国)有限公司 | Method and device for generating blind mixed invert index table as well as method and device for searching joint keywords |
CN102136011A (en) * | 2011-05-09 | 2011-07-27 | 南开大学 | Reverse index intersection method |
-
2011
- 2011-05-09 CN CN2011101181617A patent/CN102136011A/en active Pending
- 2011-08-01 WO PCT/CN2011/076841 patent/WO2012151781A1/en unknown
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012151781A1 (en) * | 2011-05-09 | 2012-11-15 | 南开大学 | Inverted index intersection method |
WO2016173366A1 (en) * | 2015-04-28 | 2016-11-03 | 腾讯科技(深圳)有限公司 | Intersection algorithm-based searching method, searching system and storage medium |
CN106156000A (en) * | 2015-04-28 | 2016-11-23 | 腾讯科技(深圳)有限公司 | Searching method based on intersection algorithm and search system |
CN106156000B (en) * | 2015-04-28 | 2020-03-17 | 腾讯科技(深圳)有限公司 | Search method and search system based on intersection algorithm |
US10902036B2 (en) | 2015-04-28 | 2021-01-26 | Tencent Technology (Shenzhen) Company Limited | Intersection algorithm-based search method and system, and storage medium |
CN110083679A (en) * | 2019-03-18 | 2019-08-02 | 北京三快在线科技有限公司 | Processing method, device, electronic equipment and the storage medium of searching request |
Also Published As
Publication number | Publication date |
---|---|
WO2012151781A1 (en) | 2012-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104200087B (en) | For the parameter optimization of machine learning and the method and system of feature tuning | |
CN107239497B (en) | Hot content search method and system | |
JP6243045B2 (en) | Graph data query method and apparatus | |
Yang et al. | Experimental study on the five sort algorithms | |
CN104281664B (en) | Distributed figure computing system data segmentation method and system | |
CN102682046A (en) | Member searching and analyzing method in social network and searching system | |
CN107766406A (en) | A kind of track similarity join querying method searched for using time priority | |
CN105550368A (en) | Approximate nearest neighbor searching method and system of high dimensional data | |
CN105404675A (en) | Ranked reverse nearest neighbor space keyword query method and apparatus | |
CN103473268B (en) | Linear element spatial index structuring method, system and search method and system thereof | |
Huang et al. | Improving the relevancy of document search using the multi-term adjacency keyword-order model | |
CN102136011A (en) | Reverse index intersection method | |
CN106599610A (en) | Method and system for predicting association between long non-coding RNA and protein | |
CN112036737A (en) | Method and device for calculating regional electric quantity deviation | |
CN102081666A (en) | Index construction method for distributed picture search and server | |
CN105447187A (en) | Webpage search method and system | |
CN116108076B (en) | Materialized view query method, materialized view query system, materialized view query equipment and storage medium | |
CN103455491A (en) | Method and device for classifying search terms | |
Gulzar et al. | D-SKY: A framework for processing skyline queries in a dynamic and incomplete database | |
CN102662973A (en) | Recommendation system and method of mechanical product design document | |
Goncalves et al. | Making recommendations using location-based skyline queries | |
Boldi et al. | Arc-community detection via triangular random walks | |
Gao et al. | Detecting geometric conflicts for generalisation of polygonal maps | |
CN107766414B (en) | Multi-document intersection acquisition method, device and equipment and readable storage medium | |
CN102156754A (en) | Web object search method based on visibility |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20110727 |