WO2012151781A1 - Inverted index intersection method - Google Patents
Inverted index intersection method
- Publication number
- WO2012151781A1 (PCT/CN2011/076841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- search
- result
- inverted
- docid
- inverted index
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/328—Management therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
Definitions
- The invention belongs to the technical field of inverted indexes, and particularly relates to an inverted index intersection method.
- The most widely used data structure in search engines is the inverted index, which consists of a dictionary and inverted lists.
- The dictionary maps each keyword one-to-one to an inverted list, and each inverted list consists of a series of basic units called postings.
- Each posting consists of information such as the document identifier (called docID) of a web page containing the corresponding keyword, together with the frequency and positions of the keyword in that page.
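The dictionary-plus-postings layout described above can be illustrated with a minimal sketch. The names `Posting` and `build_inverted_index`, and the assumption that documents arrive as already-tokenized word lists keyed by docID, are illustrative and not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    """One basic unit of an inverted list (field names are hypothetical)."""
    doc_id: int       # document identifier (docID)
    frequency: int    # how often the keyword occurs in the document
    positions: list   # token offsets of the keyword inside the document


def build_inverted_index(documents):
    """Build {keyword: [Posting, ...]} from {doc_id: [token, ...]}."""
    index = defaultdict(list)
    # Visiting documents in ascending docID order keeps each list sorted.
    for doc_id in sorted(documents):
        positions = defaultdict(list)
        for offset, token in enumerate(documents[doc_id]):
            positions[token].append(offset)
        for token, offsets in positions.items():
            index[token].append(Posting(doc_id, len(offsets), offsets))
    return dict(index)
```

Keeping every inverted list sorted by docID is what makes the later intersection step searchable with binary search or any other ordered search method.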
- Step S101: Acquire a user query request.
- The search engine continuously receives user query requests and segments each query to obtain the corresponding keywords.
- Step S102: Intersect the inverted lists corresponding to the query request.
- The inverted lists corresponding to the keywords of the query are found through the dictionary in the inverted index and are then intersected.
- Step S103: Return the result of the intersection to the user in a certain manner.
- Step S102 takes up a large share of the time in the entire process and is the main target of the optimization.
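Steps S101 to S103 can be sketched as a small baseline that intersects the inverted lists with plain binary search. This is only an illustration of the conventional flow, not the method of the invention: `process_query` and `binary_contains` are hypothetical names, whitespace splitting stands in for real query segmentation, and the index is assumed to map each keyword to a sorted list of docIDs.

```python
from bisect import bisect_left


def binary_contains(sorted_ids, target):
    """Binary search over a whole sorted docID list."""
    pos = bisect_left(sorted_ids, target)
    return pos < len(sorted_ids) and sorted_ids[pos] == target


def process_query(query, index):
    """Baseline for steps S101-S103 (segmentation by whitespace is an assumption)."""
    keywords = query.lower().split()                      # S101: segment the query
    lists = [index.get(word, []) for word in keywords]    # S102: dictionary lookup
    if not lists:
        return []
    lists.sort(key=len)                                   # probe with the shortest list
    shortest, others = lists[0], lists[1:]
    # S102: keep the docIDs that occur in every other list.
    result = [d for d in shortest if all(binary_contains(o, d) for o in others)]
    return result                                         # S103: hand back the result
```

Every lookup here searches an entire list, which is the cost that a narrower search range is meant to reduce.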
- The object of the present invention is to provide a novel inverted index intersection method based on linear regression, in view of the shortcomings of existing inverted index intersection methods.
- The inverted index intersection method provided by the present invention includes:
- Preprocessing: for each inverted list $\ell(t)$, a two-dimensional scatter plot is made with the index $i$ of each docID as the abscissa and the docID value $x_i$ as the ordinate, where $i = 1, 2, \ldots, k$ ($k$ is the number of docIDs contained in $\ell(t)$) and $x_i$ is a non-negative integer; based on the least squares method, a linear regression line $f_t(i) = \alpha_t + \beta_t i$ is generated such that the sum of squares of the vertical deviations, $\sum_{i=1}^{k} (x_i - f_t(i))^2$, is the smallest; the left safe search distance $L_t = \max_{1 \le i \le k} \{ f_t^{-1}(x_i) - i \}$ and the right safe search distance $R_t = \max_{1 \le i \le k} \{ i - f_t^{-1}(x_i) \}$ are found, and the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$ and $R_t$ is saved (step S201);
- Intersection: according to the linear regression information of $\ell(t)$ saved by the offline preprocessing, the safe search range in $\ell(t)$ of the $i$-th element $y_i$ to be looked up is determined as $[\, f_t^{-1}(y_i) - L_t,\; f_t^{-1}(y_i) + R_t \,]$ (step S402);
- an existing search method is used to determine whether $y_i$ is found within the safe search range determined in step S402 (step S403);
- if the result of step S403 is YES, it is checked whether […] holds (step S404);
- if the result of step S404 is YES, then […] is incremented and the process returns to step S402 (step S405);
- if the result of step S404 is NO, $y_i$ is saved to set $A$ and step S407 is performed (step S406);
- […] step S407 is performed;
- if the result of step S407 is NO, the search is ended, and $A$ is taken as the final result set (step S409).
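The preprocessing and search steps listed above can be sketched as follows. This is a hedged illustration under several assumptions rather than the patent's exact procedure: positions are 1-based, the fitted line is written as `alpha + beta * i`, `bisect_left` plays the role of the "existing search method", and, because the conditions of steps S404 and S407 are not legible in the text, the loop simply checks each probe element against every remaining list. The function names are hypothetical.

```python
from bisect import bisect_left
from math import ceil, floor


def preprocess(docids):
    """Offline step: fit x_i ~ alpha + beta * i and derive safe distances L, R."""
    k = len(docids)
    if k < 2:
        # Degenerate list: use a unit-slope line through the single point.
        first = docids[0] if docids else 0
        return (first - 1.0, 1.0, 0.0, 0.0)
    mean_i = (k + 1) / 2
    mean_x = sum(docids) / k
    sxx = sum((i - mean_i) ** 2 for i in range(1, k + 1))
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(docids, start=1))
    beta = sxy / sxx                      # positive for strictly increasing docIDs
    alpha = mean_x - beta * mean_i
    predicted = [(x - alpha) / beta for x in docids]   # inverse of the fitted line
    left = max(p - i for i, p in enumerate(predicted, start=1))
    right = max(i - p for i, p in enumerate(predicted, start=1))
    return (alpha, beta, left, right)


def contains(docids, info, y):
    """Search y only inside its safe range [f^-1(y) - L, f^-1(y) + R]."""
    alpha, beta, left, right = info
    guess = (y - alpha) / beta
    # floor/ceil widen the range by less than one slot to absorb rounding error.
    lo = max(1, floor(guess - left))
    hi = min(len(docids), ceil(guess + right))
    if lo > hi:
        return False
    pos = bisect_left(docids, y, lo - 1, hi)   # 1-based range -> 0-based slice
    return pos < hi and docids[pos] == y


def intersect(probe, others):
    """Keep elements of `probe` found in every (docids, info) pair of `others`."""
    return [y for y in probe if all(contains(lst, info, y) for lst, info in others)]
```

A call such as `intersect(c, [(a, preprocess(a)), (b, preprocess(b))])` probes with the elements of `c`; in practice the regression information would be computed once offline and stored alongside each list.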
- The inverted index intersection method based on linear regression narrows the search range, reduces the search time, and improves the user experience.
- FIG. 1 is a flow chart of search engine query processing.
- FIG. 2 is a diagram of an embodiment of the preprocessing of the inverted index intersection method of the present invention.
- FIG. 3 is a schematic diagram of the inverted index intersection method of the present invention.
- FIG. 4 is a flow chart of an embodiment of the inverted index intersection method of the present invention.
- FIG. 5 shows the average goodness of fit and average shrinkage rate on different inverted index data sets.
- FIG. 6 is a response time diagram of binary search and of the inverted index intersection method of the present invention on the GOV data set.
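- Step S201: for each inverted list $\ell(t)$, a two-dimensional scatter plot is made with the index $i$ of each docID as the abscissa and the docID value $x_i$ as the ordinate, where $i = 1, 2, \ldots, k$ ($k$ represents the number of docIDs contained in $\ell(t)$) and $x_i$ is a non-negative integer. Based on the least squares method, a linear regression line $f_t(i) = \alpha_t + \beta_t i$ is generated such that the sum of squares of the vertical deviations, $\sum_{i=1}^{k} (x_i - f_t(i))^2$, is the smallest. The left safe search distance $L_t = \max_{1 \le i \le k} \{ f_t^{-1}(x_i) - i \}$ and the right safe search distance $R_t = \max_{1 \le i \le k} \{ i - f_t^{-1}(x_i) \}$ are then found, and the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$ and $R_t$ is saved.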
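For reference, the least-squares line in step S201 has a standard closed form. Writing the fitted line as $f_t(i) = \alpha_t + \beta_t i$ over the points $(i, x_i)$, $i = 1, \ldots, k$ (this notation is an assumption, chosen to match the four values the step saves), the textbook estimates are:

```latex
\begin{aligned}
\bar{\imath} &= \frac{1}{k}\sum_{i=1}^{k} i = \frac{k+1}{2},
\qquad
\bar{x} = \frac{1}{k}\sum_{i=1}^{k} x_i, \\
\beta_t &= \frac{\sum_{i=1}^{k} (i - \bar{\imath})(x_i - \bar{x})}
               {\sum_{i=1}^{k} (i - \bar{\imath})^2},
\qquad
\sum_{i=1}^{k} (i - \bar{\imath})^2 = \frac{k(k^2 - 1)}{12}, \\
\alpha_t &= \bar{x} - \beta_t\,\bar{\imath},
\qquad
f_t^{-1}(x) = \frac{x - \alpha_t}{\beta_t}.
\end{aligned}
```

Assuming the docIDs in a list are distinct and sorted in increasing order, $\beta_t > 0$ whenever $k \ge 2$, so the inverse $f_t^{-1}$ used for the safe search distances is well defined.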
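- $R^2 = \sum_{i=1}^{k} (f_t(i) - \bar{x})^2 \,/\, \sum_{i=1}^{k} (x_i - \bar{x})^2$ is called the goodness of fit, where $f_t(i)$ is the fitted value at index $i$, $x_i$ is the docID value and $\bar{x}$ is the mean of the $x_i$; obviously $0 \le R^2 \le 1$.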
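- For an ordinary search over the whole list, the size of the search range in the inverted list $\ell(t)$ is $|\ell(t)|$.
- With the safe search range, the size of the search range is $(L_t + R_t)$.
- $(L_t + R_t) / |\ell(t)|$ is the shrinkage rate of the inverted list.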
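The two per-list statistics discussed above, goodness of fit and shrinkage rate, can be computed with a short sketch. It assumes 1-based positions, a list of at least two distinct sorted docIDs, the usual definition of $R^2$ as explained over total sum of squares, and the shrinkage rate taken as (left distance + right distance) divided by list length; `fit_statistics` is a hypothetical name.

```python
def fit_statistics(docids):
    """Return (goodness_of_fit, shrinkage_rate) for one sorted docID list.

    Assumes len(docids) >= 2 and strictly increasing values.
    """
    k = len(docids)
    mean_i = (k + 1) / 2
    mean_x = sum(docids) / k
    sxx = sum((i - mean_i) ** 2 for i in range(1, k + 1))
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(docids, start=1))
    beta = sxy / sxx
    alpha = mean_x - beta * mean_i

    # Goodness of fit: explained sum of squares over total sum of squares.
    explained = sum((alpha + beta * i - mean_x) ** 2 for i in range(1, k + 1))
    total = sum((x - mean_x) ** 2 for x in docids)
    r_squared = explained / total

    # Safe distances: worst over- and under-estimate of a docID's position.
    guesses = [(x - alpha) / beta for x in docids]
    left = max(g - i for i, g in enumerate(guesses, start=1))
    right = max(i - g for i, g in enumerate(guesses, start=1))
    shrinkage = (left + right) / k
    return r_squared, shrinkage
```

An evenly spaced list such as `[10, 20, 30, 40, 50]` lies exactly on a line, so its goodness of fit is 1 and its shrinkage rate is 0; a list with one large gap fits worse and leaves a wider range to search.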
- GOV and GOV2 represent data sets crawled from the .gov domain in 2002 and 2004, respectively, and BD represents a data set obtained from Baidu.
- GOVPR indicates a data set obtained by rearranging the GOV data set according to PageRank.
- GOVR, GOV2R, and BDR represent data sets obtained by randomly rearranging GOV, GOV2, and BD, respectively, using the Fisher-Yates method.
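The Fisher-Yates method mentioned above is the standard unbiased shuffle. The sketch below shows the shuffle itself and, as an assumption about what "randomly rearranging" a data set means here, uses it to reassign docIDs and rebuild the sorted inverted lists. The names `fisher_yates_shuffle` and `reassign_docids`, and the convention that docIDs run from 1 to `num_docs`, are illustrative.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of `items` (input is not modified)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)            # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def reassign_docids(index, num_docs, seed=0):
    """Map docIDs 1..num_docs to a random permutation and re-sort every list."""
    rng = random.Random(seed)
    new_ids = fisher_yates_shuffle(range(1, num_docs + 1), rng)
    mapping = dict(zip(range(1, num_docs + 1), new_ids))
    return {term: sorted(mapping[d] for d in docids) for term, docids in index.items()}
```

Renumbering changes how the docIDs are spread inside each list but not which documents a list contains, so intersection sizes are preserved.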
- The linear-regression-based inverted index intersection method of the present invention has a good effect on various data sets.
- FIG. 6 shows the response times of the traditional binary search method and of the linear-regression-based inverted index intersection method of the present invention when performing inverted index intersection on the GOV data set.
- The inverted index intersection method of the present invention has a short response time when the calculation threshold is large.
- The inverted index intersection method of the present invention has been described in detail above. The principles and embodiments of the present invention are explained herein, and the description of the above embodiments is only intended to help in understanding the method and the core idea of the present invention. The content of this description should not be construed as limiting the present invention.
Abstract
The present invention relates to an inverted index intersection method. The method comprises: in preprocessing, for each inverted list, constructing a two-dimensional scatter plot with the index of the document identifier as the horizontal coordinate and the value at that index as the vertical coordinate; generating a linear regression line based on the least squares method so that the sum of squares of the vertical deviations of all points in the plot from the line is minimal; deriving a left safe search distance and a right safe search distance; and saving the derived linear regression information. The inverted index intersection determines the safe search range of the document identifier to be found in the inverted list according to the saved linear regression information for that inverted list, and then applies an existing search method to search within that range. The inverted index intersection method of the present invention narrows the search range, reduces the search time, shortens the response time of the search engine and improves the user experience.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011101181617A CN102136011A (zh) | 2011-05-09 | 2011-05-09 | Inverted index intersection method |
CN201110118161.7 | 2011-05-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012151781A1 (fr) | 2012-11-15 |
Family
ID=44295797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2011/076841 WO2012151781A1 (fr) | 2011-05-09 | 2011-08-01 | Inverted index intersection method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102136011A (fr) |
WO (1) | WO2012151781A1 (fr) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
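CN102136011A (zh) * | 2011-05-09 | 2011-07-27 | Nankai University | Inverted index intersection method |
CN106156000B (zh) * | 2015-04-28 | 2020-03-17 | Tencent Technology (Shenzhen) Co., Ltd. | Search method and search system based on an intersection algorithm |
CN110083679B (zh) * | 2019-03-18 | 2020-08-18 | Beijing Sankuai Online Technology Co., Ltd. | Search request processing method and apparatus, electronic device and storage medium |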
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1503163A (zh) * | 2002-11-22 | 2004-06-09 | […] | International search and delivery system providing search results personalized to a particular language |
US20040205044A1 (en) * | 2003-04-11 | 2004-10-14 | International Business Machines Corporation | Method for storing inverted index, method for on-line updating the same and inverted index mechanism |
US20080133473A1 (en) * | 2006-11-30 | 2008-06-05 | Broder Andrei Z | Efficient multifaceted search in information retrieval systems |
CN102023985A (zh) * | 2009-09-17 | 2011-04-20 | NEC (China) Co., Ltd. | Method and device for generating a blinded hybrid inverted index table, and method and device for joint keyword search |
CN102136011A (zh) * | 2011-05-09 | 2011-07-27 | Nankai University | Inverted index intersection method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
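CN100454907C (zh) * | 2006-08-07 | 2009-01-21 | Huawei Technologies Co., Ltd. | Method and device for implementing steering protection switching in a resilient packet ring |
CN101242430B (zh) * | 2008-02-22 | 2012-03-28 | Huazhong University of Science and Technology | Fixed-point data prefetching method in a peer-to-peer on-demand system |
CN101930473A (zh) * | 2010-09-14 | 2010-12-29 | He Wudi | Architecture method for a cloud computing window search system with an executable structure |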
-
2011
- 2011-05-09 CN CN2011101181617A patent/CN102136011A/zh active Pending
- 2011-08-01 WO PCT/CN2011/076841 patent/WO2012151781A1/fr unknown
Also Published As
Publication number | Publication date |
---|---|
CN102136011A (zh) | 2011-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
NENP | Non-entry into the national phase | Ref country code: DE |