WO2012151781A1 - Inverted index intersection method - Google Patents

Inverted index intersection method

Info

Publication number
WO2012151781A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
result
inverted
docid
inverted index
Prior art date
Application number
PCT/CN2011/076841
Other languages
English (en)
Chinese (zh)
Inventor
刘晓光
王刚
敖耐勇
吴迪
张帆
Original Assignee
南开大学
Priority date
Filing date
Publication date
Application filed by 南开大学
Publication of WO2012151781A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Definitions

  • the invention belongs to the technical field of inverted indexes, and particularly relates to an inverted index intersection method.
  • the most widely used data structure in search engines is the inverted index, which consists of a dictionary and an inverted list.
  • the dictionary has a one-to-one correspondence between keywords and inverted lists, and the inverted list consists of a series of basic units called postings.
  • Each posting consists of information such as the document identifier (called docID) of a web page containing the corresponding keyword, together with the frequency and location of the keyword in that page.
  • Step S101 Acquire a user query request.
  • the search engine continuously receives the user query request, and then segments the query to obtain the corresponding keywords.
  • Step S102 Perform an intersection on the inverted list corresponding to the query request.
  • the inverted list corresponding to the keyword of the query is found by the dictionary in the inverted index, and they are intersected.
  • Step S103 Return the result of the intersection to the user in a certain manner.
  • step S102 accounts for more of the time in the entire process than the other steps, so it is the main target of our optimization.
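  • The processing flow of steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the patented method: it assumes the query has already been segmented into keywords, models the inverted index as a plain dictionary from keyword to sorted docID list, and uses ordinary binary search for the intersection. The names `intersect_sorted`, `answer_query`, and `toy_index` are hypothetical.

```python
from bisect import bisect_left


def intersect_sorted(short_list, long_list):
    """Return docIDs present in both sorted lists (one binary search per docID)."""
    result = []
    for doc_id in short_list:
        pos = bisect_left(long_list, doc_id)
        if pos < len(long_list) and long_list[pos] == doc_id:
            result.append(doc_id)
    return result


def answer_query(inverted_index, keywords):
    """Steps S101-S103 in miniature: look up each list, intersect, return."""
    # Dictionary lookup; a keyword with no inverted list matches nothing.
    lists = [inverted_index.get(word, []) for word in keywords]
    if not lists:
        return []
    lists.sort(key=len)  # start from the shortest list to keep candidates few
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return result


# Hypothetical toy index: keyword -> sorted docIDs.
toy_index = {"linear": [1, 3, 5, 8], "regression": [3, 4, 8, 9], "index": [8]}
print(answer_query(toy_index, ["linear", "regression"]))  # [3, 8]
```

  • Every lookup here searches the whole of the longer list, which is the cost that the method described below aims to reduce.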
  • the object of the present invention is to provide a novel inverted index intersection method based on linear regression, in view of the shortcomings of the existing inverted index intersection methods.
  • the inverted index intersection method provided by the present invention includes:
  • Preprocessing: for each inverted list $\ell(t)$, construct a two-dimensional scatter plot with the index $i$ of each docID as the abscissa and its value $x_i$ as the ordinate, where $i = 1, 2, \ldots, |\ell(t)|$ ($|\ell(t)|$ denotes the number of docIDs contained in $\ell(t)$ and is a non-negative integer). Based on the least squares method, generate a linear regression line $f_t(i) = \alpha_t i + \beta_t$ such that the sum of squares of the vertical deviations, $\sum_i (x_i - f_t(i))^2$, is the smallest; find the left safe search distance $L_t = \max_i \{ f_t^{-1}(x_i) - i \}$ and the right safe search distance $R_t = \max_i \{ i - f_t^{-1}(x_i) \}$; save the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$ and $R_t$ (step S201);
  • according to the saved linear regression information of $\ell(t)$ obtained by offline preprocessing, determine the safe search range in $\ell(t)$ of the i-th element $y_i$ to be looked up: $[\, f_t^{-1}(y_i) - L_t,\ f_t^{-1}(y_i) + R_t \,]$ (step S402);
  • (3) use an existing search method to determine whether $y_i$ is in the safe search range determined in step S402 (step S403);
  • If the result of step S403 is YES, check whether … holds (step S404);
  • If the result of step S404 is YES, then $i{+}{+}$ and return to step S402 (step S405);
  • If the result of step S404 is NO, save $y_i$ to set A and perform step S407 (step S406);
  • step S407 is performed
  • If the result of step S407 is NO, the search is ended, and A is taken as the final result set (step S409).
  • the inverted index intersection method based on linear regression can narrow the search range, reduce the search time, and improve the user experience.
  • Figure 1 shows the flow chart of the search engine processing.
  • FIG. 2 is a diagram showing an embodiment of a preprocessing method of an inverted index intersection method according to the present invention.
  • FIG. 3 is a schematic diagram of the inverted index intersection method of the present invention.
  • FIG. 4 is a flow chart of an embodiment of an inverted index intersection method of the present invention.
  • Figure 5 shows the average goodness of fit and average shrinkage on different inverted index data sets.
  • FIG. 6 is a response time diagram of a binary search on GOV data and an inverted index intersection method of the present invention.
  • Step S201: for each inverted list $\ell(t)$, construct a two-dimensional scatter plot with the index $i$ of each docID as the abscissa and its value $x_i$ as the ordinate, where $i = 1, 2, \ldots, |\ell(t)|$ ($|\ell(t)|$ denotes the number of docIDs contained in $\ell(t)$ and is a non-negative integer). Based on the least squares method, generate a linear regression line $f_t(i) = \alpha_t i + \beta_t$ such that the sum of squares of the vertical deviations, $\sum_i (x_i - f_t(i))^2$, is the smallest; find the left safe search distance $L_t = \max_i \{ f_t^{-1}(x_i) - i \}$ and the right safe search distance $R_t = \max_i \{ i - f_t^{-1}(x_i) \}$; save the obtained linear regression information $\alpha_t$, $\beta_t$, $L_t$ and $R_t$.
  • $R_t^2 = \sum_i (f_t(i) - \bar{x})^2 \,/\, \sum_i (x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the $x_i$, is called the goodness of fit; obviously $0 \le R_t^2 \le 1$.
  • (3) use an existing search method to determine whether $y_i$ is in the safe search range determined in step S402 (step S403);
  • If the result of step S403 is YES, check whether … holds (step S404);
  • If the result of step S404 is YES, then $i{+}{+}$ and return to step S402 (step S405);
  • If the result of step S404 is NO, save $y_i$ to set A and perform step S407 (step S406);
  • step S407 is performed
  • If the result of step S407 is NO, the search is ended, and A is taken as the final result set (step S409).
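  • The preprocessing of step S201 can be sketched as follows. This is an illustrative reading rather than the applicant's implementation: it assumes strictly increasing docIDs with positions numbered from 1, takes the safe distances as the largest gaps between the predicted position $f^{-1}(x_i)$ and the true position $i$, and computes the goodness of fit as the standard coefficient of determination. The function name `preprocess` and the dictionary keys are hypothetical.

```python
def preprocess(doc_ids):
    """Fit x = alpha * i + beta over positions i = 1..n; derive L, R and R^2.

    Assumes doc_ids is strictly increasing and has at least two entries,
    so the fitted slope alpha is positive and the line can be inverted.
    """
    n = len(doc_ids)
    if n < 2:
        raise ValueError("need at least two docIDs to fit a line")
    mean_i = (n + 1) / 2
    mean_x = sum(doc_ids) / n
    s_ii = sum((i - mean_i) ** 2 for i in range(1, n + 1))
    s_ix = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(doc_ids, 1))
    alpha = s_ix / s_ii              # least-squares slope
    beta = mean_x - alpha * mean_i   # least-squares intercept
    # Signed gap between the predicted position f^-1(x_i) and the true position i.
    gaps = [(x - beta) / alpha - i for i, x in enumerate(doc_ids, 1)]
    left = max(gaps)                 # left safe search distance
    right = max(-g for g in gaps)    # right safe search distance
    ss_total = sum((x - mean_x) ** 2 for x in doc_ids)
    ss_fit = sum((alpha * i + beta - mean_x) ** 2 for i in range(1, n + 1))
    return {"alpha": alpha, "beta": beta, "L": left, "R": right,
            "r2": ss_fit / ss_total}


info = preprocess([10, 20, 30, 40])  # perfectly linear list
print(info)
```

  • For a perfectly linear list such as the one above, both safe distances come out as zero and the goodness of fit as one; the less linear the list, the larger the distances become.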
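  • The search loop above can be sketched as follows, under stated assumptions. The explicit S402 to S409 control flow is collapsed into a single pass over the shorter list, binary search stands in for the "existing search method", and the range bounds are rounded outward so that floating-point error cannot exclude a matching position. The names `fit`, `contains`, and `intersect_lr` are hypothetical, and the longer list is assumed to be strictly increasing with at least two docIDs.

```python
import math
from bisect import bisect_left


def fit(doc_ids):
    """Least-squares line over positions 1..n, plus safe distances L and R."""
    n = len(doc_ids)
    mean_i, mean_x = (n + 1) / 2, sum(doc_ids) / n
    s_ii = sum((i - mean_i) ** 2 for i in range(1, n + 1))
    s_ix = sum((i - mean_i) * (x - mean_x) for i, x in enumerate(doc_ids, 1))
    alpha = s_ix / s_ii
    beta = mean_x - alpha * mean_i
    gaps = [(x - beta) / alpha - i for i, x in enumerate(doc_ids, 1)]
    return alpha, beta, max(gaps), max(-g for g in gaps)


def contains(doc_ids, model, y):
    """Search for y only inside its safe range [f^-1(y) - L, f^-1(y) + R]."""
    alpha, beta, left, right = model
    est = (y - beta) / alpha                      # predicted 1-based position
    lo = max(0, math.floor(est - left) - 1)       # 0-based, inclusive
    hi = min(len(doc_ids), math.ceil(est + right))  # 0-based, exclusive
    if lo >= hi:
        return False
    pos = bisect_left(doc_ids, y, lo, hi)
    return pos < hi and doc_ids[pos] == y


def intersect_lr(short_list, long_list, model):
    """Keep each element of the shorter list that is found in the longer one."""
    return [y for y in short_list if contains(long_list, model, y)]


long_list = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
model = fit(long_list)  # done once, offline
print(intersect_lr([3, 4, 13, 30, 37, 100], long_list, model))
```

  • The regression model is computed once per inverted list and reused for every query, so only the narrowed search is paid for at query time.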
  • the search range size over the whole inverted list $\ell(t)$ is $|\ell(t)|$,
  • whereas the size of the safe search range is $(L_t + R_t)$.
  • $(L_t + R_t)/|\ell(t)|$ is the shrinkage rate of the inverted list.
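  • As a worked illustration with hypothetical numbers (not taken from the patent's experiments): suppose an inverted list holds $|\ell(t)| = 10^6$ docIDs and its safe distances sum to $L_t + R_t = 2000$. The shrinkage rate and the approximate number of binary-search probes over the full list versus the safe range would then be:

```latex
\[
\frac{L_t + R_t}{|\ell(t)|} = \frac{2000}{1\,000\,000} = 0.002,
\qquad
\lceil \log_2 10^6 \rceil = 20,
\qquad
\lceil \log_2 2000 \rceil = 11 .
\]
```

  • Under these assumed figures each lookup needs roughly 11 probes instead of 20; the actual gain depends on how well the list fits a line.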
  • GOV and GOV2 represent data sets captured from .gov domain names in 2002 and 2004, respectively, and BD represents data sets obtained from Baidu.
  • GOVPR indicates a data set obtained by rearranging the GOV data set according to PageRank.
  • GOVR, GOV2R, and BDR represent data sets obtained by randomly rearranging GOV, GOV2, and BD using the Fisher-Yates method, respectively.
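  • A random rearrangement of this kind can be sketched as follows. The sketch assumes that "rearranging" means assigning every document a new docID through a uniformly random permutation produced by the Fisher-Yates shuffle, after which each inverted list is re-sorted; the text does not spell this out, and the names `fisher_yates_shuffle` and `renumber_doc_ids` are hypothetical.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of items (Fisher-Yates)."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        a[i], a[j] = a[j], a[i]
    return a


def renumber_doc_ids(inverted_lists, num_docs, rng):
    """Give every document a random new docID and re-sort each inverted list."""
    new_id = fisher_yates_shuffle(range(num_docs), rng)  # old docID -> new docID
    return {term: sorted(new_id[d] for d in postings)
            for term, postings in inverted_lists.items()}


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
print(renumber_doc_ids({"a": [0, 3, 7], "b": [3, 4]}, 10, rng))
```

  • Because the renumbering is a bijection, list lengths and intersection sizes are unchanged; only the distribution of docID values within each list differs.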
  • the linear regression-based inverted index intersection method of the present invention has a good effect on various data sets.
  • FIG. 6 shows the response time of inverted index intersection on the GOV data set for a traditional binary search method and for the linear regression-based inverted index intersection method of the present invention.
  • the inverted index intersection method of the present invention has a small response time when the calculation threshold is large.
  • the inverted index intersection method of the present invention, together with its principles and embodiments, has been described in detail above. The description of the above embodiments is only intended to help in understanding the method and the core idea of the present invention, and the content of this description should not be construed as limiting the present invention.

Abstract

The present invention relates to an inverted index intersection method. The method comprises: in preprocessing, constructing for each inverted list a two-dimensional scatter plot with the docID index as the horizontal coordinate and the docID value as the vertical coordinate; generating a linear regression line based on the least squares method so that the sum of squares of the vertical deviations of all points in the plot from the line is minimal; deriving a left safe search distance and a right safe search distance; and saving the derived linear regression information. The inverted index intersection determines the safe search range of the docID to be found in the inverted list according to the saved linear regression information of that list, and then applies an existing search method to search within that range. The inverted index intersection method of the present invention narrows the search range, reduces the search time, shortens the response time of the search engine, and improves the user experience.
PCT/CN2011/076841 2011-05-09 2011-08-01 Inverted index intersection method WO2012151781A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2011101181617A CN102136011A (zh) 2011-05-09 2011-05-09 Inverted index intersection method
CN201110118161.7 2011-05-09

Publications (1)

Publication Number Publication Date
WO2012151781A1 (fr) 2012-11-15

Family

ID=44295797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/076841 WO2012151781A1 (fr) 2011-05-09 2011-08-01 Inverted index intersection method

Country Status (2)

Country Link
CN (1) CN102136011A (fr)
WO (1) WO2012151781A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
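CN102136011A (zh) * 2011-05-09 2011-07-27 南开大学 Inverted index intersection method
CN106156000B (zh) * 2015-04-28 2020-03-17 腾讯科技(深圳)有限公司 Search method and search system based on an intersection algorithm
CN110083679B (zh) * 2019-03-18 2020-08-18 北京三快在线科技有限公司 Search request processing method and apparatus, electronic device, and storage medium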

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1503163A (zh) * 2002-11-22 2004-06-09 �Ҵ���˾ 提供个性化为特定语言的搜索结果的国际搜索和传送系统
US20040205044A1 (en) * 2003-04-11 2004-10-14 International Business Machines Corporation Method for storing inverted index, method for on-line updating the same and inverted index mechanism
US20080133473A1 (en) * 2006-11-30 2008-06-05 Broder Andrei Z Efficient multifaceted search in information retrieval systems
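CN102023985A (zh) * 2009-09-17 2011-04-20 日电(中国)有限公司 Method and device for generating a blinded hybrid inverted index table, and method and device for joint keyword search
CN102136011A (zh) * 2011-05-09 2011-07-27 南开大学 Inverted index intersection method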

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
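CN100454907C (zh) * 2006-08-07 2009-01-21 华为技术有限公司 Method and device for implementing steering protection switching in a resilient packet ring
CN101242430B (zh) * 2008-02-22 2012-03-28 华中科技大学 Fixed-point data prefetching method in a peer-to-peer on-demand system
CN101930473A (zh) * 2010-09-14 2010-12-29 何吴迪 Method for constructing a cloud computing window search system with an executable structure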

Also Published As

Publication number Publication date
CN102136011A (zh) 2011-07-27

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE