WO2008098502A1 - Method and device for creating an index, and method and system for retrieval - Google Patents

Method and device for creating an index, and method and system for retrieval

Info

Publication number
WO2008098502A1
Authority
WO
WIPO (PCT)
Prior art keywords
compound word
index
valid
webpage
module
Prior art date
Application number
PCT/CN2008/070253
Other languages
English (en)
Chinese (zh)
Inventor
Haisong Yang
Zhiyuan Liu
Yunfeng Liu
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2008098502A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Definitions

  • The present invention relates to computer technology, and in particular to a method and apparatus for establishing an index and to a method and system for searching.
  • Background of the invention
  • Search engines are the technology that emerged to solve this "history" problem.
  • The navigation service provided by search engines has become a very important network service on the Internet, ranking alongside email as one of the most important Internet applications.
  • Search engines provide users with a retrieval service that uses spider programs to classify information on the Internet and helps users find what they need among the vast amount of Internet information.
  • A search engine works in three main steps: 1) crawling web pages from the Internet, 2) building an index database, and 3) searching and sorting in the index database.
  • The drawbacks of indexing only single words in a webpage are as follows: the split granularity of the search term is small, so the search engine performs a large number of index queries and set operations, the system's query efficiency is low, and searching is slow.
  • An embodiment of the present invention provides a method for establishing an index, including:
  • An embodiment of the present invention further provides a search method in which a webpage index is established for at least one compound word, where each compound word is a combination of at least two of the valid terms obtained from at least one webpage; the method includes:
  • retrieving the webpage index established for each compound word obtained by splitting the search term.
  • An embodiment of the present invention further provides an apparatus for establishing an index, including: a first module, configured to obtain at least two valid terms from at least one webpage and determine at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
  • a second module, configured to establish a webpage index for each compound word determined by the first module.
  • An embodiment of the present invention further provides a system for searching, including:
  • a first module, configured to establish a webpage index for at least one compound word, where each compound word is a combination of at least two of the valid terms obtained from at least one webpage;
  • a second module, configured to split the search term into at least one compound word and retrieve the webpage index established for each compound word obtained by splitting the search term.
  • Indexing these combined terms separately means that a search term is split into fewer, larger pieces during a search. This reduces the number of index queries and the number of intersection and union operations the search engine performs, greatly improves search speed, lets the engine respond quickly to users, and improves the user experience.
  • Because multi-term combinations are indexed selectively on the basis of probability statistics, the utilization of the index database and the retrieval accuracy of the system are improved.
  • FIG. 1 is a system configuration diagram of a search system in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a search method in establishing an index database in an embodiment of the present invention.
  • FIG. 3 is a flow chart of the search method after receiving a retrieval request in an embodiment of the present invention.
  • Mode for carrying out the invention
  • The search system 10 includes a webpage crawling module 100, a webpage database 200, an indexing module 300, an index database 400, and a search module 500, connected in sequence.
  • The webpage crawling module 100 is responsible for automatically extracting information from the Internet and storing the extracted information in the webpage database 200.
  • The general practice is as follows: the webpage crawling module 100 automatically accesses the Internet through a web spider program capable of automatically collecting webpages, follows each URL (Uniform Resource Locator) in the current webpage to other webpages, and repeats this process, collecting all of the webpages traversed into the webpage database 200.
  • The "spider" program automatically extracts the information and URLs of websites into the webpage database 200; the other approach is website submission, in which the website owner actively submits the URL to the search engine, and the search engine's webpage crawling module 100 directs the "spider" program to scan the websites corresponding to those URLs within a certain period (for example, anywhere from 2 days to several months) and store the relevant information in the webpage database 200.
  • The webpage database 200 is responsible for storing all the webpages obtained by the webpage crawling module 100 for users to search.
  • The indexing module 300 is responsible for analyzing the webpages stored in the webpage database 200 and extracting relevant webpage information (including the URL of the webpage, the encoding type, the keywords contained in the page content, the positions of the keywords, the generation time, the size, the link relationships with other webpages, etc.). It performs a large number of complex calculations according to a certain relevance algorithm to obtain the relevance (or importance) of each webpage's page content and hyperlinks to each term, then uses this information to establish a term index and stores the established term index in the index database 400.
  • The indexing module 300 includes a document pre-processing unit 301, a word segmentation unit 302, a word frequency statistics unit 303, and an index establishing unit 304.
  • The document pre-processing unit 301 is responsible for reading a webpage from the webpage database 200 and converting the different data formats of the input into a standard data format, for example converting an HTML page, an e-mail, or a PDF file into a text file. Script identifiers and useless advertisement information also need to be filtered out before the result is output to the word segmentation unit 302.
  • The word segmentation unit 302 is responsible for performing word segmentation on the format-converted webpage content. To improve system efficiency, stop words and function words need to be removed before word segmentation (they can of course also be removed after word segmentation), leaving only valid terms. In this embodiment, the word segmentation unit 302 divides the body and title of the converted webpage into words according to a dictionary. For example, with stop words removed after word segmentation, "I saw you there" is divided into five valid terms: "I", "see", "you", "in", and "that".
  • Existing word segmentation algorithms can be divided into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.
  • A word segmentation method based on string matching is employed. This method, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).
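  • The text does not say which matching strategy is used. As a minimal sketch, assuming forward maximum matching (one common mechanical strategy) and a tiny illustrative dictionary, string-matching segmentation could look like the following; the function name `segment`, the `max_len` parameter, and the dictionary contents are all hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest
    substring (up to max_len characters) that appears in the dictionary.
    A single character is accepted as a fallback when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Illustrative dictionary only; a real machine dictionary would be far larger.
DICTIONARY = {"中国", "人民", "银行"}
print(segment("中国人民银行", DICTIONARY))
```

  • Under this assumption, the string for "People's Bank of China" is cut into the three dictionary entries corresponding to "China", "People", and "Bank".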
  • The word frequency statistics unit 303 is responsible for performing word frequency statistics, laying the foundation for establishing a compound word index.
  • A compound word is a combination of two or more words (i.e., terms) that carries a meaning or expresses a certain relationship.
  • Such a term is a compound word; it is actually composed of the words "eat" and "apple".
  • "Bank of China" and "ceramic sand" are compound words composed of two terms.
  • The word frequency of a term is the number of times the term appears in a document. For example, if a term appears thirty times in a document, its word frequency for that document is thirty.
  • The word frequency statistics unit 303 first forms various combinations of the valid terms output by the word segmentation unit 302. For example, the phrase "Chinese strategic choice of international intellectual property rights and domestic strategic arrangements" yields combinations such as "Chinese knowledge", "intellectual property rights", "China's intellectual property rights", "property rights international", and "international strategy". Once all combinations have been counted, they are sorted by frequency, and the combinations whose frequency is greater than the set threshold are output as compound words to the index establishing unit 304. Compound words obtained through probability statistics closely match real usage and require no manual intervention, so this approach achieves good results. Of course, other methods can be used to determine compound words, such as treating combinations commonly used in daily life as compound words; the present invention is not limited in this respect.
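  • The counting-and-threshold step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: combinations are assumed to be runs of two or three adjacent valid terms (the text only says "various combinations"), and the names `find_compound_words`, `threshold`, and `max_terms` are hypothetical.

```python
from collections import Counter


def find_compound_words(documents, threshold, max_terms=3):
    """documents: list of pages, each a list of valid terms in page order.
    Counts every run of 2..max_terms adjacent terms and returns the runs
    whose frequency is greater than the threshold, most frequent first."""
    counts = Counter()
    for terms in documents:
        for n in range(2, max_terms + 1):
            for i in range(len(terms) - n + 1):
                counts[tuple(terms[i:i + n])] += 1
    # most_common() sorts by descending frequency, as the text describes.
    return [combo for combo, freq in counts.most_common() if freq > threshold]


pages = [
    ["China", "people", "bank"],
    ["China", "people", "republic"],
    ["China", "people"],
]
print(find_compound_words(pages, threshold=2))
```

  • In this toy corpus only the pair ("China", "people") occurs more than twice, so it is the only combination promoted to a compound word.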
  • The index establishing unit 304 is responsible for indexing all the valid terms output by the word segmentation unit 302 and the compound words output by the word frequency statistics unit 303, and saving the established index to the index database 400.
  • Alternatively, the index establishing unit 304 may index only the valid terms that are not constituents of any compound word, together with the compound words output by the word frequency statistics unit 303.
  • The index establishing unit 304 also sends the compound words output by the word frequency statistics unit 303 to the index database 400, which stores all the compound words received in a compound word list (not shown in FIG. 1).
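  • A minimal sketch of an index that covers both single valid terms and compound words is shown below. It assumes an in-memory inverted index that maps each entry to the set of page identifiers containing it; the name `build_index` and the tuple-based key layout are illustrative choices, not details taken from the source.

```python
from collections import defaultdict


def build_index(pages, compound_words):
    """pages: {page_id: [valid terms in page order]}.
    compound_words: set of tuples of adjacent terms.
    Returns {entry: set of page_ids}, where an entry is a 1-tuple for a
    single valid term or a longer tuple for a compound word."""
    index = defaultdict(set)
    max_n = max((len(c) for c in compound_words), default=1)
    for page_id, terms in pages.items():
        # Index every single valid term.
        for term in terms:
            index[(term,)].add(page_id)
        # Index every run of adjacent terms that is a known compound word.
        for n in range(2, max_n + 1):
            for i in range(len(terms) - n + 1):
                combo = tuple(terms[i:i + n])
                if combo in compound_words:
                    index[combo].add(page_id)
    return dict(index)


pages = {1: ["China", "people", "bank"], 2: ["people", "bank"], 3: ["China", "people"]}
index = build_index(pages, {("China", "people")})
print(index[("China", "people")])
```

  • Because the compound word has its own entry, a query for it needs one lookup instead of two lookups followed by an intersection.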
  • The search module 500 is responsible for splitting the search term after the user submits a search request, finding all relevant webpages that match the search term in the index database 400, calculating and sorting them, and returning them to the user.
  • The search module 500 includes a search term segmentation unit 501, a search unit 502, and a result processing unit 503.
  • The search term segmentation unit 501 splits the search term based on the valid terms and the compound word list in the index database 400, and sends the search terms obtained after splitting to the search unit 502. If the search term is "People's Bank of China", the valid terms are "China", "People", and "Bank". If "Chinese people" is in the compound word list but "Chinese bank" and "People's Bank" are not, the search term is split into two search terms: "Chinese people" and "bank".
  • The search unit 502 looks up in the index database 400 the search terms obtained after splitting, extracts the webpages that satisfy the condition, and sends them to the result processing unit 503.
  • The result processing unit 503 performs intersection and union operations on the received webpages to obtain a result page collection, then calculates the relevance between each webpage and the search term and, according to the relevance values, returns the first K webpages to the user (K is a natural number; the links of these K webpages are placed on one page). If the user wishes to view the second page, the links of the webpages ranked K+1 through 2*K in the sorted result are placed on the second page and returned to the user. In other embodiments of the present invention, all result pages found may also be returned to the user at once. In other embodiments of the present invention, the result pages corresponding to the compound words in the search term input by the user are ranked first.
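  • The paging rule above (page 1 holds results 1 through K, page 2 holds results K+1 through 2*K) can be sketched as a simple slice; the function name `result_page` and its parameters are hypothetical.

```python
def result_page(ranked_links, page_number, k):
    """Return the links for a 1-based page number, k links per page.
    ranked_links is assumed to be already sorted by relevance."""
    start = (page_number - 1) * k
    return ranked_links[start:start + k]


links = ["link1", "link2", "link3", "link4", "link5", "link6", "link7"]
print(result_page(links, 2, 3))
```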
  • The link database (not shown in FIG. 1) provides a basis for the webpage rating performed by the webpage rating module (not shown in FIG. 1).
  • The search module 500 searches for the relevant webpages in the index database 400, and the webpage rating module combines the query request and the link information to evaluate the relevance of the search results. The search module 500 then sorts the results by relevance, extracts content summaries for the search terms, and organizes the final page returned to the user.
  • The system can split the search term into "Chinese people" and "bank", perform two index queries, and then perform one intersection operation and one union operation.
  • The search result is then returned to the user. This reduces the number of intersection and union operations and improves the search speed.
  • The search method of the present invention includes the following steps when establishing or updating the index database 400:
  • Step S11: read a webpage, convert its text into a standard data format, and filter out irrelevant information such as script identifiers and advertisement information;
  • Step S12: perform word segmentation after removing stop words and function words;
  • Step S13: perform word frequency statistics on the various combinations of valid terms obtained by word segmentation;
  • Step S14: output the combined terms whose frequency is greater than the set threshold as compound words;
  • Step S15: index the compound words whose frequency is greater than the set threshold together with all the valid terms obtained by word segmentation, and save the index.
  • The established index may be updated periodically, for example by adding a compound word and indexing the newly added compound word, by updating the webpage information in an existing compound word index, or by deleting a compound word and its index.
  • A compound word may be added when the number of occurrences of a combination of valid terms in the webpages changes from less than the set threshold to greater than the set threshold;
  • An embodiment of the invention further provides a search method in which, after the user's search term is received, the following steps are performed:
  • The search term is split to obtain at least one search term.
  • The search term is preferentially split into compound words, and the valid terms that are not part of any compound word are used directly as search terms.
  • If the search term can be split into multiple compound words and one of them entirely contains another, the contained compound word is not used as a search term; that is, a compound word used as a search term is not contained in any other compound word.
  • If the search term itself exists in the compound word list, the entire search term is used directly as a search term.
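  • The splitting rules above can be sketched as follows. This is one possible reading, with assumptions stated: the search term has already been segmented into valid terms, compound words are runs of adjacent terms, and overlapping compound words that do not contain one another are all kept (the text does not address that case). The name `split_search_term` is hypothetical.

```python
def split_search_term(terms, compound_words):
    """terms: valid terms of the search term, in order.
    compound_words: set of tuples of adjacent terms.
    Returns the search terms to look up, each as a tuple of terms."""
    terms = tuple(terms)
    # Rule: a search term that is itself a compound word is used whole.
    if terms in compound_words:
        return [terms]
    # Find every run of two or more adjacent terms that is a compound word.
    spans = [(i, j)
             for i in range(len(terms))
             for j in range(i + 2, len(terms) + 1)
             if terms[i:j] in compound_words]
    # Rule: drop a compound word that lies entirely inside another one.
    maximal = [s for s in spans
               if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)]
    covered = {k for i, j in maximal for k in range(i, j)}
    pieces = [(i, terms[i:j]) for i, j in maximal]
    # Rule: valid terms outside every compound word are used directly.
    pieces += [(k, (terms[k],)) for k in range(len(terms)) if k not in covered]
    return [piece for _, piece in sorted(pieces)]


print(split_search_term(["China", "people", "bank"], {("China", "people")}))
```

  • With ("China", "people") as the only compound word, the three valid terms collapse into two search terms, matching the "People's Bank of China" example in the text.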
  • An index query is performed in the index database for each of the at least one search term to obtain at least one result set.
  • FIG. 3 is a flow chart of the search method after receiving the search request in the embodiment of the present invention. As shown in FIG. 3:
  • Step S21: split the search term according to the compound word list, obtaining "Chinese people" and "bank";
  • Step S22: perform an index query on "Chinese people" in the index database to obtain a result set R1, and perform an index query on "bank" to obtain a result set R2;
  • Step S23: perform an intersection operation on the sets R1 and R2 to obtain a set R3;
  • Step S24: perform a union operation on the sets R1 and R2 to obtain a set R4;
  • Step S25: sort the results and return them to the user, with the webpages in the set R3 ranked first, followed by the webpages in the set R4 that are not in the set R3.
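  • Steps S22 to S25 can be sketched with Python sets as follows. The index layout (a dict from search term to a set of page identifiers) and the name `search` are assumptions for illustration, and sorting by page identifier within each group stands in for the relevance ranking, which the text does not specify.

```python
def search(search_terms, index):
    """search_terms: the pieces produced by splitting the search term.
    index: {search term: set of page_ids}.
    Returns page_ids with the intersection (R3) first, followed by the
    pages that are in the union (R4) but not in the intersection."""
    # S22: one index query per search term.
    result_sets = [index.get(term, set()) for term in search_terms]
    if not result_sets:
        return []
    r3 = set.intersection(*result_sets)  # S23: intersection
    r4 = set.union(*result_sets)         # S24: union
    # S25: R3 first, then the rest of R4. Sorting by id is a placeholder
    # for the relevance ordering.
    return sorted(r3) + sorted(r4 - r3)


index = {("China", "people"): {1, 3}, ("bank",): {1, 2}}
print(search([("China", "people"), ("bank",)], index))
```

  • Here page 1 contains both search terms and is ranked first; pages 2 and 3 each contain only one of them and follow.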
  • Compound words can be searched at the same time during a search to achieve a comprehensive and complete result.

Abstract

The invention relates to a method for creating an index, comprising acquiring at least two valid words from at least one web page, determining at least one compound word, each compound word being a combination of said acquired valid words, and creating an index for each of the compound words. The invention also relates to a device for creating an index, and further to a retrieval method and system.
PCT/CN2008/070253 2007-02-06 2008-02-02 Method and device for creating an index, and method and system for retrieval WO2008098502A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710073220.7 2007-02-06
CNB2007100732207A CN100498790C (zh) 2007-02-06 2007-02-06 一种搜索方法和系统

Publications (1)

Publication Number Publication Date
WO2008098502A1 (fr) 2008-08-21

Family

ID=38906535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070253 WO2008098502A1 (fr) 2007-02-06 2008-02-02 Method and device for creating an index, and method and system for retrieval

Country Status (2)

Country Link
CN (1) CN100498790C (fr)
WO (1) WO2008098502A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377190A (zh) * 2012-04-11 2013-10-30 阿里巴巴集团控股有限公司 一种基于交易平台的供应商信息搜索方法和装置
CN111368022A (zh) * 2020-02-28 2020-07-03 山东汇贸电子口岸有限公司 一种使用反向索引实现书籍筛选的方法及工具

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100498790C (zh) * 2007-02-06 2009-06-10 腾讯科技(深圳)有限公司 一种搜索方法和系统
US8468142B2 (en) * 2008-08-06 2013-06-18 Fujitsu Limited Caching query results with binary decision diagrams (BDDs)
CN101789872B (zh) * 2009-01-22 2012-12-26 蔡亮华 基于互联网的信息处理方法
CN101876981B (zh) * 2009-04-29 2015-09-23 阿里巴巴集团控股有限公司 一种构建知识库的方法及装置
CN102200984A (zh) * 2010-03-24 2011-09-28 深圳市腾讯计算机系统有限公司 一种基于复合词的搜索方法和搜索引擎服务器
CN102314464B (zh) * 2010-07-07 2013-06-05 北京亮点时间科技有限公司 歌词搜索方法及搜索引擎
CN101894160B (zh) * 2010-07-21 2012-02-08 同方知网(北京)技术有限公司 一种智能检索方法
CN101986309A (zh) * 2010-11-16 2011-03-16 无敌科技(西安)有限公司 查询题库的方法及装置
CN102591865A (zh) * 2011-01-06 2012-07-18 北京瑞信在线系统技术有限公司 音乐词汇文件索引建立方法及其装置、音乐词汇搜索方法及其系统
CN103246664B (zh) * 2012-02-07 2016-05-25 阿里巴巴集团控股有限公司 网页检索方法和装置
KR102072113B1 (ko) 2012-10-17 2020-02-03 삼성전자주식회사 사용자 단말 장치 및 제어 방법
CN103309991A (zh) * 2013-06-19 2013-09-18 南京邮电大学 一种基于高校危险品仓库管理程序中的查询方法
CN103559313B (zh) * 2013-11-20 2018-02-23 北京奇虎科技有限公司 搜索方法及装置
CN103955449B (zh) * 2014-04-21 2018-03-06 安一恒通(北京)科技有限公司 定位目标样本的方法和装置
CN105653546B (zh) * 2014-11-11 2019-10-25 北大方正集团有限公司 一种目标主题的检索方法和系统
CN104462553B (zh) * 2014-12-25 2019-02-26 北京奇虎科技有限公司 问答页面相关问题推荐方法及装置
CN105069063A (zh) * 2015-07-29 2015-11-18 百度在线网络技术(北京)有限公司 图片搜索方法及装置
CN108427688B (zh) * 2017-02-15 2022-01-28 北京国双科技有限公司 数据查询方法和装置
CN107463655A (zh) * 2017-07-27 2017-12-12 无锡雅座在线科技股份有限公司 查询数据的方法、装置和系统
CN107659468A (zh) * 2017-10-10 2018-02-02 深圳市吉祥腾达科技有限公司 一种路由器安全可靠性的测试方法
CN108416264A (zh) * 2018-01-29 2018-08-17 山东汇贸电子口岸有限公司 一种支持ocr输入的搜索方法及搜索模块
CN108920697A (zh) * 2018-07-16 2018-11-30 佛山市影腾科技有限公司 一种应用程序的推荐方法、装置及终端
CN110287881A (zh) * 2019-06-26 2019-09-27 上海交通大学 图书识别系统、图书识别方法、电子装置及储存介质
CN113806477A (zh) * 2021-08-26 2021-12-17 广东广信通信服务有限公司 一种文本自动标注方法、装置、终端及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1677402A (zh) * 2004-03-31 2005-10-05 奥西-技术有限公司 用于确定复合词的构成词的设备和计算机化方法
US20050251384A1 (en) * 2004-05-05 2005-11-10 Microsoft Corporation Word extraction method and system for use in word-breaking
CN1770159A (zh) * 2005-10-28 2006-05-10 北大方正集团有限公司 一种网络内容引用自动发现的方法
CN101079056A (zh) * 2007-02-06 2007-11-28 腾讯科技(深圳)有限公司 一种搜索方法和系统

Also Published As

Publication number Publication date
CN100498790C (zh) 2009-06-10
CN101079056A (zh) 2007-11-28

Similar Documents

Publication Publication Date Title
WO2008098502A1 (fr) Method and device for creating an index, and method and system for retrieval
CN100405371C (zh) 一种提取新词的方法和系统
CN103365924B (zh) 一种互联网信息搜索的方法、装置和终端
CN101908071B (zh) 一种提高搜索引擎搜索效率的方法及其系统
Chen et al. Web mining for web image retrieval
US7636714B1 (en) Determining query term synonyms within query context
CN101169780A (zh) 一种基于语义本体的检索系统和方法
US8527487B2 (en) Method and system for automatic construction of information organization structure for related information browsing
JP5084858B2 (ja) サマリ作成装置、サマリ作成方法及びプログラム
WO2008097856A2 (fr) Moteur de distribution de résultats de recherche
WO2009000174A1 (fr) Procédé et dispositif de classement de pages web
CN103955529A (zh) 一种互联网信息搜索聚合呈现方法
CN1389811A (zh) 搜索引擎的智能化搜索方法
KR20160124079A (ko) 인-메모리 데이터베이스 탐색을 위한 시스템 및 방법
CN101261629A (zh) 基于自动分类技术的特定信息搜索方法
Vavliakis et al. Event Detection via LDA for the MediaEval2012 SED Task.
CN103425742A (zh) 一种网站的搜索方法和装置
JP4769822B2 (ja) ページグループを用いた情報検索サービス提供サーバー、方法及びシステム
CN103226601A (zh) 一种图片搜索的方法和装置
CN109948015B (zh) 一种元搜索列表结果抽取方法及系统
CN114238735A (zh) 一种互联网数据智能采集方法
JP2006164086A (ja) オンライン知識検索支援装置、およびオンライン知識検索支援方法
WO2017113324A1 (fr) Procédé de filtrage d'url reposant sur une expression régulière
TWI483129B (zh) Retrieval method and device
CN101923548A (zh) 一种互联网信息搜索方法及一种搜索引擎

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08706628

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 3974/CHENP/2009

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: LOSS OF RIGHTS COMMUNICATION (EPO F1205A OF 09.12.09)

122 Ep: pct application non-entry in european phase

Ref document number: 08706628

Country of ref document: EP

Kind code of ref document: A1