WO2008098502A1 - Method and device for creating index as well as method and system for retrieving - Google Patents

Method and device for creating index as well as method and system for retrieving

Info

Publication number
WO2008098502A1
Authority
WO
WIPO (PCT)
Prior art keywords
compound word
index
valid
webpage
module
Prior art date
Application number
PCT/CN2008/070253
Other languages
French (fr)
Chinese (zh)
Inventor
Haisong Yang
Zhiyuan Liu
Yunfeng Liu
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2008098502A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Definitions

  • The present invention relates to computer technology, and in particular to a method and apparatus for establishing an index and a method and system for searching.
  • Search engines are the technology that emerged to solve this "getting lost" problem.
  • The navigation service provided by search engines has become a very important network service on the Internet and, alongside e-mail, one of the most important Internet applications.
  • A search engine provides users with an information "retrieval" service: it uses spider programs to classify the information on the Internet so that users can find what they need among massive amounts of Internet information.
  • A search engine works in three main steps: 1) crawling web pages from the Internet, 2) building an index database, and 3) searching and ranking in the index database.
  • The disadvantages of indexing only single words in a web page are as follows: search terms are split at a fine granularity, the search engine performs many index queries and set operations, system query efficiency is low, and search speed is low.
  • An embodiment of the present invention provides a method for establishing an index, including:
  • the embodiment of the present invention further provides a search method, which is to establish a web page index for at least one compound word, where the compound word is a combination of at least two valid terms in a valid term obtained from at least one web page; the method includes:
  • The web page index established for each compound word obtained by splitting the search term is retrieved.
  • An embodiment of the present invention further provides an apparatus for establishing an index, including: a first module, configured to obtain at least two valid terms from at least one webpage, and determine at least one compound word, each compound word being a combination of at least two valid terms in the obtained valid term;
  • the second module is configured to establish a webpage index for each compound word determined by the first module.
  • An embodiment of the present invention further provides a system for searching, including:
  • a first module configured to establish a webpage index for the at least one compound word, wherein the compound word is a combination of at least two valid terms in the valid term obtained from the at least one webpage;
  • the second module is configured to split the search term into at least one compound word; and retrieve the index of the web page established for each compound word obtained after the search word is split.
  • Indexing these combined terms separately reduces how finely search terms must be split at search time, which reduces the number of index queries and of intersection and union operations, greatly improves retrieval speed, lets the search engine respond to users quickly, and improves the user experience.
  • Because multi-word terms are indexed selectively on the basis of probability statistics, the utilization of the index database and the retrieval accuracy of the system are improved.
  • FIG. 1 is a system configuration diagram of a search system in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a search method in establishing an index database in an embodiment of the present invention.
  • FIG. 3 is a flow chart of the search method after receiving a retrieval request in an embodiment of the present invention.
  • The search system 10 includes a web page crawling module 100, a web page database 200, an index module 300, an index database 400 and a search module 500, connected in sequence.
  • The web page crawling module 100 is responsible for automatically extracting information from the Internet and storing the extracted information in the web page database 200.
  • Typically, the web page crawling module 100 uses a web spider program that can automatically collect web pages from the Internet: it visits the Internet automatically, follows each URL (Uniform Resource Locator) in the current web page to other web pages, repeats this process, and collects all traversed web pages into the web page database 200.
  • URL: Uniform Resource Locator
  • Once a new website is found, the "spider" program automatically extracts the website's information and URL and adds them to the web page database 200. The other way is submitted-site search: the website owner actively submits a URL to the search engine.
  • Within a certain time (for example, from 2 days to several months), the search engine's web page crawling module 100 directs the "spider" program to scan the websites at the submitted URLs and stores the relevant information in the web page database 200.
  • the webpage database 200 is responsible for storing all the webpages obtained by the webpage crawling module 100 for the user to search for.
  • The index module 300 is responsible for analyzing the web pages stored in the web page database 200 and extracting relevant web page information (including the page's URL, encoding type, the keywords contained in the page content, keyword positions, generation time, size, and link relationships with other web pages). It performs extensive calculations according to a relevance algorithm to obtain the relevance (or importance) of each page's content and hyperlinks to each term, then uses this information to build a term index and stores the term index in the index database 400.
  • The index module 300 includes a document pre-processing unit 301, a word segmentation unit 302, a word frequency statistics unit 303, and an index establishing unit 304.
  • The document pre-processing unit 301 is responsible for reading a web page from the web page database 200 and converting the different data formats in the input page into a standard data format, for example converting an HTML page, e-mail or PDF file into a text file. It also filters out script identifiers and useless advertisement information, and then outputs the result to the word segmentation unit 302.
  • The word segmentation unit 302 is responsible for segmenting the format-converted web page content into words. To improve system efficiency, stop words and function words are removed before segmentation (they can also be removed after segmentation), leaving only valid terms. In this embodiment, the word segmentation unit 302 splits the body and title of the converted web page into words according to a dictionary. For example, "I saw you there", with stop words removed, is segmented into five valid terms: "I", "see", "you", "at" and "there".
  • Existing word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics.
  • In this embodiment, a segmentation method based on string matching is used. This method, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).
  • the word frequency statistics unit 303 is responsible for performing word frequency statistics and laying the foundation for establishing a compound word index.
  • A compound word is a combined term made up of two or more words (that is, terms), and it carries a certain meaning or expresses a certain relationship.
  • For example, "eat apple" is a compound word composed of the two terms "eat" and "apple".
  • "Bank of China" and "ceramic sand" are likewise compound words composed of two terms.
  • The word frequency of a term is the number of times the term appears in a document. For example, if a word appears thirty times in a document, the frequency of that term for the document is thirty.
  • The word frequency statistics unit 303 first forms various combinations of the valid terms output by the word segmentation unit 302. For example, the terms obtained by segmenting "international strategic choices and domestic strategic arrangements for China's intellectual property" are combined into "China knowledge", "intellectual property", "China intellectual property", "property international" and "international strategy". Once all combinations have been counted, they are sorted by frequency, and the combinations whose frequency exceeds a set threshold are output as compound words to the index establishing unit 304. Compound words obtained through such probability statistics are very close to actual usage and require no manual intervention, so good results can be achieved. Compound words may also be determined in other ways, for example by treating combinations commonly used in daily life as compound words; the present invention does not limit this.
  • the index establishing unit 304 is responsible for indexing all the valid terms output by the word segmentation unit 302 and the compound words output by the word frequency statistics unit 303, and saving the established index to the index database 400.
  • the index establishing unit 304 may also index the compound words that are not valid for each of the constituent composite words and the compound words output by the word frequency statistics unit 303.
  • the index establishing unit 304 also sends the compound words output by the word frequency statistics unit 303 to the index database 400, which stores all the compound words received in the compound word list (not shown in Fig. 1).
  • the search module 500 is responsible for splitting the search words after the user inputs the search term search request, and finds all relevant web pages that match the search term from the index database 400, performs calculation, sorts, and returns to the user.
  • the search module 500 includes a search term segmentation unit 501, a search unit 502, and a result processing unit 503.
  • The search term segmentation unit 501 splits the search term on the basis of the valid terms and the compound word list in the index database 400, and sends the terms obtained after splitting to the search unit 502. For example, if the search term is "People's Bank of China", its valid terms are "China", "People" and "Bank". If the compound word list contains "Chinese people" but neither "Chinese bank" nor "People's Bank", the search term is split into two terms: "Chinese people" and "bank".
  • the search unit 502 searches the index database 400 for the search term obtained after the search term segmentation, extracts the web page that satisfies the condition, and sends it to the result processing unit 503.
  • The result processing unit 503 performs intersection and union operations on the received web pages to obtain a set of result pages, calculates the relevance between each web page and the search term, and returns the first K pages to the user in order of relevance (K is a natural number; the links to these K pages are placed on one result page). If the user wishes to view the second result page, the links to the web pages ranked K+1 to 2*K are placed on the second page and returned to the user. In other embodiments of the present invention, all result pages found may be returned to the user at once. In other embodiments of the present invention, the result pages corresponding to the compound words in the search term entered by the user are ranked first.
  • the link database (not shown in Figure 1) provides a basis for the web page rating of the web page rating module (not shown in Figure 1).
  • The search module 500 searches the index database 400 for relevant web pages, the web page rating module evaluates the relevance of the search results using the query request and the link information, and the search module 500 sorts the results by relevance, extracts content summaries related to the search term, and assembles the final page returned to the user.
  • The system can split the search term into "Chinese people" and "bank", perform two index queries, and then perform one intersection operation and one union operation.
  • The search result is then returned to the user, which reduces the number of intersection and union operations and improves search speed.
  • the search method of the present invention includes the following steps when establishing or updating the index database 400:
  • Step S11: read a web page, convert it into a standard data format, and filter out irrelevant information such as script identifiers and advertisement information;
  • Step S12: remove stop words and function words, then perform word segmentation;
  • Step S13: perform word frequency statistics on the various combinations of valid terms obtained by word segmentation;
  • Step S14: output the combined terms whose frequency is greater than the set threshold as compound words;
  • Step S15: index the compound words whose frequency is greater than the set threshold together with all the valid terms obtained by word segmentation, and save the index.
  • The established index may be updated periodically, for example by adding a compound word and indexing it, by updating the web page information in an existing compound word index, or by deleting a compound word and its index.
  • a compound word may be added when the number of occurrences of a combination of valid terms in the web page changes from less than the set threshold to greater than the set threshold;
  • the embodiment of the invention further provides a search method, after receiving the user's search term, the following steps are performed:
  • the search term is segmented to obtain at least one search term.
  • The search term is split into compound words first, and any valid term that is not part of a compound word contained in the search term is used directly as a search term.
  • If the search term can be split into multiple compound words and one of them contains all of another, the contained compound word is not used as a search term; that is, a compound word used as a search term is not contained in any other compound word.
  • If the search term itself exists in the compound word list, the entire search term is used directly as a search term.
  • the at least one search term is indexed in the index database to obtain at least one result set.
  • FIG. 3 is a flow chart of the search method after receiving the search request in the embodiment of the present invention, as shown in FIG.
  • Step S21: first split the search term according to the compound word list, obtaining "Chinese people" and "bank".
  • Step S22: perform an index query on "Chinese people" in the index database to obtain a result set R1; perform an index query on "bank" to obtain a result set R2.
  • Step S23: perform an intersection operation on the sets R1 and R2 to obtain a set R3;
  • Step S24: perform a union operation on the sets R1 and R2 to obtain a set R4;
  • Step S25: sort the results and return them to the user, with the web pages in set R3 ranked first, followed by the web pages in set R4 that are not in set R3.
  • compound words can be searched for at the same time at the time of searching to achieve a comprehensive and complete result.
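
The splitting rules and steps S21 to S25 listed above can be sketched roughly as follows. This is a minimal Python illustration and not the patent's implementation: the function names `split_query` and `retrieve`, the greedy longest-match strategy, and the toy index contents are all assumptions introduced here. The terms "中国人民" and "银行" correspond to "Chinese people" and "bank" in the example.

```python
def split_query(valid_terms, compound_list):
    """Greedily merge adjacent valid terms into the longest listed compound word.

    Assumption: left-to-right longest match stands in for the splitting rules
    described above; the source does not prescribe a specific algorithm.
    """
    result, i = [], 0
    while i < len(valid_terms):
        # Try the longest run of terms first, falling back to a single term.
        for n in range(len(valid_terms) - i, 0, -1):
            candidate = "".join(valid_terms[i:i + n])
            if n == 1 or candidate in compound_list:
                result.append(candidate)
                i += n
                break
    return result


def retrieve(search_terms, index):
    """Steps S22 to S25: one index query per term, then R3 ahead of R4 minus R3."""
    result_sets = [index.get(term, set()) for term in search_terms]
    if not result_sets:
        return []
    r3 = set.intersection(*result_sets)   # pages matching every term
    r4 = set.union(*result_sets)          # pages matching any term
    # sorted() only makes the toy output deterministic; a real system
    # would order pages by relevance instead.
    return sorted(r3) + sorted(r4 - r3)


# Hypothetical data: page ids are made up for illustration.
compound_list = {"中国人民"}
index = {"中国人民": {1, 2}, "银行": {2, 3}}

terms = split_query(["中国", "人民", "银行"], compound_list)
print(terms)
print(retrieve(terms, index))
```

With the compound word in the list, the three valid terms collapse into two search terms, so two index queries are issued instead of three. If the whole search term were itself in the compound word list, the same greedy rule would return it as a single search term.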

Abstract

A method for creating an index is disclosed, which includes: acquiring at least two valid words from at least one web page; determining at least one compound word, wherein each compound word is a combination of at least two of the acquired valid words; and creating an index for each of the compound words. The invention further discloses a device for creating an index. Furthermore, a method and a system for retrieving are also disclosed.

Description

Method and Apparatus for Establishing an Index, and Method and System for Searching
Technical Field
The present invention relates to computer technology, and in particular to a method and apparatus for establishing an index and a method and system for searching.
Background of the Invention
With the rapid development of the Internet, information of all kinds has grown explosively, and finding information in this ocean of data is like looking for a needle in a haystack. Every Internet user faces information overload and cannot accurately find the information needed. Search engines are the technology that emerged to solve this "getting lost" problem. The navigation service provided by search engines has become a very important network service on the Internet and, alongside e-mail, one of the most important Internet applications. A search engine provides users with an information "retrieval" service: it uses spider programs to classify the information on the Internet so that users can find what they need among massive amounts of Internet information. A search engine works in three main steps: 1) crawling web pages from the Internet, 2) building an index database, and 3) searching and ranking in the index database.
At present, when an index database is built, generally only single words in a web page (unigrams) are indexed. When the search engine processes a user request, it therefore has to split the user's search term (word segmentation), run a separate index query for each resulting word, and obtain the search results for each word. For example, when a user searches for "北京体育馆" ("Beijing Gymnasium"), the search engine: (1) splits the search request "北京体育馆" into the two words "北京" ("Beijing") and "体育馆" ("gymnasium"); (2) performs an index query on "北京" to obtain result set A; (3) performs an index query on "体育馆" to obtain result set B; (4) intersects A and B to obtain their intersection X; (5) unions A and B to obtain their union Y; (6) outputs the search results to the user. The results are ordered as follows: the web pages in set X come first, followed by the elements of Y that are not in X. As another example, when the search term is "中国人民银行" ("People's Bank of China"), it is first split into "中国" ("China"), "人民" ("People") and "银行" ("Bank"), and three index queries are performed; if pairwise intersection and pairwise union are used, three intersection operations and three union operations are needed to obtain the final search results. The disadvantages of indexing only single words in a web page are that search terms are split at a fine granularity, the search engine performs many index queries and set operations, system query efficiency is low, and search speed is low.
Some search engines now index two-word terms (bigrams) in web pages when building the index database, but because many meaningless combinations are indexed along the way, storage space is wasted. Current bigram indexing ignores the logical relationship between words and indexes every two-word combination directly. Taking "我看见你在那里" ("I see you there") as an example, the bigrams include "我看见" ("I see"), "看见你" ("see you"), "你在" ("you at"), "在那里" ("at there"), and so on. Many of these combinations, such as "你在", are meaningless, so users do not get a good search experience. This also makes storage grow sharply, so that index capacity becomes insufficient.
Summary of the Invention
An embodiment of the present invention provides a method for establishing an index, including:
obtaining at least two valid terms from at least one web page;
determining at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
establishing a web page index for each compound word.
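
As a rough illustration of the last step, the sketch below builds a web page index keyed by compound word. It is a minimal Python example under stated assumptions, not the claimed method itself: the function name `build_compound_index`, the representation of a compound word as a tuple of terms, the rule that its terms must appear consecutively on a page, and the sample pages are all hypothetical.

```python
from collections import defaultdict


def build_compound_index(pages, compounds):
    """Map each compound word to the ids of the pages in which it occurs.

    pages:     page id -> list of valid terms, in page order
    compounds: iterable of term tuples, e.g. ("中国", "人民")
    """
    index = defaultdict(set)
    for page_id, terms in pages.items():
        for compound in compounds:
            n = len(compound)
            # Assumption: a compound occurs when its terms appear consecutively.
            for i in range(len(terms) - n + 1):
                if tuple(terms[i:i + n]) == tuple(compound):
                    index["".join(compound)].add(page_id)
                    break
    return dict(index)


# Hypothetical pages, already reduced to valid terms.
pages = {
    1: ["中国", "人民", "银行"],
    2: ["人民", "银行"],
    3: ["中国", "人民"],
}
print(build_compound_index(pages, [("中国", "人民")]))
```

Here the compound "中国人民" maps to pages 1 and 3, so a later search for that compound needs one index lookup rather than two lookups plus a set intersection.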
An embodiment of the present invention further provides a search method, in which a web page index is established for at least one compound word, the compound word being a combination of at least two valid terms among the valid terms obtained from at least one web page. The method includes:
splitting a search term into at least one compound word;
retrieving the web page index established for each compound word obtained by splitting the search term.
An embodiment of the present invention further provides an apparatus for establishing an index, including:
a first module, configured to obtain at least two valid terms from at least one web page and to determine at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
a second module, configured to establish a web page index for each compound word determined by the first module.
An embodiment of the present invention further provides a system for searching, including:
a first module, configured to establish a web page index for at least one compound word, where the compound word is a combination of at least two valid terms among the valid terms obtained from at least one web page;
a second module, configured to split a search term into at least one compound word, and to retrieve the web page index established for each compound word obtained by splitting the search term.
Indexing these combined terms separately reduces how finely search terms must be split at search time, which reduces the number of index queries and of intersection and union operations, greatly improves the retrieval speed of the search engine, achieves fast responses to users, and improves the user experience. At the same time, because multi-word terms are indexed selectively on the basis of probability statistics, the utilization of the index database and the retrieval accuracy of the system are improved.
Brief Description of the Drawings
FIG. 1 is a system structure diagram of the search system in an embodiment of the present invention.
FIG. 2 is a flow chart of the search method when establishing the index database in an embodiment of the present invention.
FIG. 3 is a flow chart of the search method after receiving a retrieval request in an embodiment of the present invention.
Mode for Carrying Out the Invention
The invention is further described below with reference to the drawings and specific embodiments.
FIG. 1 is a system structure diagram of the search system in an embodiment of the present invention. As shown in FIG. 1, the search system 10 includes a web page crawling module 100, a web page database 200, an index module 300, an index database 400 and a search module 500, connected in sequence.
The web page crawling module 100 is responsible for automatically extracting information from the Internet and saving it in the web page database 200. Typically, the web page crawling module 100 uses a web spider program that can automatically collect web pages from the Internet: it visits the Internet automatically, follows each URL (Uniform Resource Locator) in the current web page to other web pages, repeats this process, and collects all traversed web pages into the web page database 200. A search engine collects information automatically in two ways. One is periodic search: at regular intervals (for example, every 28 days), the web page crawling module 100 actively directs the "spider" program to scan Internet sites within a certain range of IP addresses; once a new website is found, the "spider" program automatically extracts the website's information and URL and adds them to the web page database 200. The other is submitted-site search: the website owner actively submits a URL to the search engine, and within a certain time (for example, from 2 days to several months) the search engine's web page crawling module 100 directs the "spider" program to scan the websites at the submitted URLs and stores the relevant information in the web page database 200.
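
The link-following behaviour described above can be sketched as a simple traversal. The following Python example is an assumption-laden illustration rather than the module's actual design: the function name `crawl`, the breadth-first visiting order, the `fetch_links` callback, and the in-memory `LINKS` table (used so the sketch runs without network access) are all hypothetical.

```python
from collections import deque


def crawl(seed_url, fetch_links):
    """Visit a page, then follow every URL found on it, skipping repeats."""
    visited = [seed_url]
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:          # skip pages already collected
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web": page URL -> URLs linked from that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

web_page_database = crawl("http://example.com/", lambda url: LINKS.get(url, []))
print(web_page_database)
```

The `seen` set is what keeps the traversal from looping when pages link back to each other, as the first and third pages do here.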
The web page database 200 is responsible for storing all the web pages obtained by the web page crawling module 100, so that they are available for user searches.
The index module 300 is responsible for analyzing the web pages stored in the web page database 200 and extracting relevant web page information (including the page's URL, encoding type, the keywords contained in the page content, keyword positions, generation time, size, and link relationships with other web pages). It performs extensive calculations according to a relevance algorithm to obtain the relevance (or importance) of each page's content and hyperlinks to each term, then uses this information to build a term index and stores the term index in the index database 400. In this embodiment, the index module 300 includes a document pre-processing unit 301, a word segmentation unit 302, a word frequency statistics unit 303 and an index establishing unit 304.
The document pre-processing unit 301 is responsible for reading a web page from the web page database 200 and converting the different data formats in the input page into a standard data format, for example converting an HTML page, e-mail or PDF file into a text file. It also filters out script identifiers and useless advertisement information, and then outputs the result to the word segmentation unit 302.
The word segmentation unit 302 is responsible for segmenting the format-converted web page content into words. To improve system efficiency, stop words and function words are removed before segmentation (they can also be removed after segmentation), leaving only valid terms. In this embodiment, the word segmentation unit 302 splits the body and title of the converted web page into words according to a dictionary. For example, "我看见了你在那里" ("I saw you there"), with stop words removed, is segmented into five valid terms: "我" ("I"), "看见" ("see"), "你" ("you"), "在" ("at") and "那里" ("there"). Existing word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. This embodiment uses a segmentation method based on string matching. This method, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is identified).
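
One common matching strategy for dictionary-based segmentation is forward maximum matching; the text above only says "a certain strategy", so choosing this one is an assumption. The Python sketch below applies it to the example sentence and removes stop words after segmentation (the alternative order the text allows). The function names, the tiny dictionary, the stop word set, and the `max_len` value are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches
        # so that unknown characters cannot stall the scan.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical machine dictionary and stop word list for the example sentence.
DICTIONARY = {"我", "看见", "你", "在", "那里"}
STOP_WORDS = {"了"}


def valid_terms(text):
    """Segment the text, then drop stop words, leaving only valid terms."""
    return [t for t in forward_max_match(text, DICTIONARY) if t not in STOP_WORDS]


print(valid_terms("我看见了你在那里"))
```

For the example sentence this yields the five valid terms named in the text. A production dictionary would be far larger, and the longest-match rule can mis-segment ambiguous strings, which is one reason the other two categories of methods exist.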
The word frequency statistics unit 303 computes word frequencies, which lays the foundation for building the compound word index. As the name suggests, a compound word is a combined term made up of two or more words (that is, terms): a combination of two or more terms that carries a certain meaning or whose parts are related in some way. For example, "吃苹果" (eat apples) is a compound word made up of the two terms "吃" (eat) and "苹果" (apple); likewise, "中国银行" (Bank of China) and "陶瓷沙" (ceramic sand) are compound words each made up of two terms. The word frequency of a term is the number of times the term appears in a document: if a word appears thirty times in a document, its frequency for that document is thirty. The word frequency statistics unit 303 first forms the various combinations of the valid terms output by the word segmentation unit 302. For example, the terms obtained by segmenting "中国知识产权的国际战略选择与国内战略安排" (the international strategic choices and domestic strategic arrangements for China's intellectual property) are combined into "中国知识", "知识产权", "中国知识产权", "产权国际", "国际战略", and so on. Once all combinations have been counted, they are sorted by frequency, and the combinations whose frequency exceeds a set threshold are output to the index building unit 304 as compound words. Compound words obtained through such probability statistics are very close to actual usage and require no manual intervention, so good results can be achieved. Compound words may of course also be determined in other ways, for example by treating combinations commonly used in everyday life as compound words; the present invention places no limitation on this.
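As a rough illustration of the frequency-based selection, the sketch below counts combinations of adjacent valid terms across pages and keeps those whose count exceeds a threshold. Restricting combinations to adjacent terms, the `max_terms` cap, and the function names are assumptions made for brevity; the description speaks only of "various combinations".

```python
from collections import Counter


def adjacent_combinations(terms, max_terms=3):
    """Yield every run of 2..max_terms adjacent terms, joined into one string."""
    for n in range(2, max_terms + 1):
        for i in range(len(terms) - n + 1):
            yield "".join(terms[i:i + n])


def find_compounds(pages, threshold, max_terms=3):
    """Return combinations whose total count is greater than the threshold.

    `pages` is a list of term lists, one per segmented web page.
    Results are ordered from most to least frequent.
    """
    counts = Counter()
    for terms in pages:
        counts.update(adjacent_combinations(terms, max_terms))
    return [word for word, count in counts.most_common() if count > threshold]


pages = [["中国", "人民", "银行"], ["中国", "人民"], ["人民", "银行"]]
print(find_compounds(pages, threshold=1))
```

Here "中国人民" and "人民银行" each occur twice and pass a threshold of 1, while "中国人民银行" occurs once and does not.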
The index building unit 304 builds an index for all the valid terms output by the word segmentation unit 302 and for the compound words output by the word frequency statistics unit 303, and saves the resulting index to the index database 400. Alternatively, the index building unit 304 may build an index for each valid term that does not take part in forming a compound word and for the compound words output by the word frequency statistics unit 303. The index building unit 304 also sends the compound words output by the word frequency statistics unit 303 to the index database 400, which stores all received compound words in a compound word table (not shown in Fig. 1).
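One way to picture the resulting index is as an inverted index that maps each valid term and each compound word to the set of pages containing it. The sketch below is a minimal in-memory version under that assumption; the page identifiers, the dictionary-of-sets layout, and the adjacency-based compound matching are illustrative choices, not details given in the description.

```python
from collections import defaultdict


def build_index(pages, compound_words, max_terms=3):
    """Build an inverted index over valid terms and compound words.

    `pages` maps a page id to its list of valid terms.
    `compound_words` is the set of compound words chosen earlier.
    """
    index = defaultdict(set)
    for page_id, terms in pages.items():
        # Index every valid term on the page.
        for term in terms:
            index[term].add(page_id)
        # Index every run of adjacent terms that forms a known compound word.
        for n in range(2, max_terms + 1):
            for i in range(len(terms) - n + 1):
                combo = "".join(terms[i:i + n])
                if combo in compound_words:
                    index[combo].add(page_id)
    return dict(index)


index = build_index({1: ["中国", "人民", "银行"], 2: ["银行"]}, {"中国人民"})
print(index["中国人民"], index["银行"])
```

A query for the compound word then needs a single lookup instead of intersecting the page sets of its component terms.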
After a user submits a search request with a search query, the search module 500 splits the query, finds all relevant web pages matching the query in the index database 400, and returns them to the user after computation and sorting. The search module 500 comprises a query segmentation unit 501, a search unit 502, and a result processing unit 503.
The query segmentation unit 501 segments the search query according to the valid terms and the compound word table in the index database 400, and sends the resulting search terms to the search unit 502. For example, when the search query is "中国人民银行" (People's Bank of China), the valid terms are "中国" (China), "人民" (people), and "银行" (bank). If the compound word table contains "中国人民" but neither "中国银行" nor "人民银行", the query is split into two search terms, "中国人民" and "银行". If the compound word table contains "中国人民", "中国银行", and "人民银行", the query is split into "中国人民", "中国银行", and "人民银行"; if the compound word table additionally contains "中国人民银行", then "中国人民银行" itself is used directly as the search term. The search unit 502 looks up the search terms obtained from segmenting the query in the index database 400, extracts the web pages that satisfy the conditions, and sends them to the result processing unit 503.
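The three cases in this example can be reproduced with a small sketch. It assumes the query has already been segmented into valid terms, treats any order-preserving combination of those terms (adjacent or not, since "中国银行" appears in the example) as a candidate compound, and discards candidates wholly contained in another candidate. The function name and data layout are hypothetical.

```python
from itertools import combinations


def split_query(terms, compound_table):
    """Split a segmented query into search terms, preferring compound words."""
    # Collect every order-preserving combination of two or more query terms
    # that appears in the compound word table, largest combinations first.
    candidates = []
    for n in range(len(terms), 1, -1):
        for positions in combinations(range(len(terms)), n):
            word = "".join(terms[i] for i in positions)
            if word in compound_table:
                candidates.append((set(positions), word))
    # Drop any compound whose terms are a proper subset of another compound's.
    kept = [(pos, word) for pos, word in candidates
            if not any(pos < other for other, _ in candidates)]
    covered = set().union(*(pos for pos, _ in kept))
    # Valid terms not covered by any kept compound become search terms as-is.
    leftovers = [terms[i] for i in range(len(terms)) if i not in covered]
    return [word for _, word in kept] + leftovers


terms = ["中国", "人民", "银行"]
print(split_query(terms, {"中国人民"}))
print(split_query(terms, {"中国人民", "中国银行", "人民银行"}))
print(split_query(terms, {"中国人民", "中国银行", "人民银行", "中国人民银行"}))
```

Enumerating all combinations is exponential in the number of query terms, which is acceptable for short queries but would need pruning for long ones.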
The result processing unit 503 applies intersection and union operations to the received web pages to obtain a set of result pages, then computes the relevance between each web page and the search terms and returns the top K pages to the user according to the relevance values (K is a natural number; the links to the K pages are placed on a single page). If the user wishes to view the second page, the links to the web pages ranked K+1 through 2*K in the sorted results are placed on a second page and returned to the user. In other embodiments of the present invention, all the page results found may instead be returned to the user at once. In other embodiments of the present invention, the result pages corresponding to compound words in the user's search query are ranked first.
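The paging rule (page 1 holds ranks 1 to K, page 2 holds ranks K+1 to 2*K) can be sketched as below. The relevance scores are taken as given, since the description does not say how they are computed; the function name and the `(link, score)` tuple layout are assumptions.

```python
def page_of_results(scored_pages, k, page_number):
    """Return the links shown on one result page.

    `scored_pages` is a list of (link, relevance) pairs; `page_number`
    starts at 1, so page 2 covers ranks K+1 through 2*K.
    """
    ranked = sorted(scored_pages, key=lambda item: item[1], reverse=True)
    start = (page_number - 1) * k
    return [link for link, _ in ranked[start:start + k]]


scored = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(page_of_results(scored, k=2, page_number=1))
print(page_of_results(scored, k=2, page_number=2))
```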
To understand the search system 10 of the search engine of the present invention, it should also be noted that "link information extraction" is carried out in parallel with index building: web page link information (including anchor text, the link itself, and so on) is stored in a link database (not shown in Fig. 1), which provides the basis for page rating by a page rating module (not shown in Fig. 1). When a user searches, the search module 500 looks up relevant web pages in the index database 400, while the page rating module combines the query request with the link information to evaluate the relevance of the search results. The search module 500 then sorts by relevance, extracts content summaries for the search terms, assembles the final page, and returns it to the user.
For example, when a user enters "中国人民银行" (People's Bank of China) as the search query, the system can split it into "中国人民" and "银行", perform two index lookups, and then one intersection and one union operation before returning the results to the user. Compared with the prior art, this reduces the number of intersection and union operations and increases search speed.
Fig. 2 is a flowchart of building the index database in the search method of an embodiment of the present invention. As shown in Fig. 2, the search method of the present invention comprises the following steps when creating or updating the index database 400:
Step S11: read a web page, convert its body into the standard data format, and filter out irrelevant information such as script identifiers and advertising information;
Step S12: remove stop words and function words, then perform word segmentation;
Step S13: compute word frequency statistics for the various combinations of the valid terms obtained by segmentation;
Step S14: output the combinations whose frequency exceeds the set threshold as compound words;
Step S15: build and save an index for the compound words whose frequency exceeds the set threshold and for all the valid terms obtained by segmentation.
In embodiments of the present invention, after the compound words have been indexed, the index may also be updated periodically, for example by adding a compound word and building an index for it, by updating the web page information in the index of an existing compound word, or by deleting a compound word together with the index built for it. A compound word may be added when the number of times a combination of valid terms appears in web pages changes from less than the set threshold to greater than the set threshold; a compound word may be deleted when [...].
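A periodic refresh of the compound word index might look like the following sketch. The add and update rules follow the text above; the deletion rule (drop a compound word once its count is no longer above the threshold) is an assumption, because the source does not spell that condition out here. The function name and the `stats` layout are hypothetical.

```python
def refresh_compound_index(compound_index, stats, threshold):
    """Update the compound word index in place from fresh statistics.

    `compound_index` maps compound word -> set of page ids.
    `stats` maps combination -> (occurrence count, set of page ids).
    """
    for combo, (count, page_ids) in stats.items():
        if count > threshold:
            # Either a newly qualifying compound word, or an existing one
            # whose page information is refreshed.
            compound_index[combo] = set(page_ids)
        elif combo in compound_index:
            # Assumed rule: remove the compound word and its index once the
            # count no longer exceeds the threshold.
            del compound_index[combo]
    return compound_index


index = {"人民银行": {1}}
stats = {"中国人民": (5, {1, 2}), "人民银行": (1, {1})}
print(refresh_compound_index(index, stats, threshold=3))
```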
An embodiment of the present invention further provides a search method that performs the following steps after receiving a user's search query:
Segment the search query according to the valid terms and the compound word table to obtain at least one search term. When segmenting, the query is preferentially split into compound words; valid terms in the query that do not take part in forming a compound word are used directly as search terms. When the query can be split into several compound words and one of them wholly contains another, the contained compound word is not used as a search term; that is, a compound word used as a search term is never contained in another compound word. For example, when the compound word table contains the search query itself, the entire query is used directly as a single search term.
Look up each of the at least one search terms in the index database to obtain at least one result set.
Return the at least one result set obtained to the user.
Before the result sets are returned to the user, they may be sorted: the intersection of all result sets is ranked first, followed by the part of the union of all result sets that lies outside the intersection.
The above flow is illustrated below using the search query "中国人民银行" (People's Bank of China) as an example. Fig. 3 is a flowchart of the search method after a search request is received in an embodiment of the present invention. As shown in Fig. 3, the flow is as follows.
Step S21: first segment the search query according to the compound word table, obtaining "中国人民" and "银行".
Step S22: look up "中国人民" in the index database to obtain result set R1; look up "银行" to obtain result set R2.
Step S23: compute the intersection of sets R1 and R2 to obtain set R3;
Step S24: compute the union of sets R1 and R2 to obtain set R4;
Step S25: sort the results and return them to the user, ranking the web pages in set R3 first, followed by the web pages in set R4 that are not in set R3.
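Steps S23 to S25 amount to a few set operations, sketched below. Page identifiers are plain integers and each group is sorted by identifier only to make the output deterministic; in practice the order inside each group would come from relevance scores, which this sketch does not model. The function name is hypothetical.

```python
def rank_results(result_sets):
    """Rank pages found by every search term ahead of pages found by only some.

    `result_sets` is a non-empty list of sets of page ids, one per search term.
    """
    common = set.intersection(*result_sets)   # S23: R3 = R1 ∩ R2
    combined = set.union(*result_sets)        # S24: R4 = R1 ∪ R2
    # S25: R3 first, then the pages of R4 that are not in R3.
    return sorted(common) + sorted(combined - common)


r1 = {1, 2, 3}  # pages matching the first search term
r2 = {2, 3, 4}  # pages matching the second search term
print(rank_results([r1, r2]))
```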
In other embodiments of the method of the present invention, a compound word may additionally be split and its component terms searched at search time, so that the results are comprehensive and complete.
The above are merely preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A method for building an index, characterized by comprising:
obtaining at least two valid terms from at least one web page;
determining at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
building a web page index for each compound word.
2. The method according to claim 1, characterized by further comprising:
building a web page index for each valid term obtained from the web page; or
building a web page index for each valid term that does not take part in forming a compound word.
3. The method according to claim 1, characterized by further comprising:
updating the web page index built for a compound word.
4. The method according to claim 1, characterized by further comprising:
adding at least one compound word, and building a web page index for each added compound word; and/or
deleting at least one compound word together with the web page index built for the deleted compound word.
5. The method according to claim 1, characterized in that determining at least one compound word comprises:
counting the number of times the various combinations of the at least two valid terms appear in web pages;
determining as compound words the combinations of valid terms whose number of occurrences is greater than a set threshold.
6. The method according to claim 5, characterized by further comprising:
when [...] becomes greater than the set threshold, adding a compound word and building a web page index for the added compound word, the added compound word being a combination of the at least two valid terms; and/or
when [...] the threshold, deleting the compound word and also deleting the web page index built for the compound word.
7. A search method, characterized in that a web page index is built for at least one compound word, the compound word being a combination of at least two of the valid terms obtained from at least one web page; the method comprises:
splitting a search query into at least one compound word;
retrieving the web page index built for each compound word obtained by splitting the search query.
8. The method according to claim 7, characterized in that building the index further comprises:
building a web page index for each valid term obtained from the web page; or
building a web page index for each valid term that does not take part in forming a compound word.
9. The method according to claim 8, characterized in that, when the search query is split, if the search query contains at least one valid term that does not take part in forming a compound word, the method further comprises:
retrieving the web page index built for each valid term in the search query that does not take part in forming a compound word.
10. The method according to claim 7, characterized in that, when the search query can be split into more than one compound word, splitting the search query into at least one compound word comprises:
splitting the search query into compound words that are not contained in other compound words.
11. An apparatus for building an index, characterized by comprising:
a first module, configured to obtain at least two valid terms from at least one web page and to determine at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
a second module, configured to build a web page index for each compound word determined by the first module.
12. The apparatus according to claim 11, characterized in that the second module is further configured to build a web page index for each valid term obtained from the web page, or to build a web page index for each valid term that does not take part in forming a compound word.
13. The apparatus according to claim 11, characterized in that the first module comprises:
a first unit, configured to obtain at least two valid terms from at least one web page;
a second unit, configured to determine at least one compound word, each compound word being a combination of at least two of the obtained valid terms.
14. The apparatus according to claim 11, characterized in that the first module comprises:
a third unit, configured to obtain at least two valid terms from at least one web page;
a fourth unit, configured to count the number of times combinations of at least two of the obtained valid terms appear in web pages, and to determine as compound words the combinations of valid terms whose number of occurrences is greater than a set threshold.
15. A system for searching, characterized by comprising:
a first module, configured to build a web page index for at least one compound word, the compound word being a combination of at least two of the valid terms obtained from at least one web page;
a second module, configured to split a search query into at least one compound word and, according to the web page index built by the first module, to retrieve the web page index built for each compound word obtained by splitting the search query.
16. The system according to claim 15, characterized by further comprising:
a third module, configured to store the at least one compound word determined by the first module and the web page index built for each compound word;
wherein the second module retrieves, through the third module, the web page index built for each compound word obtained by splitting the search query.
17. The system according to claim 15, characterized in that the first module comprises:
a first unit, configured to obtain at least two valid terms from at least one web page and to determine at least one compound word, each compound word being a combination of at least two of the obtained valid terms;
a second unit, configured to build a web page index for each compound word determined by the first unit.
18. The system according to claim 17, characterized in that the first unit comprises:
a first subunit, configured to obtain at least two valid terms from at least one web page;
a second subunit, configured to determine at least one compound word, each compound word being a combination of at least two of the valid terms obtained by the first subunit.
19. The system according to any one of claims 15 to 18, characterized in that the second module comprises:
a third unit, configured to split the search query into at least one compound word according to the compound words determined by the first module;
a fourth unit, configured to receive each compound word from the third unit and, according to the web page index built by the first module, to retrieve the web page index built for each such compound word.
20. The system according to claim 19, characterized in that the second module further comprises:
a fifth unit, configured to return to the user the web page links in the web page index retrieved by the fourth unit.
21. The system according to claim 15, characterized in that the first module is further configured to build a web page index for each valid term obtained from the web page, or to build a web page index for each valid term that does not take part in forming a compound word;
the second module is further configured, when the search query contains at least one valid term that does not take part in forming a compound word, to retrieve the web page index built for each valid term in the search query that does not take part in forming a compound word.
PCT/CN2008/070253 2007-02-06 2008-02-02 Method and device for creating index as well as method and system for retrieving WO2008098502A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNB2007100732207A CN100498790C (en) 2007-02-06 2007-02-06 Retrieving method and system
CN200710073220.7 2007-02-06

Publications (1)

Publication Number Publication Date
WO2008098502A1 true WO2008098502A1 (en) 2008-08-21

Family

ID=38906535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070253 WO2008098502A1 (en) 2007-02-06 2008-02-02 Method and device for creating index as well as method and system for retrieving

Country Status (2)

Country Link
CN (1) CN100498790C (en)
WO (1) WO2008098502A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377190A (en) * 2012-04-11 2013-10-30 阿里巴巴集团控股有限公司 Trading platform based supplier information searching method and device
CN111368022A (en) * 2020-02-28 2020-07-03 山东汇贸电子口岸有限公司 Method and tool for realizing book screening by using reverse index

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100498790C (en) * 2007-02-06 2009-06-10 腾讯科技(深圳)有限公司 Retrieving method and system
US8468142B2 (en) * 2008-08-06 2013-06-18 Fujitsu Limited Caching query results with binary decision diagrams (BDDs)
CN101789872B (en) * 2009-01-22 2012-12-26 蔡亮华 Information processing method based on internet
CN101876981B (en) * 2009-04-29 2015-09-23 阿里巴巴集团控股有限公司 A kind of method and device building knowledge base
CN102200984A (en) * 2010-03-24 2011-09-28 深圳市腾讯计算机系统有限公司 Search method based on compound words and search engine server
CN102314464B (en) * 2010-07-07 2013-06-05 北京亮点时间科技有限公司 Lyrics searching method and lyrics searching engine
CN101894160B (en) * 2010-07-21 2012-02-08 同方知网(北京)技术有限公司 Intelligent search method
CN101986309A (en) * 2010-11-16 2011-03-16 无敌科技(西安)有限公司 Method and device for inquiring question bank
CN102591865A (en) * 2011-01-06 2012-07-18 北京瑞信在线系统技术有限公司 Method and device for building music vocabulary file index and method and system of music vocabulary searching
CN103246664B (en) * 2012-02-07 2016-05-25 阿里巴巴集团控股有限公司 Web search method and apparatus
KR102072113B1 (en) 2012-10-17 2020-02-03 삼성전자주식회사 User terminal device and control method thereof
CN103309991A (en) * 2013-06-19 2013-09-18 南京邮电大学 Query method based on dangerous goods warehouse management program of university
CN103559313B (en) * 2013-11-20 2018-02-23 北京奇虎科技有限公司 Searching method and device
CN103955449B (en) * 2014-04-21 2018-03-06 安一恒通(北京)科技有限公司 The method and apparatus for positioning target sample
CN105653546B (en) * 2014-11-11 2019-10-25 北大方正集团有限公司 A kind of search method and system of target topic
CN104462553B (en) * 2014-12-25 2019-02-26 北京奇虎科技有限公司 Question and answer page relevant issues recommended method and device
CN105069063A (en) * 2015-07-29 2015-11-18 百度在线网络技术(北京)有限公司 Picture searching method and apparatus
CN108427688B (en) * 2017-02-15 2022-01-28 北京国双科技有限公司 Data query method and device
CN107463655A (en) * 2017-07-27 2017-12-12 无锡雅座在线科技股份有限公司 Inquire about the methods, devices and systems of data
CN107659468A (en) * 2017-10-10 2018-02-02 深圳市吉祥腾达科技有限公司 A kind of method of testing of Router Security reliability
CN108416264A (en) * 2018-01-29 2018-08-17 山东汇贸电子口岸有限公司 A kind of searching method and search module of supporting OCR to input
CN108920697A (en) * 2018-07-16 2018-11-30 佛山市影腾科技有限公司 A kind of recommended method of application program, device and terminal
CN110287881A (en) * 2019-06-26 2019-09-27 上海交通大学 Books identifying system, books recognition methods, electronic device and storage medium
CN113806477A (en) * 2021-08-26 2021-12-17 广东广信通信服务有限公司 Automatic text labeling method, device, terminal and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1677402A (en) * 2004-03-31 2005-10-05 奥西-技术有限公司 Apparatus and computerized method for determining constituent words of a compound word
US20050251384A1 (en) * 2004-05-05 2005-11-10 Microsoft Corporation Word extraction method and system for use in word-breaking
CN1770159A (en) * 2005-10-28 2006-05-10 北大方正集团有限公司 Method for automatically finding network content quotation
CN101079056A (en) * 2007-02-06 2007-11-28 腾讯科技(深圳)有限公司 Retrieving method and system


Also Published As

Publication number Publication date
CN101079056A (en) 2007-11-28
CN100498790C (en) 2009-06-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08706628

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 3974/CHENP/2009

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: LOSS OF RIGHTS COMMUNICATION (EPO F1205A OF 09.12.09)

122 Ep: pct application non-entry in european phase

Ref document number: 08706628

Country of ref document: EP

Kind code of ref document: A1