WO2008098502A1

WO2008098502A1 - Method and device for creating index as well as method and system for retrieving

Info

Publication number: WO2008098502A1
Application number: PCT/CN2008/070253
Authority: WO
Inventors: Haisong Yang; Zhiyuan Liu; Yunfeng Liu
Original assignee: Tencent Technology (Shenzhen) Company Limited
Priority date: 2007-02-06
Filing date: 2008-02-02
Publication date: 2008-08-21
Also published as: CN101079056A; CN100498790C

Abstract

A method for creating an index is disclosed, which includes: acquiring at least two valid words from at least one web page; determining at least one compound word, wherein each compound word is a combination of at least two of the valid words acquired; and creating an index for each compound word. The invention further discloses a device for creating an index. Furthermore, a method and a system for retrieving are also disclosed.

Description

Method and device for indexing, search method and system

Technical field

The present invention relates to computer technology, and in particular to a method and apparatus for establishing an index and a method and system for searching.

Background of the invention

With the rapid development of the Internet, the volume of information of every kind has exploded, and finding a particular piece of information in this ocean is like looking for a needle in a haystack. Every Internet user faces information overload and has difficulty locating the needed information accurately. Search engines emerged as the technology for solving this problem of getting lost in information. The navigation service they provide has become a very important network service and, alongside e-mail, one of the most important Internet applications. A search engine provides users with a retrieval service: it uses spider programs to gather and classify information on the Internet and helps users find what they need within that vast body of information. The operation of a search engine mainly includes three steps: 1) crawling web pages from the Internet, 2) building an index database, and 3) searching the index database and ranking the results.

Currently, when an index database is built, generally only single words in a web page, that is, unary words, are indexed. When the search engine processes a user request, it must therefore split the user's search term (word segmentation), perform an index query for each resulting word separately, and obtain the search results for each word. For example, when a user searches for "Beijing Gymnasium", the search engine proceeds as follows:

1) splits the user's search request "Beijing Gymnasium" into two words, "Beijing" and "Gymnasium";

2) performs an index query on "Beijing" to obtain a result set A;

3) performs an index query on "Gymnasium" to obtain a result set B;

4) performs an intersection operation on A and B to obtain their intersection X;

5) performs a union operation on A and B to obtain their union Y;

6) outputs the search results to the user.

The search results are ordered as follows: the pages in the set X are ranked first, followed by the elements of Y that are not in X. If the search term is "People's Bank of China", it is first split into "China", "People", and "Bank", and three index queries are performed; under a rule of pairwise intersection and pairwise union, three intersections and three unions must then be performed to obtain the final search results. Indexing only single words in a web page therefore has the following disadvantages: the search term is split at a fine granularity, the search engine performs many index queries and set operations, the query efficiency of the system is low, and searching is slow.

Some search engines also index binary words (two-word combinations) in web pages when building the index database, but because many of the combinations indexed in the process are meaningless, space is wasted. Current binary indexing directly indexes every binary combination regardless of the logical relationship between the words. For example, for "I see you there", the binary words include "I see", "see you", "you are", and so on. Many of these combinations, such as "you are", are meaningless, so users do not get a good search experience. Moreover, this causes the index to expand sharply in size, so that index space becomes insufficient.

Summary of the invention

An embodiment of the present invention provides a method for establishing an index, including:

obtaining at least two valid terms from at least one web page;

determining at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained;

creating a web page index for each compound word.

An embodiment of the present invention further provides a search method, in which a web page index has been established for at least one compound word, the compound word being a combination of at least two valid terms among valid terms obtained from at least one web page; the method includes:

splitting the search term into at least one compound word;

retrieving the web page index established for each compound word obtained after the search term is split.

An embodiment of the present invention further provides an apparatus for establishing an index, including: a first module, configured to obtain at least two valid terms from at least one webpage and to determine at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained;

a second module, configured to establish a webpage index for each compound word determined by the first module.

An embodiment of the present invention further provides a system for searching, including:

a first module, configured to establish a webpage index for the at least one compound word, wherein the compound word is a combination of at least two valid terms in the valid term obtained from the at least one webpage;

a second module, configured to split the search term into at least one compound word, and to retrieve the webpage index established for each compound word obtained after the search term is split.

In the embodiments of the present invention, terms are combined and the combined terms are indexed separately, which coarsens the granularity at which search terms are split during a search. This reduces the number of index queries and the number of intersection and union operations performed by the search engine, greatly improves search speed, allows the system to respond quickly to users, and improves the user experience. At the same time, because combinations of multiple terms are indexed selectively on the basis of probability statistics, the utilization of the index database and the retrieval accuracy of the system are improved.

Brief description of the drawings

FIG. 1 is a system configuration diagram of a search system in an embodiment of the present invention.

FIG. 2 is a flow chart of the search method when establishing an index database in an embodiment of the present invention.

FIG. 3 is a flow chart of the search method after receiving a retrieval request in an embodiment of the present invention.

Mode for carrying out the invention

The invention will now be further elucidated with reference to the drawings and specific embodiments.

FIG. 1 is a system configuration diagram of a search system in an embodiment of the present invention. As shown in FIG. 1, the search system 10 includes a webpage crawling module 100, a webpage database 200, an indexing module 300, an index database 400, and a search module 500 that are connected in sequence.

The webpage crawling module 100 is responsible for automatically extracting information from the Internet and storing the extracted information in the webpage database 200. The general practice is as follows: the webpage crawling module 100 accesses the Internet through a web spider program capable of automatically collecting webpages, follows each URL (Uniform Resource Locator) in the current webpage to other webpages, and repeats this process, collecting all of the webpages traversed into the webpage database 200. Search engines collect information automatically in two ways. One is regular search: at regular intervals (for example, every 28 days), the webpage crawling module 100 actively directs the spider program to search Internet sites within a certain range of IP addresses, and once a new website is discovered, the spider program automatically extracts the information and URL of the website into the webpage database 200. The other is search of submitted websites: the website owner actively submits a URL to the search engine, and within a certain period (for example, from 2 days to several months) the webpage crawling module 100 of the search engine directs the spider program to scan the website corresponding to the URL and store the relevant information in the webpage database 200.
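The link-following behaviour described above can be sketched as a breadth-first traversal. This is only an illustration: the function `crawl` and the callable `fetch_links` are hypothetical names, and a real spider would fetch pages over the network rather than read an in-memory link graph.

```python
from collections import deque

def crawl(seed_urls, fetch_links):
    """Visit pages breadth-first, following every URL found on each page.

    `fetch_links(url)` is a hypothetical callable that returns the URLs
    contained in the page at `url`; it stands in for the spider's download
    and link-extraction step.
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []  # pages in the order they would be stored in the database
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph used in place of the real Internet (assumption for the demo).
LINK_GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
pages = crawl(["a"], lambda url: LINK_GRAPH.get(url, []))
```

The `seen` set is what keeps the traversal from looping when pages link back to each other, as "c" does to "a" in the toy graph.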

The webpage database 200 is responsible for storing all the webpages obtained by the webpage crawling module 100 for the user to search for.

The indexing module 300 is responsible for analyzing the webpages stored in the webpage database 200 and extracting relevant webpage information (including the URL of the webpage, the encoding type, the keywords contained in the page content, the positions of the keywords, the generation time, the size, the link relationships with other webpages, and so on). It performs a large number of complex calculations according to a certain relevance algorithm to obtain the relevance (or importance) of each webpage's content and hyperlinks to each term, uses this information to establish a term index, and stores the established term index in the index database 400. In this embodiment, the indexing module 300 includes a document pre-processing unit 301, a word segmentation unit 302, a word frequency statistics unit 303, and an index establishing unit 304.

The document pre-processing unit 301 is responsible for reading a webpage from the webpage database 200 and converting the different data formats of input webpages into a standard data format, for example converting an HTML page, an e-mail, or a PDF file into a text file. Script identifiers and useless advertisement information also need to be filtered out before the result is output to the word segmentation unit 302.
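As an illustration of the HTML-to-text conversion, the following minimal sketch uses Python's standard `html.parser` module to keep visible text and drop script and style content. The names `TextExtractor` and `html_to_text` are hypothetical, and advertisement filtering and the e-mail and PDF cases are not covered.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""

    SKIPPED_TAGS = ("script", "style")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)

def html_to_text(html):
    """Return the text content of an HTML page as a single string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)

sample = ("<html><head><script>var x = 1;</script></head>"
          "<body><h1>Title</h1><p>Hello world</p></body></html>")
plain = html_to_text(sample)
```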

The word segmentation unit 302 is responsible for performing word segmentation on the format-converted webpage content. To improve system efficiency, stop words and function words are removed before word segmentation (they may, of course, instead be removed after word segmentation), leaving only valid terms. In this embodiment, the word segmentation unit 302 divides the body and title of the converted webpage into words according to a dictionary. For example, "I saw you there", segmented with stop words removed after segmentation, is divided into five valid terms: "I", "see", "you", "in", and "that". Existing word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. This embodiment employs a word segmentation method based on string matching, also called the mechanical word segmentation method. It matches the Chinese character string to be analyzed against the terms in a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
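The text does not say which matching strategy is used. As one possible illustration, the sketch below applies forward maximum matching, a common string-matching strategy, against a toy dictionary. The function names, the dictionary contents, and the `max_len` parameter are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by repeatedly taking the longest dictionary word
    that starts at the current position (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: emit it as its own token
        tokens.append(match)
        i += len(match)
    return tokens

def remove_stop_words(tokens, stop_words):
    """Keep only valid terms, i.e. tokens that are not stop words."""
    return [token for token in tokens if token not in stop_words]

# Toy dictionary: 中国 (China), 人民 (people), 银行 (bank).
TOY_DICTIONARY = {"中国", "人民", "银行"}
terms = forward_max_match("中国人民银行", TOY_DICTIONARY)
```

Here the string for "People's Bank of China" is cut into the three valid terms used in the later examples.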

The word frequency statistics unit 303 is responsible for collecting word frequency statistics, which lay the foundation for establishing the compound word index. As the name implies, a compound word is a combination of two or more words (that is, terms) that carries a meaning or expresses a certain relationship. For example, "eat apple" is a compound word composed of the words "eat" and "apple"; "Bank of China" and "ceramic sand" are likewise compound words composed of two terms. The word frequency of a term is the number of times the term appears in a document: if a term appears thirty times in a document, its word frequency for that document is thirty. The word frequency statistics unit 303 first forms various combinations of the valid terms output by the word segmentation unit 302, such as combining the terms of "Chinese strategic choice of international intellectual property rights and domestic strategic arrangements" into "Chinese knowledge", "intellectual property rights", "China's intellectual property rights", "property rights international", and "international strategy". When all the combinations have been counted, they are sorted by frequency, and the combinations whose frequency is greater than a set threshold are output as compound words to the index establishing unit 304. Compound words obtained through probability statistics closely reflect actual usage and require no manual intervention, so good results can be achieved. Of course, other methods can also be used to determine compound words, such as treating combinations commonly used in daily life as compound words; the present invention is not limited in this respect.
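The counting and thresholding step can be sketched as follows. The sketch assumes that a "combination" is a contiguous run of two to `max_n` valid terms; the text speaks only of "various combinations", so other ways of combining terms are equally possible. All function and parameter names are hypothetical.

```python
from collections import Counter

def count_combinations(docs_terms, max_n=3):
    """Count contiguous combinations of 2..max_n valid terms over all documents."""
    counts = Counter()
    for terms in docs_terms:
        for n in range(2, max_n + 1):
            for i in range(len(terms) - n + 1):
                counts[tuple(terms[i:i + n])] += 1
    return counts

def select_compound_words(counts, threshold):
    """Sort combinations by frequency and keep those strictly above the threshold."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [combo for combo, frequency in ranked if frequency > threshold]

docs = [
    ["China", "people", "bank"],
    ["China", "people", "news"],
    ["bank", "news"],
]
counts = count_combinations(docs, max_n=2)
compounds = select_compound_words(counts, threshold=1)
```

In this toy corpus only ("China", "people") occurs more than once, so it is the only combination promoted to a compound word.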

The index establishing unit 304 is responsible for indexing all the valid terms output by the word segmentation unit 302 and the compound words output by the word frequency statistics unit 303, and saving the established index to the index database 400. Alternatively, the index establishing unit 304 may index only the valid terms that do not participate in forming any compound word, together with the compound words output by the word frequency statistics unit 303. The index establishing unit 304 also sends the compound words output by the word frequency statistics unit 303 to the index database 400, which stores all the received compound words in a compound word list (not shown in FIG. 1).
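A minimal sketch of such an index is an inverted index that maps each valid term and each compound word to the set of pages containing it. The representation below (tuples of terms as keys, URL strings as page identifiers) and the contiguous-match test for compound words are assumptions made for illustration. It shows the first variant, in which all valid terms are indexed.

```python
from collections import defaultdict

def build_index(pages, compound_words):
    """Build an inverted index over valid terms and compound words.

    `pages` maps a page URL to its list of valid terms; `compound_words`
    is a collection of term tuples. A page is indexed under a compound word
    when the compound's terms appear contiguously in the page (assumption).
    """
    index = defaultdict(set)
    for url, terms in pages.items():
        for term in terms:
            index[(term,)].add(url)  # single valid terms
        for compound in compound_words:
            n = len(compound)
            if any(tuple(terms[i:i + n]) == compound
                   for i in range(len(terms) - n + 1)):
                index[compound].add(url)
    return dict(index)

toy_pages = {
    "u1": ["China", "people", "bank"],
    "u2": ["people", "bank"],
}
toy_index = build_index(toy_pages, {("people", "bank")})
```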

The search module 500 is responsible for splitting the search term after the user submits a search request, finding all relevant webpages that match the search term in the index database 400, calculating their relevance, sorting them, and returning them to the user. The search module 500 includes a search term segmentation unit 501, a search unit 502, and a result processing unit 503.

The search term segmentation unit 501 segments the search term based on the valid terms and the compound word list in the index database 400, and sends the search terms obtained after segmentation to the search unit 502. If the search term is "People's Bank of China", the valid terms are "China", "People", and "Bank". If the compound word list contains "Chinese people" but neither "Bank of China" nor "People's Bank", the search term is split into two search terms: "Chinese people" and "bank". If the compound word list contains "Chinese people", "Bank of China", and "People's Bank", the search term is split into "Chinese people", "Bank of China", and "People's Bank". If the compound word list additionally contains "People's Bank of China", then "People's Bank of China" is used directly as the search term. The search unit 502 searches the index database 400 for each search term obtained after segmentation, extracts the webpages that satisfy the condition, and sends them to the result processing unit 503.

The result processing unit 503 performs intersection and union operations on the received webpages to obtain a set of result pages, calculates the relevance of each webpage to the search term, and returns the first K pages to the user in order of relevance (K is a natural number, and the links of the K pages are placed in one result page). If the user wishes to view the second result page, the links of the webpages ranked K+1 to 2*K in the sorted result are placed in the second result page and returned to the user. In other embodiments of the present invention, all retrieved results may instead be returned to the user at one time. In other embodiments of the present invention, the result pages corresponding to the compound words in the search term input by the user are ranked first.
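The paging rule amounts to slicing the ranked list: result page p holds the links ranked (p-1)*K+1 to p*K. A minimal sketch, with a hypothetical function name and a toy list of links:

```python
def page_of_results(ranked_links, page_number, k):
    """Return the links shown on result page `page_number` (1-based),
    with `k` links per result page."""
    if page_number < 1 or k < 1:
        raise ValueError("page_number and k must be positive")
    start = (page_number - 1) * k
    return ranked_links[start:start + k]

ranked = ["a", "b", "c", "d", "e", "f", "g"]  # already sorted by relevance
first_page = page_of_results(ranked, 1, 3)
second_page = page_of_results(ranked, 2, 3)
```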

To understand the search system 10 of the present invention, it should also be noted that "link information extraction processing" is performed at the same time as indexing: the webpage link information (including the anchor text, the link itself, and the like) is saved in a link database (not shown in FIG. 1), providing a basis for the webpage rating performed by a webpage rating module (not shown in FIG. 1). When the user performs a search, the search module 500 searches the index database 400 for relevant webpages, the webpage rating module combines the query request and the link information to evaluate the relevance of the search results, and the search module 500 sorts the results by relevance, extracts content summaries for the search terms, and assembles the final page returned to the user.

For example, if the user inputs "People's Bank of China" as the search term, the system can split it into "Chinese people" and "bank", perform two index queries, perform one intersection operation and one union operation, and return the search results to the user. This reduces the number of intersection and union operations and improves the search speed.
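The saving can be made concrete by counting operations under the pairwise rule described in the background: n search terms need n index queries, and combining their result sets pair by pair needs n(n-1)/2 intersections and as many unions. The helper below is a hypothetical illustration of that arithmetic only.

```python
from math import comb

def operation_counts(num_search_terms):
    """Return (index_queries, intersections, unions) when the result sets
    of the search terms are combined pair by pair."""
    pairs = comb(num_search_terms, 2)  # number of distinct pairs of result sets
    return num_search_terms, pairs, pairs

unary_split = operation_counts(3)     # "China", "People", "Bank"
compound_split = operation_counts(2)  # "Chinese people", "bank"
```

Splitting into three unary words costs three queries, three intersections, and three unions, whereas the compound-word split costs two queries, one intersection, and one union.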

FIG. 2 is a flow chart of the search method when establishing an index database in an embodiment of the present invention. As shown in FIG. 2, the search method of the present invention includes the following steps when establishing or updating the index database 400:

Step S11, reading a webpage, converting the text into a standard data format, and filtering out irrelevant information such as a script identifier and advertisement information;

Step S12, removing stop words and function words and performing word segmentation;

Step S13, performing word frequency statistics on the various combinations of the valid terms obtained by the word segmentation;

Step S14, outputting the combinations whose frequency is greater than the set threshold as compound words;

Step S15, indexing the compound words whose frequency is greater than the set threshold and all the valid terms obtained by the word segmentation, and saving the index.

In the embodiment of the present invention, after the compound words are indexed, the established index may be updated periodically, for example by adding a compound word and indexing the newly added compound word, by updating the webpage information in an existing compound word index, or by deleting a compound word and its index. A compound word may be added when the number of occurrences of a combination of valid terms in the webpages changes from less than the set threshold to greater than the set threshold.
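A periodic update of the compound word list might be sketched as below. The addition rule follows the text (a combination's count rises above the set threshold). The removal rule, dropping a compound word whose count is no longer above the threshold, is an assumption, since the text does not state when a compound word is deleted. Names are hypothetical.

```python
def update_compound_words(current, counts, threshold):
    """Compare the current compound word list with fresh combination counts.

    Returns (added, removed): combinations whose count is now above the
    threshold but which are not yet compound words, and existing compound
    words whose count is no longer above the threshold (assumed rule).
    """
    qualifying = {combo for combo, n in counts.items() if n > threshold}
    added = qualifying - current
    removed = current - qualifying
    return added, removed

current_list = {("a", "b")}
fresh_counts = {("a", "b"): 1, ("b", "c"): 5}
added, removed = update_compound_words(current_list, fresh_counts, threshold=2)
```

The caller would then index each added compound word and delete the index entries of each removed one.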

The embodiment of the invention further provides a search method in which, after the user's search term is received, the following steps are performed:

The search term is segmented according to the valid terms and the compound word list to obtain at least one search term. When the search term is segmented, it is preferentially split into compound words, and the valid terms that do not participate in any compound word are used directly as search terms. When the search term can be split into multiple compound words and one of them contains another, the contained compound word is not used as a search term; that is, a compound word used as a search term is not contained in any other compound word. For example, when the search term itself exists in the compound word list, the entire search term is used directly as a search term.

An index query is performed in the index database for the at least one search term to obtain at least one result set.

The at least one result set is returned to the user.
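The segmentation rule in the first step can be sketched as follows. The sketch assumes that a compound word matches when its terms occur in the query in the same order, not necessarily adjacently, because the "People's Bank of China" example treats "Bank of China" ("China" plus "Bank") as a candidate. Function names and the tuple representation are hypothetical.

```python
def is_subsequence(small, big):
    """True if the items of `small` appear in `big` in the same order."""
    remaining = iter(big)
    return all(item in remaining for item in small)

def split_query(query_terms, compound_words):
    """Split a query into compound words first, then leftover valid terms."""
    candidates = [c for c in compound_words if is_subsequence(c, query_terms)]
    # Drop any compound word that is contained in another candidate.
    maximal = [c for c in candidates
               if not any(c != other and is_subsequence(c, other)
                          for other in candidates)]
    covered = {term for compound in maximal for term in compound}
    leftovers = [(term,) for term in query_terms if term not in covered]
    return sorted(maximal) + leftovers

QUERY = ["China", "people", "bank"]
CP, CB, PB = ("China", "people"), ("China", "bank"), ("people", "bank")
one_compound = split_query(QUERY, {CP})
three_compounds = split_query(QUERY, {CP, CB, PB})
whole_query = split_query(QUERY, {CP, CB, PB, ("China", "people", "bank")})
```

The three calls correspond to the three cases of the "People's Bank of China" example: one compound word plus the leftover term "bank", three overlapping compound words, and the whole query used as a single search term.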

Before the result sets are returned to the user, they can be sorted: the intersection of all the result sets can be ranked first, followed by the parts of the result sets outside that intersection. The process is illustrated below by taking the search term "People's Bank of China" as an example. FIG. 3 is a flow chart of the search method after receiving the search request in the embodiment of the present invention. As shown in FIG. 3, the method includes the following steps:

Step S21, segmenting the search term according to the compound word list to obtain "Chinese people" and "bank";

Step S22, performing an index query on "Chinese people" in the index database to obtain a result set R1, and performing an index query on "bank" to obtain a result set R2;

Step S23, performing intersection calculation on the sets R1 and R2 to obtain a set R3;

Step S24, performing a union operation on the sets R1 and R2 to obtain a set R4;

Step S25, sorting the results and returning them to the user, with the webpages in the set R3 ranked first, followed by the webpages in the set R4 that are not in the set R3.
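Steps S23 to S25 can be sketched with ordinary set operations. The ordering within each group is alphabetical here purely as a placeholder; the patent ranks by relevance, which is not modelled. The function name and the toy result sets are assumptions.

```python
def rank_results(result_sets):
    """Put pages found by every search term first, then the remaining pages."""
    if not result_sets:
        return []
    common = set.intersection(*result_sets)  # R3: pages in every result set
    union = set.union(*result_sets)          # R4: pages in any result set
    # Alphabetical order stands in for relevance ordering (assumption).
    return sorted(common) + sorted(union - common)

r1 = {"a", "b", "c"}  # toy result set for "Chinese people"
r2 = {"b", "c", "d"}  # toy result set for "bank"
ranked_pages = rank_results([r1, r2])
```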

In other embodiments of the method of the present invention, compound words can be searched for at the same time at the time of searching to achieve a comprehensive and complete result.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes can be made to the present invention. All modifications, equivalents, improvements, etc., made within the spirit and scope of the invention are intended to be included within the scope of the appended claims.

Claims


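1. A method for establishing an index, comprising:

obtaining at least two valid terms from at least one web page;

determining at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained;

creating a web page index for each compound word.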

2. The method of claim 1, further comprising: creating a web page index for each valid term obtained from the web page; or

creating a web page index for each valid term that fails to participate in forming a compound word.

3. The method of claim 1, further comprising: updating a web page index established for the compound word.

4. The method of claim 1, further comprising: adding at least one compound word and establishing a web page index for each compound word added; and/or

deleting at least one compound word and the web page index established for the deleted compound word.

5. The method of claim 1, wherein said determining at least one compound word comprises:

counting the number of occurrences in the web page of the various combinations of the at least two valid terms, and determining as compound words the combinations of valid terms whose number of occurrences is greater than a set threshold.

6. The method according to claim 5, further comprising: adding a compound word when the number of occurrences of a combination of the at least two valid terms becomes greater than the set threshold, and establishing a web page index for the added compound word, wherein the added compound word is that combination of the at least two valid terms; and/or deleting a compound word when its number of occurrences is no longer greater than the set threshold, and deleting the web page index established for that compound word.

7. A method for searching, wherein a web page index is created for at least one compound word, the compound word being a combination of at least two valid terms among valid terms obtained from at least one web page, the method comprising:

splitting the search term into at least one compound word;

retrieving the web page index established for each compound word obtained by splitting the search term.

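8. The method according to claim 7, wherein when the index is established, the method further comprises:

creating a web page index for each valid term obtained from the web page; or

creating a web page index for each valid term that fails to participate in forming a compound word.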

9. The method according to claim 8, wherein when the search term is split, if the search term includes at least one valid term that fails to participate in forming a compound word, the method further comprises: retrieving the web page index established for each valid term in the search term that fails to participate in forming a compound word.

10. The method according to claim 7, wherein when the search term can be split into more than one compound word, said splitting the search term into at least one compound word comprises: splitting the search term into compound words that are not contained in other compound words.

11. An apparatus for establishing an index, comprising:

a first module, configured to obtain at least two valid terms from at least one webpage, and to determine at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained;

a second module, configured to establish a webpage index for each compound word determined by the first module.

12. The apparatus according to claim 11, wherein the second module is further configured to establish a webpage index for each valid term obtained from the webpage, or to establish a webpage index for each valid term that fails to participate in forming a compound word.

13. The apparatus according to claim 11, wherein the first module comprises: a first unit, configured to obtain at least two valid terms from at least one webpage; and a second unit, configured to determine at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained.

14. The apparatus according to claim 11, wherein the first module comprises: a third unit, configured to obtain at least two valid terms from at least one webpage; and a fourth unit, configured to count the number of occurrences in the webpage of combinations of at least two valid terms among the valid terms, and to determine as compound words the combinations of valid terms whose number of occurrences is greater than a set threshold.

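15. A system for searching, comprising:

a first module, configured to establish a webpage index for at least one compound word, wherein the compound word is a combination of at least two valid terms among valid terms obtained from at least one webpage;

a second module, configured to split the search term into at least one compound word, and to retrieve, according to the webpage index established by the first module, the webpage index established for each compound word obtained after the search term is split.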

16. The system according to claim 15, further comprising: a third module, configured to store the at least one compound word determined by the first module and the webpage index established for each compound word;

wherein the second module retrieves, from the third module, the webpage index established for each compound word obtained after the search term is split.

17. The system according to claim 15, wherein the first module comprises: a first unit, configured to obtain at least two valid terms from at least one webpage, and to determine at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained;

a second unit, configured to establish a webpage index for each compound word determined by the first unit.

18. The system of claim 17, wherein the first unit comprises: a first subunit, configured to obtain at least two valid terms from at least one webpage; and a second subunit, configured to determine at least one compound word, each compound word being a combination of at least two valid terms among the valid terms obtained by the first subunit.

19. The system according to any one of claims 15 to 18, wherein the second module comprises:

a third unit, configured to split the search term into at least one compound word according to the compound words determined by the first module;

a fourth unit, configured to receive each compound word from the third unit, and to retrieve the webpage index established for each compound word according to the webpage index established by the first module.

20. The system of claim 19, wherein the second module further comprises:

a fifth unit, configured to return to the user the webpage links in the webpage index retrieved by the fourth unit.

21. The system according to claim 15, wherein the first module is further configured to establish a webpage index for each valid term obtained from the webpage, or to establish a webpage index for each valid term that fails to participate in forming a compound word;

the second module is further configured to, when the search term includes at least one valid term that fails to participate in forming a compound word, retrieve the webpage index established for each valid term in the search term that fails to participate in forming a compound word.