WO2007085187A1

WO2007085187A1 - Method of data retrieval, method of generating index files and search engine

Info

Publication number: WO2007085187A1
Application number: PCT/CN2007/000244
Authority: WO
Inventors: Pengxi Zhu
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2006-01-25
Filing date: 2007-01-23
Publication date: 2007-08-02
Also published as: CN1858737B; CN1858737A

Abstract

A data retrieval method, a method of generating index files, and a search engine are disclosed. In the present invention, a first index file is generated for all the information to be retrieved, a second index file is generated for each piece of information individually, and the corresponding information is obtained according to the inputted keywords. When processing an advanced retrieval, the method also includes: decomposing the inputted query condition; retrieving a first sub-condition in the first index file to obtain the identifiers of the information that satisfies the first sub-condition; and retrieving, according to a second sub-condition, in the second index files corresponding to the information satisfying the first sub-condition. Because the present invention provides a plurality of second index files, it is easier to query them cooperatively with multiple tasks and so improve query speed. The second index files that are invoked hold a small amount of data, so physical overhead is saved, and no time has to be spent computing the intersection of information sets, so the search speed of advanced retrieval is improved.

Description

Method for data search, method for generating index file and search engine

This application claims priority to Chinese patent application No. 200610002759.9, filed with the China Patent Office on January 25, 2006 and entitled "A method and system for data search", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of data search, and in particular, to a data search method, an index file generation method, and a search engine.

Background technique

With the rapid development of the Internet, electronic information is constantly being enriched. However, this information is scattered across countless servers acting as network nodes. How ordinary users can quickly and accurately find the information they need is an important problem to be solved. Currently, users can use a search engine to search the data and find the information they need.

Search engines use automatic crawlers, such as web crawlers, search spiders, and online robots, to traverse nodes on a wide area network or a local area network (intranet). The information captured on each node is analyzed with full-text search techniques, indexed, and classified, and a corresponding index database is established for users to query. When a user searches for data, the user enters a keyword, and the search program reads the information in the index database, matches it against the user's keyword, retrieves the corresponding or related information, and outputs it to the user in an organized form.

Search engines are generally based on full-text search. Most current search engines use B-tree data structures to store index information. Each index node stores information about one word and is linked to the identifiers (IDs) of the documents containing that word; for example, the addresses of the documents containing the word are sorted by word frequency and attached to the index node as a singly linked list or a doubly linked list. The "word" refers to the smallest unit that can express information in a search engine.

FIG. 1 shows the structure of a prior-art index file after a search engine has finished indexing some documents. All information for all documents is stored in one B-tree, and each node 110 of the B-tree is a word. The word "I" points to a linked list 121, which holds all the documents containing "I" and the number of occurrences of "I" in each document; the documents containing "I" are document 10298 and document 786. The word "China" points to another linked list 122, which holds all the documents containing "China" and the number of occurrences of "China" in each document; the documents containing "China" are document 786 and document 26543. The number of occurrences is indicated by the word frequency: "I" occurs 4 times in document 786, and "China" occurs once in document 786.

When data is searched using an index file generated in this way, the search speed needs to be further improved, especially in an advanced search, for example a search that combines conditions with logical operators such as "and" and "not".

For example, with the above index file, suppose the user needs to search for documents containing both "I" and "China" and enters these keywords. The search finds that the documents containing "I" are document 10298 and document 786, and that the documents containing "China" are document 786 and document 26543, so the documents containing both "I" and "China" are the intersection of (10298, 786) and (786, 26543). The process of calculating the intersection affects the speed of the search. In addition, the index file includes the index information of a plurality of documents and its amount of data is large. In the prior art, an advanced search with multiple keywords searches this file multiple times, which slows the search, and loading the index file into memory each time brings a lot of physical overhead.
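The prior-art lookup described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: a Python dict stands in for the B-tree and its linked lists, the name `search_and` is hypothetical, and the word frequencies other than those stated for document 786 are placeholder values.

```python
# Single prior-art style index: word -> {document ID: word frequency}.
# A dict is used here in place of the B-tree and linked lists (assumption).
# Frequencies for documents 10298 and 26543 are placeholders.
index = {
    "I": {10298: 1, 786: 4},
    "China": {786: 1, 26543: 1},
}


def search_and(index, words):
    """Return the IDs of documents containing every word in `words`."""
    postings = [set(index.get(word, {})) for word in words]
    if not postings:
        return set()
    result = postings[0]
    for ids in postings[1:]:
        # This is the intersection step that the text identifies as costly.
        result = result & ids
    return result


print(search_and(index, ["I", "China"]))  # {786}
```

Each keyword requires its own lookup in the one large index, followed by an intersection over the resulting ID sets.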

Summary of the invention

Embodiments of the present invention provide a data search method, an index file generation method, and a search engine, which can improve the search speed when performing advanced search.

In an embodiment of the invention, a data search method includes:

Retrieving in a first index file to obtain an identifier of the information conforming to a first sub-condition, the first index file corresponding to a plurality of pieces of information that need to be retrieved;

and performing, according to a second sub-condition, a search in the second index file corresponding to the identifier of the information conforming to the first sub-condition, and obtaining a search result, each second index file corresponding to one information identifier.

In another embodiment of the present invention, a method for generating an index file includes:

Generating a first index file for all information that needs to be retrieved;

Generating a separate second index file for each piece of information; the first index file and the second index file are associated through the identifier of the information.

In still another embodiment of the present invention, a search engine includes:

a search module for obtaining information;

An indexing module, configured to generate a first index file for the information acquired by the search module, and to generate an independent second index file for each piece of information;

An advanced search module, configured to retrieve, in the first index file, an identifier of the information that meets the first sub-condition, and to search, according to the second sub-condition, in the second index file corresponding to the identifier of the information that meets the first sub-condition, to obtain the search results.

Since the embodiments of the present invention create a first index file for all the information and also generate a separate second index file for each piece of information, the number of searches in the first index file can be reduced when performing an advanced search. The first index file and the second index files are combined for retrieval; because the independent second index file for a single piece of information holds a small amount of information, the search time is reduced, and the time-consuming process of calculating the intersection of information sets is not needed, so the search speed of advanced retrieval is improved. In addition, the embodiments of the present invention may need to load the first index file only once and then call the independent second index files for single pieces of information. Since a second index file has a small amount of data, physical overhead can be saved.

Drawings

Figure 1 is a structural diagram of an index file generated by the prior art;

Figure 2 is a flow chart of generating a first index file in an embodiment of the present invention;

Figure 3 is a flow chart of generating a second index file in an embodiment of the present invention;

Figure 4 is a structural diagram of a separate index file generated for a single document in an embodiment of the present invention;

Figure 5 is a flow chart of a data search method in an embodiment of the present invention;

Figure 6 is a block diagram of a search engine in an embodiment of the present invention;

Figure 7 is a block diagram of the query module shown in Figure 6.

Detailed description

In the embodiments of the present invention, in addition to establishing a first index file for all the information that needs to be retrieved, a separate second index file is generated for each piece of information, and when a data search is performed, the first index file is combined with the second index files to improve the speed of the data search.

The information to be retrieved generally consists of electronic information distributed on the servers of countless nodes, and in practice it has many storage forms, such as documents, web pages, or database records. In the embodiments of the present invention, web pages are taken as the example storage form of the information to be retrieved, but the present invention is not limited to information of this form.

Since every web page has to be analyzed in turn so that the words contained in each web page are recorded in the first index file, the first index file can also be referred to as a large index file.

An index file needs to store the index information in some data structure; general full-text search, for example, uses a B-tree. The present invention does not limit the data structure of the index file, and any other feasible data structure may be used, such as a binary search tree (BST), a balanced tree (Adelson-Velskii and Landis tree, AVL tree), or a heap.

In order to clearly describe the process of generating an index file for a web page that needs to be retrieved, FIG. 2 takes a B-tree structure as an example, and shows a flow of an embodiment for generating a first index file.

Step S210, determining whether there is still a webpage for which an index needs to be built; if so, proceeding to step S220; otherwise, all the webpages that need to be retrieved have been analyzed and the index information of all the webpages has been stored in the first index file, so proceeding to step S280, where the process ends.

Step S220, determining whether the webpage contains the next word, if not, proceeding to step S280, ending the analysis of the webpage; if so, executing step S230;

Step S230, it is determined whether there is a node corresponding to the word in the B tree of the first index file, if yes, step S240 is performed, if not, step S270 is performed;

Step S240, determining whether the identifier (ID) of the webpage exists in the linked list pointed to by the word, if yes, executing step S250, if not, executing step S260;

Step S250, adding 1 to the word frequency corresponding to the webpage in the linked list pointed to by the word;

Step S260, adding the identifier of the webpage in the linked list pointed to by the word, and setting the corresponding word frequency to 1;

Step S270, adding a node corresponding to the word in the B-tree of the first index file, adding the identifier of the webpage to the linked list of that node, and setting the word frequency to 1.

An embodiment of a method of generating a first index file is given above, and of course, it can be implemented by other methods well known to those skilled in the art, and the present invention is not limited thereto. Moreover, the first index file may adopt different data structures, and then it may have different generation steps. Since it is a technology known in the art, it will not be described herein.
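Steps S210 to S270 can be sketched as follows. This is only an illustrative sketch under stated assumptions: a Python dict replaces the B-tree and linked lists, the function name `build_first_index` and the sample words "love" and "agree" are hypothetical, tokenization is assumed to have been done already, and the sketch moves on to the next page when a page has no more words, which is one interpretation of the flow.

```python
def build_first_index(pages):
    """Build the large index: word -> {page ID: word frequency}.

    `pages` maps a page ID to the list of words on that page.
    """
    first_index = {}
    for page_id, words in pages.items():      # S210: another page to index?
        for word in words:                    # S220: next word in the page
            postings = first_index.get(word)  # S230: node for this word?
            if postings is None:
                first_index[word] = {page_id: 1}  # S270: new node, frequency 1
            elif page_id in postings:             # S240: page ID already listed?
                postings[page_id] += 1            # S250: increment frequency
            else:
                postings[page_id] = 1             # S260: add page ID, frequency 1
    return first_index


# Hypothetical sample pages; only the counts for "I" and "China" in
# page 786 follow the figures quoted in the text.
pages = {
    786: ["I", "love", "China", "I", "I", "I"],
    10298: ["I", "agree"],
}
first_index = build_first_index(pages)
print(first_index["I"])      # {786: 4, 10298: 1}
print(first_index["China"])  # {786: 1}
```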

In an embodiment of the present invention, it is also necessary to generate an independent second index file for each web page. The second index files are independent of one another, and each web page has its own second index file. A second index file records the words contained in the corresponding web page and the number of occurrences of each word. If the B-tree structure is still used, each second index file is a B-tree and is associated with the first index file through the identifier of the corresponding web page.

Referring to Fig. 3, a B tree is taken as an example to show a flow of generating a separate second index file for each web page.

Step S310, determining whether a webpage contains the next word, if not, ending the generating step of the second index file for the webpage; if so, executing step S320;

Step S320, determining whether there is a node corresponding to the word in the B tree of the second index file, if yes, executing step S330, if not, executing step S360;

Step S330, determining whether the identifier (ID) of the webpage exists in the linked list of the first index file pointed to by the word, if yes, executing step S340, if not, executing step S350;

Step S340, incrementing by one the word frequency corresponding to the webpage in the linked list of the first index file pointed to by the word;

Step S350, adding the identifier of the webpage to the linked list of the first index file pointed to by the word, and setting the corresponding word frequency to 1;

Step S360, adding a node corresponding to the word in the B-tree of the second index file, adding the identifier of the webpage to the linked list of the first index file, and setting the word frequency to 1.

With the generation method of the above embodiment, the index information storage structure shown in Fig. 4 is formed for one web page. It includes a plurality of index nodes 410, each of which corresponds to one word.

In the embodiment of the present invention, the first index file or the second index file adopts a B-tree as its data structure; other data structures well known to those skilled in the art may also be used as the data storage form of the index file, which is not limited by the present invention.
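A per-page second index can be sketched as below. The sketch follows the simplified form in which each node of the small index holds a word and its frequency in that one page; it does not model the linked-list updates of steps S330 to S350. The dict representation and the names `build_second_index` and `build_second_indexes` are assumptions made for illustration.

```python
def build_second_index(words):
    """Build the small index for one page: word -> frequency in that page."""
    second_index = {}
    for word in words:             # S310: next word in the page
        if word in second_index:   # S320: node for this word already exists?
            second_index[word] += 1
        else:
            second_index[word] = 1  # S360: add a node with frequency 1
    return second_index


def build_second_indexes(pages):
    """Build one independent small index per page, keyed by page ID.

    The page ID is what ties each small index back to the large index.
    """
    return {page_id: build_second_index(words)
            for page_id, words in pages.items()}


# Hypothetical sample pages, as placeholders.
pages = {
    786: ["I", "love", "China", "I", "I", "I"],
    10298: ["I", "agree"],
}
second_indexes = build_second_indexes(pages)
print(second_indexes[786])  # {'I': 4, 'love': 1, 'China': 1}
```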

In the embodiment of the present invention, in addition to the first index file, a second index file is generated independently for each webpage, so that repeatedly calling the first index file, with its large amount of data, can be avoided and the speed of data search improved. It is also easier to use multiple tasks (multi-process) to query multiple index files simultaneously, especially when performing advanced searches. An advanced search is generally used for a more detailed and accurate search of web pages, and its search condition may be a single condition or a comprehensive condition. A comprehensive condition is generally a combination of multiple sub-conditions, each of which is a single condition, and the sub-conditions can be connected by "and", "or", and "not".
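Decomposing a comprehensive condition into sub-conditions might look like the sketch below. The query syntax (keywords separated by uppercase AND, OR, NOT) and the function name `decompose` are assumptions; the source does not specify how query conditions are written.

```python
OPERATORS = {"AND", "OR", "NOT"}


def decompose(query):
    """Split a query string into (connector, keyword) sub-conditions.

    The first keyword has no explicit connector and is treated as "AND".
    """
    sub_conditions = []
    pending = "AND"
    for token in query.split():
        if token in OPERATORS:
            pending = token          # remember the connector for the next keyword
        else:
            sub_conditions.append((pending, token))
            pending = "AND"          # default connector if none is given
    return sub_conditions


print(decompose("I AND China"))  # [('AND', 'I'), ('AND', 'China')]
print(decompose("I NOT China"))  # [('AND', 'I'), ('NOT', 'China')]
```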

Referring to FIG. 5, in an embodiment of the data search method of the present invention, performing data retrieval based on the generated index file includes the following steps:

Step S510: Generate a first index file for all the information that needs to be retrieved.

Step S520, generating an independent second index file for each information.

Then, according to the input keyword, the first index file and the second index file are combined to retrieve the related information. The search may be a general search or an advanced search:

When performing a general search, the first index file is searched according to the input query condition to obtain the required information. In addition, the search results can be sorted by the number of times the keyword appears in the information and then output.
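A general search with frequency-based sorting can be sketched as follows, assuming the dict-based stand-in for the first index file. The function name `general_search` is hypothetical, and the frequency for document 10298 is a placeholder.

```python
# Large index: word -> {document ID: word frequency} (dict stands in for the B-tree).
first_index = {
    "I": {10298: 1, 786: 4},
    "China": {786: 1, 26543: 1},
}


def general_search(first_index, keyword):
    """Return matching document IDs, most frequent occurrences first."""
    postings = first_index.get(keyword, {})
    return sorted(postings, key=lambda doc_id: postings[doc_id], reverse=True)


print(general_search(first_index, "I"))  # [786, 10298]
```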

An advanced search includes the following steps:

Step S530, decomposing the input query condition;

Step S540: Retrieving the first sub-condition in the first index file, and acquiring an identifier of the information that satisfies the first sub-condition;

Step S550, according to the second sub-condition, performing a search in the second index file corresponding to the information satisfying the first sub-condition.

If more than one piece of information satisfies the first sub-condition, multiple tasks may be used to search the second index files corresponding to those pieces of information simultaneously. Of course, the second index files corresponding to the information satisfying the previous sub-condition may also be searched in sequence.

Taking the B-tree shown in FIG. 4 as an example, when performing an advanced search, the input query condition is first decomposed into multiple sub-conditions. The first sub-condition is retrieved in the large index file (the first index file) to obtain the identifiers of all the documents satisfying the first sub-condition. On this basis, retrieval for the second sub-condition is performed, sequentially or simultaneously, in the small index files (second index files) corresponding to those document identifiers. And so on: if there is a next sub-condition, the search for it is performed in the second index files corresponding to the documents satisfying the previous sub-condition, thereby obtaining a more accurate query result.

For example, when searching for documents that contain both "I" and "China", the search engine first finds in the first index file all the documents containing "I", i.e., documents 10298 and 786, and then starts two tasks (multi-process or multi-thread) that search for the keyword "China" in the index files of document 10298 and document 786 respectively. For a single document such as document 10298, the index information is relatively small and the search time is short, so it can quickly be determined that document 10298 does not contain the word "China" and that document 786 does. The document containing both "I" and "China" is therefore document 786.
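The two-stage lookup in this example can be sketched as follows. It is a minimal sketch, assuming dict-based indexes and sub-conditions that are all joined by "and"; the function name `advanced_search`, the use of a thread pool for the "multiple tasks", and the placeholder frequencies are illustrative choices, not details from the source.

```python
from concurrent.futures import ThreadPoolExecutor

# Large index: word -> {document ID: word frequency}.
first_index = {"I": {10298: 1, 786: 4}, "China": {786: 1, 26543: 1}}
# Small indexes: document ID -> {word: word frequency in that document}.
second_indexes = {
    10298: {"I": 1},
    786: {"I": 4, "China": 1},
    26543: {"China": 1},
}


def advanced_search(first_index, second_indexes, sub_conditions):
    """AND-search: `sub_conditions` is a non-empty list of keywords."""
    first, *rest = sub_conditions
    # S540: one lookup in the large index gives the candidate document IDs.
    candidates = list(first_index.get(first, {}))

    def matches(doc_id):
        # S550: check the remaining keywords in that document's small index only.
        small_index = second_indexes[doc_id]
        return all(word in small_index for word in rest)

    # One task per candidate document; a sequential loop would also work.
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(matches, candidates))
    return [doc_id for doc_id, ok in zip(candidates, flags) if ok]


print(advanced_search(first_index, second_indexes, ["I", "China"]))  # [786]
```

No set intersection is computed: the large index is consulted once, and each candidate is confirmed or rejected by a lookup in its own small index.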

The existing search method has only one index file, so when searching with two or more sub-conditions, the sub-conditions can only be queued, the index file called for each in turn, and the intersection of the information sets obtained for the sub-conditions calculated in order to get the result satisfying the comprehensive query condition. Repeatedly calling the first index file, with its large amount of data, and calculating the intersection of the information sets make the search slow. Because the embodiments of the present invention replace the single index file of the prior art with a first index file plus a plurality of second index files, each index file loaded is small, it is not necessary to load the large first index file every time, and not too much memory is occupied; the process of combining the document IDs retrieved for each condition is also omitted, so retrieval time is saved and retrieval speed improved. Moreover, multiple processes can easily search at the same time to further speed up retrieval.
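The memory argument can be illustrated by storing each second index in its own file and loading only the ones a query needs. The on-disk layout here (one JSON file per document ID) and the function names are purely assumptions for illustration; the source does not describe a file format.

```python
import json
import tempfile
from pathlib import Path


def save_second_indexes(directory, second_indexes):
    """Write each small index to its own file, named after the document ID."""
    for doc_id, small_index in second_indexes.items():
        path = Path(directory) / f"{doc_id}.json"
        path.write_text(json.dumps(small_index), encoding="utf-8")


def load_second_index(directory, doc_id):
    """Load only the one small index file that a query actually needs."""
    path = Path(directory) / f"{doc_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as directory:
    save_second_indexes(directory, {10298: {"I": 1}, 786: {"I": 4, "China": 1}})
    loaded = load_second_index(directory, 786)  # the other file is never read

print("China" in loaded)  # True
```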

Step S560, the data search method may further sort the search results according to the number of times the keyword appears in the information and output the result.

To better meet users' data retrieval needs, it is not enough to output only the information that meets the query conditions. The relevance of documents to the keywords also needs to be evaluated and the results to be output sorted, so as to provide a user feedback mechanism. An inverted file is established in units of words and continuously matched against the search terms; the relevance of the information containing the search terms to the query is determined according to the frequency and probability with which the keywords appear in the information, and the search results are sorted and output.

Referring to FIG. 6, an embodiment of the search engine of the present invention may include the following units:

The search module 610 is configured to obtain information. In a specific implementation, the search module 610 can discover and collect information on the network through corresponding program code, crawl and analyze pages by following web page links, and add the results to the database, thereby acquiring the electronic information.

The indexing module 620 is configured to generate a first index file for the information acquired by the search module; and generate a separate second index file for each information.

In a specific implementation, the indexing module 620 is mainly used to interpret the information collected by the search module 610, extract index items from it, generate corresponding descriptive information to represent the information, build an index table of the information, and form a unified physical index database, thereby structuring unstructured information.

In the embodiment of the present invention, the indexing module 620 generates an independent second index file for each single piece of information in addition to generating the first index file, which reduces the number of times the first index file is called and allows searching with multiple tasks, thereby improving retrieval speed.

The query module 630 performs a query according to the input keyword and outputs the query result.

The query module 630 quickly retrieves the information in the first index file and the second index files, or in the first index file alone, according to the user's query, evaluates the relevance of the information to the keyword, and sorts the results to be output, so as to provide a better user feedback mechanism.

In a specific implementation, the query module 630 scans each word in the documents to create an inverted file in units of words, continuously matches it against the search terms, determines the relevance of each document to the query according to the frequency and probability with which the keyword occurs in the document, sorts the documents containing the search terms, and outputs the search results.

The interface unit 640 is configured to input keywords and display query results.

Referring to FIG. 7, in an embodiment, the query module includes:

A query condition decomposition module 631, configured to decompose the input query condition, obtain the first sub-condition and the second sub-condition, and provide them to the advanced search module;

The advanced search module 632 is configured to retrieve, in the first index file, an identifier of the information that meets the first sub-condition, and to search, according to the second sub-condition, in the second index file corresponding to the identifier of the information that meets the first sub-condition, to obtain the search results;

The general retrieval module 633 is configured to search in the first index file according to the input query condition to obtain search results;

The sorting module 634 is configured to sort the search results obtained by the advanced search module 632 or the general retrieval module 633.

In a specific implementation, the advanced retrieval module 632 can include:

The first index file retrieval module 6321 is configured to retrieve, in the first index file, an identifier of the information that meets the first sub-condition;

A second index file retrieval module 6322, configured to search, using multiple tasks, in the second index files corresponding to the identifiers of the plurality of pieces of information that meet the first sub-condition respectively; or to search sequentially in the second index files corresponding to the identifiers of the information that meets the first sub-condition.

The data search method, the index file generation method, and the search engine provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. The contents of this specification are not to be construed as limiting the invention.

Claims

1. A method for searching data, characterized in that it comprises:

2. The method for searching for data according to claim 1, wherein before the retrieving in the first index file, the method further comprises: decomposing the input query condition, and obtaining the first sub-condition and the second sub-condition.

3. The method for searching for data according to claim 1, wherein when there are a plurality of pieces of information conforming to the first sub-condition, the step of performing the search in the second index file is: using multiple tasks to search respectively in the second index files corresponding to the identifiers of the plurality of pieces of information conforming to the first sub-condition; or searching sequentially in the second index files corresponding to the identifiers of the information conforming to the first sub-condition.

4. The method for searching for data according to claim 1, wherein when there are a plurality of second sub-conditions, the step of performing the search in the second index file is: searching, according to the next second sub-condition, in the second index file corresponding to the identifier of the information that conforms to the previous second sub-condition.

5. The method for searching for data according to claim 1, wherein the first index file or the second index file saves data by using a B-tree, a binary search tree, a balanced tree, or a heap.

6. The method of searching for data according to claim 1, further comprising: sorting and outputting the search results according to the number of occurrences of the sub-conditions in the information.

7. A method for generating an index file, comprising:

Generating a first index file for all information that needs to be retrieved;

A separate second index file is generated for each of the information; the first index file and the second index file are associated by the identification of the information.

8. The method for generating an index file according to claim 7, wherein the step of generating an independent second index file for each piece of information is: generating, for each piece of information, a corresponding index file in the form of a B-tree.

9. The method for generating an index file according to claim 8, wherein the step of generating an index file in the form of a B-tree corresponding to each piece of information comprises:

Determining whether the information contains a word, if not, ending the generation of the second index file; if so, determining whether a node corresponding to the word exists in the B-tree of the second index file; if so, incrementing the corresponding word frequency by one; if not, adding the node corresponding to the word to the B-tree of the second index file and setting the word frequency to 1.

10. The method for generating an index file according to claim 7, wherein the step of generating an independent second index file for each piece of information is: generating an index file in the form of a binary search tree, a balanced tree, or a heap.

11. A search engine, comprising:

a search module for obtaining information;

12. The search engine according to claim 11, further comprising: a query condition decomposition module, configured to decompose the input query condition, obtain the first sub-condition and the second sub-condition, and provide them to the advanced search module.

13. The search engine according to claim 11, further comprising: a general search module, configured to search in the first index file according to the input query condition, to obtain a search result.

14. The search engine according to claim 13, further comprising: a sorting module, configured to sort the search results obtained by the advanced search module or the general search module.

15. The search engine according to claim 11, wherein the advanced search module comprises:

a first index file retrieval module, configured to retrieve, in the first index file, an identifier of the information that meets the first sub-condition;

a second index file retrieval module, configured to search, using multiple tasks, in the second index files corresponding to the identifiers of the plurality of pieces of information that meet the first sub-condition respectively; or to search sequentially in the second index files corresponding to the identifiers of the information that meets the first sub-condition.