WO2007085187A1 - Method of data retrieval, method of generating index files and search engine - Google Patents

Method of data retrieval, method of generating index files and search engine

Info

Publication number
WO2007085187A1
Authority
WO
WIPO (PCT)
Prior art keywords
index file
information
condition
sub
search
Prior art date
Application number
PCT/CN2007/000244
Other languages
French (fr)
Chinese (zh)
Inventor
Pengxi Zhu
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2007085187A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures

Definitions

  • The present invention relates to the field of data search, and in particular to a data search method, an index file generation method, and a search engine.
  • Search engines use automatic crawling programs, such as web crawlers (WEBCRAWLERS), search spiders (SPIDER), and online robots (ROBOT), to traverse the nodes of a wide area network or a local area network (INTRANET). The information captured at each node is analyzed with full-text retrieval techniques, then indexed and classified, and a corresponding index database is built for users to query.
  • When a user performs a data search and enters a keyword, the search program reads the information in the index database, matches it against the user's keyword, retrieves the corresponding or related information, and outputs it to the user in an organized form.
  • Search engines are generally based on full-text retrieval. Many current search engines store index information in a B-tree. Each index node stores the information for one "word" and is linked to the identifiers (IDs), for example the addresses, of the documents containing that word. The IDs of the documents containing the word are sorted by word frequency and attached to the index node as a singly or doubly linked list.
  • A "word" is the smallest unit that can express information in a search engine.
  • FIG. 1 shows the structure of an index file that a prior-art search engine has built for a set of documents. All information for all documents is stored in one B-tree, and each node 110 of the B-tree is a word. The word "I" points to a linked list 121, which holds every document containing "I" together with the number of times "I" occurs in that document; in FIG. 1, the documents containing "I" are document 10298 and document 786.
  • The word "China" points to another linked list 122, which holds every document containing "China" together with the number of times "China" occurs in that document; in FIG. 1, the documents containing "China" are document 10298 and document 786. The number of occurrences is expressed as the word frequency.
  • In document 786, "I" occurs 4 times and "China" occurs once.
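  • The prior-art structure described for FIG. 1 can be sketched as follows. This is only an illustration under stated assumptions: a Python dict stands in for the B-tree, a list for the linked list, the descending sort order is assumed (the text only says the list is sorted by word frequency), and the name `add_posting` and the frequency given for document 10298 are hypothetical.

```python
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (document ID, word frequency)


def add_posting(index: Dict[str, List[Posting]], word: str,
                doc_id: int, freq: int) -> None:
    """Attach a document to a word and keep the list sorted by frequency."""
    postings = index.setdefault(word, [])
    postings.append((doc_id, freq))
    # Assumption: highest word frequency first.
    postings.sort(key=lambda p: p[1], reverse=True)


# A dict replaces the B-tree; each key plays the role of a node 110.
index: Dict[str, List[Posting]] = {}
add_posting(index, "I", 10298, 2)    # frequency 2 is hypothetical
add_posting(index, "I", 786, 4)      # frequency 4 is stated in the text
add_posting(index, "China", 786, 1)  # frequency 1 is stated in the text

print(index["I"])      # [(786, 4), (10298, 2)]
print(index["China"])  # [(786, 1)]
```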
  • When data is searched with an index file generated as above, the search speed leaves room for improvement, especially in advanced retrieval such as retrieval involving AND, OR, and NOT operations.
  • For example, with the index file above, a user who wants documents containing both "I" and "China" enters those keywords. The search finds that the documents containing "I" are documents 10298 and 786, and that the documents containing "China" are documents 786 and 26543. The documents containing both are the intersection of (10298, 786) and (786, 26543), and computing that intersection slows the search.
  • In addition, because the index file holds the index information of many documents, its data volume is large. Prior-art advanced retrieval has to search it once per keyword, which affects the search speed, and loading the index file into memory each time incurs considerable physical overhead.
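  • The prior-art lookup criticized above can be sketched as one lookup per keyword in the shared index, followed by an explicit set intersection. The dict-based index and the function name `search_all` are assumptions made for illustration; the document IDs are the ones used in the example above.

```python
from typing import Dict, List, Set

# One shared index for all documents: word -> IDs of documents containing it.
big_index: Dict[str, List[int]] = {
    "I": [10298, 786],
    "China": [786, 26543],
}


def search_all(index: Dict[str, List[int]], keywords: List[str]) -> Set[int]:
    """Look up every keyword in the shared index, then intersect the sets."""
    result: Set[int] = set()
    for i, word in enumerate(keywords):
        docs = set(index.get(word, []))
        # The first keyword seeds the result; later ones narrow it.
        result = docs if i == 0 else result & docs
    return result


print(search_all(big_index, ["I", "China"]))  # {786}
```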
  • Embodiments of the present invention provide a data search method, an index file generation method, and a search engine, which can improve the search speed when performing advanced search.
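  • In one embodiment of the present invention, a data search method includes:
  • retrieving in a first index file to obtain the identifiers of the information that meets a first sub-condition, the first index file corresponding to the plural pieces of information to be retrieved; and
  • retrieving, according to a second sub-condition, in the second index files corresponding to the identifiers of the information that meets the first sub-condition, to obtain a retrieval result, each second index file corresponding to one information identifier.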
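  • In another embodiment of the present invention, a method for generating index files includes:
  • generating a first index file for all the information to be retrieved; and
  • generating an independent second index file for each piece of information, the first index file and the second index files being associated through the identifiers of the information.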
  • In yet another embodiment of the present invention, a search engine includes:
  • a search module, configured to obtain information;
  • an indexing module, configured to generate a first index file for the information obtained by the search module, and to generate an independent second index file for each piece of information; and
  • an advanced search module, configured to retrieve in the first index file to obtain the identifiers of the information that meets the first sub-condition, and to retrieve, according to the second sub-condition, in the second index files corresponding to those identifiers, to obtain the retrieval result.
  • Because the embodiments of the present invention build one first index file for all the information and also generate an independent second index file for each piece of information, an advanced search needs fewer searches in the first index file; retrieval instead combines the first index file with the second index files.
  • Because the independent second index file for a single piece of information holds little data, the search time is reduced, and the time-consuming computation of the intersection of information sets is not needed, so the search speed of advanced retrieval is improved.
  • In addition, the embodiments may need to load the first index file only once and then call the independent second index files for single pieces of information. Because a second index file holds little data, physical overhead is saved.
  • FIG. 1 is a structural diagram of an index file generated in the prior art.
  • FIG. 2 is a flow chart of generating a first index file in an embodiment of the present invention.
  • FIG. 3 is a flow chart of generating a second index file in an embodiment of the present invention.
  • FIG. 4 is a structural diagram of an independent index file generated for a single document in an embodiment of the present invention.
  • FIG. 5 is a flow chart of a data search method in an embodiment of the present invention.
  • FIG. 6 is a block diagram of a search engine in an embodiment of the present invention.
  • FIG. 7 is a block diagram of the query module shown in FIG. 6.
  • In the embodiments of the present invention, in addition to building one first index file for all the information to be retrieved, an independent second index file is generated separately for each piece of information. During a data search, the first index file and the second index files are used together, which improves the data search speed.
  • The information to be retrieved generally consists of electronic information scattered across the servers of countless nodes. In practice it is stored in many forms, for example as documents, web pages, or database records.
  • The embodiments of the present invention are described using web pages as the storage form of the information to be retrieved, but the present invention is not limited to information in this form.
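  • Because every web page has to be analyzed in turn so that the words contained in each web page are all recorded in the first index file, the first index file can also be called a large index file.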
  • An index file has to store the index information in some data structure; for example, full-text retrieval generally uses a B-tree.
  • The present invention does not limit the data structure of the index file, and any other workable data structure may be used, for example a binary search tree (BST), a balanced tree (Adelson-Velskii and Landis tree, AVL tree), or a heap.
  • FIG. 2 takes a B-tree structure as an example, and shows a flow of an embodiment for generating a first index file.
  • Step S210: determine whether there is a web page for which an index still needs to be built; if so, go to step S220; otherwise all the web pages to be retrieved have been analyzed and their index information has been stored in the first index file, so go to step S280 and end the process.
  • Step S220: determine whether the web page contains a next word; if not, go to step S280 and end the analysis of this web page; if so, go to step S230.
  • Step S230: determine whether the B-tree of the first index file has a node corresponding to the word; if so, go to step S240; if not, go to step S270.
  • Step S240: determine whether the identifier (ID) of the web page exists in the linked list pointed to by the word; if so, go to step S250; if not, go to step S260.
  • Step S250: add 1 to the word frequency corresponding to the web page in the linked list pointed to by the word.
  • Step S260: add the identifier of the web page to the linked list pointed to by the word, and set the corresponding word frequency to 1.
  • Step S270: add a node corresponding to the word to the B-tree of the first index file, add the identifier of the web page to its linked list, and set the word frequency to 1.
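  • Steps S210 to S270 can be sketched as the loop below. It is a minimal illustration, not the patented implementation: nested dicts replace the B-tree and its linked lists, pages are assumed to arrive already split into words, and the function name `build_first_index` and the sample page contents are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# word -> {page ID -> word frequency}; dicts replace the B-tree and linked list.
FirstIndex = Dict[str, Dict[int, int]]


def build_first_index(pages: Iterable[Tuple[int, List[str]]]) -> FirstIndex:
    index: FirstIndex = {}
    for page_id, words in pages:            # S210: next page to index
        for word in words:                  # S220: next word of the page
            if word not in index:           # S230: no node for this word yet
                index[word] = {page_id: 1}  # S270: new node, frequency 1
            elif page_id in index[word]:    # S240: page already in the list
                index[word][page_id] += 1   # S250: frequency plus 1
            else:
                index[word][page_id] = 1    # S260: add page, frequency 1
    return index


# Hypothetical page contents, used only to exercise the loop.
pages = [(786, ["I", "love", "China", "I"]), (10298, ["I", "am", "here"])]
first = build_first_index(pages)
print(first["I"])      # {786: 2, 10298: 1}
print(first["China"])  # {786: 1}
```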
  • The above is one embodiment of a method of generating the first index file. It can of course also be implemented by other methods well known to those skilled in the art, and the present invention is not limited in this respect.
  • For example, the first index file may use a different data structure, in which case the generation steps differ. Since these are techniques known in the art, they are not described here.
  • An independent second index file also has to be generated for each web page.
  • The second index files are independent of one another, and each web page has its own second index file.
  • A second index file describes and records the words contained in the corresponding web page and the number of times each word occurs. If the B-tree structure is still used, a second index file is one B-tree, and it is associated with the first index file through the identifier of the corresponding web page.
  • Taking a B-tree as an example, FIG. 3 shows the flow of generating an independent second index file for each web page.
  • Step S310: determine whether the web page contains a next word; if not, end the generation of the second index file for this web page; if so, go to step S320.
  • Step S320: determine whether the B-tree of the second index file has a node corresponding to the word; if so, go to step S330; if not, go to step S360.
  • Step S330: determine whether the identifier (ID) of the web page exists in the linked list of the first index file pointed to by the word; if so, go to step S340; if not, go to step S350.
  • Step S340: add 1 to the word frequency corresponding to the web page in the linked list of the first index file pointed to by the word.
  • Step S350: add the identifier of the web page to the linked list of the first index file pointed to by the word, and set the corresponding word frequency to 1.
  • Step S360: add a node corresponding to the word to the B-tree of the second index file, add the identifier of the web page to the linked list of the first index file, and set the word frequency to 1.
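  • The per-page second index can be sketched as below. This is a simplification under stated assumptions: a dict replaces the per-page B-tree, the updates to the first index file's linked list in steps S330 to S350 are not reproduced, and the association with the first index file is modelled only by keying each second index on its page ID. The names `build_second_index` and `build_second_indexes` and the page contents are hypothetical.

```python
from typing import Dict, List

SecondIndex = Dict[str, int]  # word -> frequency within one page


def build_second_index(words: List[str]) -> SecondIndex:
    """Count each word of a single page (a dict replaces the B-tree)."""
    index: SecondIndex = {}
    for word in words:                        # S310: next word of the page
        index[word] = index.get(word, 0) + 1  # new node, or frequency plus 1
    return index


def build_second_indexes(pages: Dict[int, List[str]]) -> Dict[int, SecondIndex]:
    """One independent second index per page, looked up by page ID."""
    return {pid: build_second_index(words) for pid, words in pages.items()}


# Hypothetical page contents.
second = build_second_indexes({
    786: ["I", "love", "China", "I"],
    10298: ["I", "am", "here"],
})
print(second[786])    # {'I': 2, 'love': 1, 'China': 1}
print(second[10298])  # {'I': 1, 'am': 1, 'here': 1}
```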
  • The above steps produce, for one web page, the index information storage structure shown in FIG. 4, which includes a plurality of index nodes 410, each corresponding to one word.
  • In the embodiments above, the first index file or the second index file uses a B-tree data structure. Other data structures well known to those skilled in the art may also be used as the storage form of the index files; the present invention does not limit this.
  • Because a second index file is generated independently for each web page, the first index file with its large data volume does not have to be called repeatedly, so the data search speed can be improved. It also becomes easier to query multiple index files simultaneously with multiple tasks (multiple processes), especially in advanced searches.
  • Advanced search is generally used for more detailed and precise searching of web pages. The search condition may be a single condition or a comprehensive condition.
  • A comprehensive condition is generally a combination of several sub-conditions, each of which is a single condition; the sub-conditions can be connected by "and", "or", and "not".
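  • One way to picture a comprehensive condition is as a first keyword followed by (connective, keyword) pairs, evaluated against the words of a single page. The representation, the strict left-to-right evaluation without operator precedence, the reading of "not" as "and not", and the name `satisfies` are all assumptions made for this sketch; the text does not specify them.

```python
from typing import List, Set, Tuple

# (connective, keyword) pairs; the connective is "and", "or", or "not".
Rest = List[Tuple[str, str]]


def satisfies(words: Set[str], first: str, rest: Rest) -> bool:
    """Evaluate a comprehensive condition against the words of one page."""
    result = first in words
    for connective, keyword in rest:
        present = keyword in words
        if connective == "and":
            result = result and present
        elif connective == "or":
            result = result or present
        elif connective == "not":
            # Assumption: "not" excludes pages that contain the keyword.
            result = result and not present
        else:
            raise ValueError(f"unknown connective: {connective}")
    return result


page_words = {"I", "love", "China"}  # hypothetical page content
print(satisfies(page_words, "I", [("and", "China")]))  # True
print(satisfies(page_words, "I", [("not", "China")]))  # False
```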
  • Data retrieval based on the generated index files includes the following steps:
  • Step S510: generate a first index file for all the information to be retrieved.
  • Step S520: generate an independent second index file for each piece of information.
  • The retrieval may be a general search or an advanced search.
  • In a general search, the first index file is searched according to the input query condition to obtain the required information.
  • The search results can then be sorted by the number of times the keyword appears in the information, and the sorted result is output.
  • Step S530: decompose the input query condition into sub-conditions.
  • Step S540: search the first index file for the first sub-condition and obtain the identifiers of the information that satisfies it.
  • Step S550: according to the second sub-condition, search the second index files corresponding to the information that satisfies the first sub-condition.
  • In step S550, multiple tasks may be used to search the second index files corresponding to that information.
  • Alternatively, the second index files corresponding to the information that satisfies the previous sub-condition may be searched one after another.
  • In other words, when an advanced search is performed, the input query condition is first decomposed into multiple sub-conditions. The first sub-condition is searched in the large index file (the first index file), which yields the identifiers of all documents that satisfy it. On that basis, the second sub-condition is searched, sequentially or simultaneously, in the small index files (the second index files) corresponding to those document identifiers. If there is a further sub-condition, it is searched in the second index files of the documents that satisfied the previous sub-condition, and so on, which yields a progressively more precise query result.
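  • The sequential variant of this procedure can be sketched as follows, assuming every sub-condition is a single keyword and all sub-conditions are joined by "and". Dicts stand in for the index files, and the function name `advanced_search` and the frequencies other than those of document 786 are hypothetical.

```python
from typing import Dict, List

FirstIndex = Dict[str, Dict[int, int]]    # word -> {document ID -> frequency}
SecondIndexes = Dict[int, Dict[str, int]]  # document ID -> {word -> frequency}


def advanced_search(first_index: FirstIndex, second_indexes: SecondIndexes,
                    sub_conditions: List[str]) -> List[int]:
    """Search the large index once, then narrow in the small per-document indexes."""
    if not sub_conditions:
        return []
    # S540: the first sub-condition is looked up in the first index file.
    candidates = list(first_index.get(sub_conditions[0], {}))
    # S550 and onward: each later sub-condition is checked only in the second
    # index files of the documents that satisfied the previous sub-condition.
    for word in sub_conditions[1:]:
        candidates = [d for d in candidates if word in second_indexes.get(d, {})]
    return candidates


first_index = {"I": {10298: 1, 786: 4}, "China": {786: 1, 26543: 2}}
second_indexes = {10298: {"I": 1}, 786: {"I": 4, "China": 1}, 26543: {"China": 2}}
print(advanced_search(first_index, second_indexes, ["I", "China"]))  # [786]
```

  • Note that no set intersection is computed: the first index file is consulted once, and every later step is a membership test in a small per-document index.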
  • For example, to find documents that contain both "I" and "China", the search engine first finds all documents that contain "I", namely documents 10298 and 786. It then starts two tasks (processes or threads) that search for the keyword "China" in the index files of document 10298 and document 786 respectively. Because the index information of a single document such as 10298 is small, the search takes little time, and it is quickly determined that document 10298 does not contain "China" while document 786 does. The document containing both "I" and "China" is therefore document 786.
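  • The multitask step of this example can be sketched with one thread per candidate document. The use of Python's `concurrent.futures.ThreadPoolExecutor`, the dict-based second index files, and the name `parallel_filter` are choices made for this illustration; the text only requires that several tasks (processes or threads) search the second index files at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def parallel_filter(doc_ids: List[int],
                    second_indexes: Dict[int, Dict[str, int]],
                    word: str) -> List[int]:
    """Check each candidate document's own small index in a separate task."""
    workers = max(1, len(doc_ids))  # the executor needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order, so hits line up with doc_ids.
        hits = list(pool.map(lambda d: word in second_indexes.get(d, {}), doc_ids))
    return [d for d, hit in zip(doc_ids, hits) if hit]


# Documents 10298 and 786 already satisfy the first sub-condition "I".
second_indexes = {10298: {"I": 1}, 786: {"I": 4, "China": 1}}
print(parallel_filter([10298, 786], second_indexes, "China"))  # [786]
```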
  • The existing search method has only one index file. When two or more sub-conditions are searched, the sub-conditions have to be queued, the index file is called for each in turn, and the intersection of the information sets that satisfy each sub-condition is computed to obtain the result for the comprehensive query condition. Repeatedly calling the first index file with its large data volume, and computing the intersection of the information sets, slows the search.
  • The embodiments of the present invention replace the single index file of the prior art with an arrangement that also provides multiple second index files. Each index file loaded is therefore small, the large first index file does not have to be loaded every time, and little memory is occupied. The step of merging the document IDs retrieved for each condition is also omitted, which saves retrieval time and improves retrieval speed. Moreover, multiple processes can easily search at once, speeding up retrieval further.
  • Step S560: the search results may further be sorted by the number of times the keyword appears in the information, and the sorted result is output.
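  • The sorting in step S560 can be sketched as a sort keyed on the keyword's frequency in each result. Descending order is an assumption (the text does not state the direction), the per-document dicts stand in for the second index files, and the name `sort_results` and the frequency given for document 10298 are hypothetical.

```python
from typing import Dict, List


def sort_results(doc_ids: List[int],
                 second_indexes: Dict[int, Dict[str, int]],
                 keyword: str) -> List[int]:
    """Order results by how often the keyword occurs in each document."""
    return sorted(
        doc_ids,
        key=lambda d: second_indexes.get(d, {}).get(keyword, 0),
        reverse=True,  # assumption: most frequent first
    )


second_indexes = {786: {"I": 4, "China": 1}, 10298: {"I": 1}}
print(sort_results([10298, 786], second_indexes, "I"))  # [786, 10298]
```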
  • An embodiment of the search engine of the present invention may include the following units:
  • The search module 610 is configured to obtain information.
  • The search module 610 can discover and collect information on the network by running a corresponding program, crawl and analyze pages by following web page links, and add the results to a database, thereby acquiring the electronic information.
  • The indexing module 620 is configured to generate a first index file for the information acquired by the search module, and to generate an independent second index file for each piece of information.
  • The indexing module 620 mainly interprets the information collected by the search module 610, extracts index items from it, generates descriptive information that represents the collected information, builds an index table of the information, and forms a unified physical index database, thereby giving structure to unstructured information.
  • Because the indexing module 620 generates an independent second index file for each single piece of information in addition to the first index file, the first index file is called fewer times and multiple tasks can search cooperatively, which improves retrieval speed.
  • The query module 630 performs a query according to the input keyword and outputs the query result.
  • According to the user's query, the query module 630 quickly retrieves information in the first index file and the second index files, or in the first index file alone, evaluates the relevance of the information to the keyword, and sorts the results to be output, which allows a better user feedback mechanism.
  • Each word in a document is scanned to build an inverted file organized by word. The query module 630 matches the search terms against it, evaluates the relevance of each document to the query from the frequency and probability with which the keywords occur in the document, sorts the documents containing the search terms, and outputs the search results.
  • The interface unit 640 is configured to receive input keywords and display query results.
  • The query module includes:
  • a query condition decomposition module 631, configured to decompose the input query condition, obtain the first sub-condition and the second sub-condition, and provide them to the advanced search module;
  • an advanced search module 632, configured to retrieve in the first index file to obtain the identifiers of the information that meets the first sub-condition, and to retrieve, according to the second sub-condition, in the second index files corresponding to those identifiers, to obtain the search results;
  • a general search module 633, configured to search the first index file according to the input query condition to obtain search results; and
  • a sorting module 634, configured to sort the search results obtained by the advanced search module 632 or the general search module 633.
  • The advanced search module 632 can include:
  • a first index file retrieval module 6321, configured to retrieve in the first index file to obtain the identifiers of the information that meets the first sub-condition; and
  • a second index file retrieval module 6322, configured either to use multiple tasks to search, respectively, the second index files corresponding to the identifiers of the plural pieces of information that meet the first sub-condition, or to search those second index files one after another.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data retrieval method, a method of generating index files, and a search engine are disclosed. A first index file is generated for all the information to be retrieved, a second index file is generated for each piece of information individually, and the corresponding information is obtained for the input keywords. For an advanced retrieval, the method further includes: decomposing the input query condition; searching the first index file for a first sub-condition to obtain the identifiers of the information that satisfies it; and searching, according to a second sub-condition, the second index files corresponding to the information that satisfies the first sub-condition. Because there are a plurality of second index files, cooperative querying by multiple tasks is easier, which improves query speed. Because each second index file invoked holds little data, physical overhead is saved, and because the time-consuming computation of the intersection of information sets is unnecessary, the search speed of advanced retrieval is improved.

Description

Method for data search, method for generating index files, and search engine
This application claims priority to Chinese patent application No. 200610002759.9, entitled "A method and system for data search", filed with the Chinese Patent Office on January 25, 2006, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of data search, and in particular to a data search method, an index file generation method, and a search engine.
Background
With the rapid development of the Internet, electronic information keeps growing richer. This information, however, is scattered across countless servers acting as network nodes, and how an ordinary user can quickly and accurately find the information they need is an important problem to be solved. At present, users can use a search engine (SEARCH ENGINE) to search data and find the information they need.
Search engines use automatic crawling programs, such as web crawlers (WEBCRAWLERS), search spiders (SPIDER), and online robots (ROBOT), to traverse the nodes of a wide area network or a local area network (INTRANET). The information captured at each node is analyzed with full-text retrieval techniques, then indexed and classified, and a corresponding index database is built for users to query. When a user performs a data search and enters a keyword, the search program reads the information in the index database, matches it against the user's keyword, retrieves the corresponding or related information, and outputs it to the user in an organized form.
Search engines are generally based on full-text retrieval. Many current search engines store index information in a B-tree. Each index node stores the information for one "word" and is linked to the identifiers (IDs), for example the addresses, of the documents containing that word. The IDs of the documents containing the word are sorted by word frequency and attached to the index node as a singly or doubly linked list. A "word" is the smallest unit that can express information in a search engine.
FIG. 1 shows the structure of an index file that a prior-art search engine has built for a set of documents. All information for all documents is stored in one B-tree, and each node 110 of the B-tree is a word. The word "I" points to a linked list 121, which holds every document containing "I" together with the number of times "I" occurs in that document; in FIG. 1, the documents containing "I" are document 10298 and document 786. The word "China" points to another linked list 122, which holds every document containing "China" together with the number of times "China" occurs in that document; in FIG. 1, the documents containing "China" are document 10298 and document 786. The number of occurrences is expressed as the word frequency: in document 786, "I" occurs 4 times and "China" occurs once.
When data is searched with an index file generated by the above method, the search speed leaves room for improvement, especially in advanced retrieval, for example retrieval involving AND, OR, and NOT operations.
For example, with the index file above, a user who wants documents containing both "I" and "China" enters those keywords. The search finds that the documents containing "I" are documents 10298 and 786, and that the documents containing "China" are documents 786 and 26543. The documents containing both are the intersection of (10298, 786) and (786, 26543), and computing that intersection slows the search. In addition, because the index file holds the index information of many documents, its data volume is large. Prior-art advanced retrieval has to search it once per keyword, which affects the search speed, and loading the index file into memory each time incurs considerable physical overhead.
Summary of the invention
Embodiments of the present invention provide a data search method, an index file generation method, and a search engine that can improve the search speed of advanced retrieval.
In one embodiment of the present invention, a data search method includes:
retrieving in a first index file to obtain the identifiers of the information that meets a first sub-condition, the first index file corresponding to the plural pieces of information to be retrieved; and
retrieving, according to a second sub-condition, in the second index files corresponding to the identifiers of the information that meets the first sub-condition, to obtain a retrieval result, each second index file corresponding to one information identifier.
In another embodiment of the present invention, a method for generating index files includes:
generating a first index file for all the information to be retrieved; and
generating an independent second index file for each piece of information, the first index file and the second index files being associated through the identifiers of the information.
In yet another embodiment of the present invention, a search engine includes:
a search module, configured to obtain information;
an indexing module, configured to generate a first index file for the information obtained by the search module, and to generate an independent second index file for each piece of information; and
an advanced search module, configured to retrieve in the first index file to obtain the identifiers of the information that meets the first sub-condition, and to retrieve, according to the second sub-condition, in the second index files corresponding to those identifiers, to obtain the retrieval result.
Because the embodiments of the present invention build one first index file for all the information and also generate an independent second index file for each piece of information, an advanced search needs fewer searches in the first index file; retrieval instead combines the first index file with the second index files. Because the independent second index file for a single piece of information holds little data, the search time is reduced, and the time-consuming computation of the intersection of information sets is not needed, so the search speed of advanced retrieval is improved. In addition, the embodiments may need to load the first index file only once and then call the independent second index files for single pieces of information. Because a second index file holds little data, physical overhead is saved.
Brief description of the drawings
FIG. 1 is a structural diagram of an index file generated in the prior art;
FIG. 2 is a flow chart of generating a first index file in an embodiment of the present invention;
FIG. 3 is a flow chart of generating a second index file in an embodiment of the present invention;
FIG. 4 is a structural diagram of an independent index file generated for a single document in an embodiment of the present invention;
FIG. 5 is a flow chart of a data search method in an embodiment of the present invention;
FIG. 6 is a block diagram of a search engine in an embodiment of the present invention;
FIG. 7 is a block diagram of the query module shown in FIG. 6.
具体实施方式 detailed description
In the embodiments of the present invention, in addition to building one first index file for all the information to be searched, a separate, independent second index file is generated for each piece of information. During a data search, the first index file and the second index files are searched in combination, which increases search speed.
The information to be searched generally consists of electronic information scattered across servers at countless nodes. In practice it is stored in many forms, for example as documents, web pages, or database records. The embodiments below use web pages as the example storage form, but the present invention is not limited to information in this form.
Because every web page must be analyzed in turn and the words contained in each page are all recorded in the first index file, the first index file may also be referred to as the large index file.
An index file stores its index information in some data structure; full-text search, for example, commonly uses a B-tree. The present invention does not restrict the data structure of the index file, and any other workable data structure may be used, for example a binary search tree (BST), a balanced tree (Adelson-Velskii and Landis tree, AVL tree), or a heap.
To describe clearly how an index file is generated for the web pages to be searched, FIG. 2 shows the flow of one embodiment of generating the first index file, using a B-tree structure as an example.
Step S210: determine whether there is a web page that still needs to be indexed. If so, go to step S220; otherwise all web pages to be searched have been analyzed and their index information is already stored in the first index file, so go to step S280 and end the flow.
Step S220: determine whether the web page contains a next word. If not, go to step S280 and end the analysis of this web page; if so, perform step S230.
Step S230: determine whether a node corresponding to the word exists in the B-tree of the first index file. If it exists, perform step S240; if not, perform step S270.
Step S240: determine whether the identifier (ID) of the web page exists in the linked list that the word points to. If it exists, perform step S250; if not, perform step S260.
Step S250: increase by 1 the word frequency recorded for the web page in the linked list that the word points to.
Step S260: add the identifier of the web page to the linked list that the word points to, and set its word frequency to 1.
Step S270: add a node for the word to the B-tree of the first index file, add the identifier of the web page to the linked list, and set the word frequency to 1.
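The flow of steps S210 to S270 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: a plain dictionary stands in for the B-tree of word nodes, a nested dictionary stands in for each word's linked list of (page identifier, word frequency) entries, and pages are split into words on whitespace. The function name `build_first_index` and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_first_index(pages):
    """Build the large (first) index: word -> {page_id: word frequency}."""
    index = defaultdict(dict)            # assumption: a dict stands in for the B-tree
    for page_id, text in pages.items():  # S210: is there a page left to index?
        for word in text.split():        # S220: does the page have a next word?
            postings = index[word]       # S230 / S270: find or create the word node
            # S240-S260: add 1 to the frequency, or add the page with frequency 1
            postings[page_id] = postings.get(page_id, 0) + 1
    return dict(index)


# Hypothetical sample pages, keyed by page identifier.
pages = {
    786: "I love China and I love search",
    10298: "I build index files",
}
first_index = build_first_index(pages)
```

With these sample pages, the entry for the word "I" records both page identifiers together with their frequencies, which is the information a lookup for a first sub-condition needs.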
The above is one embodiment of a method for generating the first index file. Other methods well known to those skilled in the art may of course be used, and the present invention places no limitation on this. The first index file may also use a different data structure, in which case the generation steps differ accordingly; as these belong to the known art, they are not described further here.
In the embodiments of the present invention, an independent second index file is also generated for each web page. The second index files are independent of one another: each web page has its own second index file, which describes and records the words contained in that page and the number of times each word occurs. If a B-tree structure is again used, each second index file is one B-tree, and it is associated with the first index file through the identifier of the corresponding web page.
FIG. 3 shows, again using a B-tree as an example, the flow of generating an independent second index file for each web page.
Step S310: determine whether the web page contains a next word. If not, end the generation of the second index file for this web page; if so, perform step S320.
Step S320: determine whether a node corresponding to the word exists in the B-tree of the second index file. If it exists, perform step S330; if not, perform step S360.
Step S330: determine whether the identifier (ID) of the web page exists in the linked list of the first index file that the word points to. If it exists, perform step S340; if not, perform step S350.
Step S340: increase by 1 the word frequency recorded for the web page in the linked list of the first index file that the word points to.
Step S350: add the identifier of the web page to the linked list of the first index file that the word points to, and set its word frequency to 1.
Step S360: add a node for the word to the B-tree of the second index file, add the identifier of the web page to the linked list of the first index file, and set the word frequency to 1.
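The per-page index can be illustrated with a short Python sketch. It follows the simpler counting form stated in claim 9 (add 1 to the word frequency if the word's node exists, otherwise add the node with frequency 1) and leaves out the bookkeeping against the first index file's linked list. A `Counter` stands in for the per-page B-tree, words are split on whitespace, and all function names and sample pages are hypothetical.

```python
from collections import Counter


def build_second_index(text):
    """Build one small (second) index for a single page: word -> word frequency."""
    # Counter adds a word with count 1 on first sight and increments it afterwards,
    # which matches the two branches of the per-page flow.
    return Counter(text.split())


def build_second_indexes(pages):
    """Build an independent second index per page, keyed by page identifier."""
    # The page identifier is what links each second index to the first index.
    return {page_id: build_second_index(text) for page_id, text in pages.items()}


# Hypothetical sample pages, keyed by page identifier.
pages = {
    786: "I love China and I love search",
    10298: "I build index files",
}
second_indexes = build_second_indexes(pages)
```

Each second index holds only the words of one page, so a membership test such as `"China" in second_indexes[10298]` touches very little data.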
The generation method of the above embodiment produces, for one web page, the index information storage structure shown in FIG. 4. It comprises multiple index nodes 410, each of which corresponds to one word.
In the embodiments of the present invention the first index file and the second index file use a B-tree as their data structure. Other data structures well known to those skilled in the art may of course be used as the storage form of the index files, and the present invention places no limitation on this.
In the embodiments of the present invention, in addition to the first index file, each web page also has its own independent second index file. This avoids calling the very large first index file repeatedly, so data retrieval is faster. It also makes it easier to query multiple index files at the same time with multiple tasks (multiple processes), especially during an advanced search. An advanced search is generally used for finer and more precise retrieval of web pages, and its search condition may be a single condition or a combined condition. A combined condition is generally a combination of several sub-conditions, each of which is a single condition; the sub-conditions may be joined by "AND", "OR", and "NOT".
Referring to FIG. 5, an embodiment of the data search method of the present invention performs data retrieval based on the index files generated as described above and includes the following steps:
Step S510: generate the first index file for all the information to be searched.
Step S520: generate an independent second index file for each piece of information.
Then, according to the input keywords, the first index file and the second index files are searched in combination to obtain the relevant information. The search may be a general search or an advanced search:
For a general search, the first index file is searched according to the input query condition to obtain the relevant information. The search results may also be sorted by the number of times the keyword occurs in the information and then output.
An advanced search includes:
Step S530: decompose the input query condition;
Step S540: search the first index file for the first sub-condition and obtain the identifiers of the information that satisfies the first sub-condition;
Step S550: according to the second sub-condition, search the second index files corresponding to the information that satisfies the first sub-condition.
If more than one piece of information satisfies the first sub-condition, multiple tasks may be used to search the second index files corresponding to those pieces of information separately. Alternatively, the second index files corresponding to the information that satisfies the previous sub-condition may be searched one after another.
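The two alternatives for step S550, multiple tasks or one file after another, can be sketched as follows. This is an illustration only: threads from Python's standard `concurrent.futures` module stand in for the "multiple tasks", each second index is modeled as a dictionary of word frequencies, and the function name `search_second_indexes` and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_second_indexes(candidate_ids, second_indexes, word, parallel=True):
    """Keep the candidates whose own second index contains `word`."""
    def hit(page_id):
        # One small lookup per candidate; no large index is touched here.
        return word in second_indexes[page_id]

    if parallel:
        # One task per candidate; map() returns results in input order.
        with ThreadPoolExecutor() as pool:
            flags = list(pool.map(hit, candidate_ids))
    else:
        # Sequential alternative: check the second indexes one after another.
        flags = [hit(page_id) for page_id in candidate_ids]
    return [page_id for page_id, ok in zip(candidate_ids, flags) if ok]


# Hypothetical second indexes: page identifier -> {word: frequency}.
second_indexes = {1: {"a": 1, "b": 2}, 2: {"a": 1}, 3: {"b": 1}}
parallel_hits = search_second_indexes([1, 2, 3], second_indexes, "b")
sequential_hits = search_second_indexes([1, 2, 3], second_indexes, "b", parallel=False)
```

Both modes return the same identifiers; whether the parallel form is actually faster depends on how the second index files are stored and loaded, which the text does not specify.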
Taking the B-tree shown in FIG. 4 as an example, an advanced search may first decompose the input query condition into several sub-conditions and search the large index file (the first index file) for the first sub-condition, obtaining the identifiers of all documents that satisfy it. On that basis, the small index files (the second index files) corresponding to those document identifiers are searched, one after another or at the same time, for the second sub-condition. In the same way, if there is a further sub-condition, the search for it is performed in the second index files of the documents that satisfied the previous sub-condition, which yields a more precise query result.
For example, to find the documents that contain both "我" ("I") and "中国" ("China"), the search engine first finds in the large index file all documents that contain "我", namely documents 10298 and 786. It then starts two tasks (processes or threads) that search for the keyword "中国" in the index files of document 10298 and document 786 respectively. A single document such as 10298 has little index information, so the search takes little time: it is quickly determined that document 10298 does not contain the word "中国" and that document 786 does. The document containing both "我" and "中国" is therefore 786.
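The worked example with documents 10298 and 786 can be reproduced with a small Python sketch. The document identifiers and the two keywords come from the example; the word frequencies, the extra document 42, and the function name `advanced_search` are invented for illustration, and dictionaries again stand in for the B-tree index files.

```python
# Large (first) index: word -> {document id: frequency}. Frequencies are made up.
first_index = {
    "我": {10298: 1, 786: 2},      # "I"
    "中国": {786: 1, 42: 3},       # "China"; document 42 is hypothetical
}
# Small (second) indexes: document id -> {word: frequency}.
second_indexes = {
    10298: {"我": 1},
    786: {"我": 2, "中国": 1},
    42: {"中国": 3},
}


def advanced_search(sub_conditions, first_index, second_indexes):
    """Search a non-empty list of keywords that must all occur in a document."""
    first, *rest = sub_conditions
    # The large index is consulted once, for the first sub-condition only.
    candidates = list(first_index.get(first, {}))
    # Every later sub-condition is checked in the candidates' own small indexes.
    for word in rest:
        candidates = [doc for doc in candidates if word in second_indexes[doc]]
    return candidates


result = advanced_search(["我", "中国"], first_index, second_indexes)
```

Here `result` is `[786]`: document 10298 is dropped as soon as its small index shows that it lacks "中国", and document 42 is never examined because it did not satisfy the first sub-condition.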
Because existing search methods have only one index file, a search on two or more sub-conditions can only queue the sub-conditions, call that index file for each in turn, and then compute the intersection of the information sets that satisfy each sub-condition to obtain the result that meets the combined query condition. Repeatedly calling the very large first index file and computing the intersection of the sets makes the search slow. The embodiments of the present invention depart from the single-index-file arrangement of the prior art by providing multiple second index files. Each index file loaded is therefore small, the large first index file need not be loaded every time, little memory is occupied, and the step of merging and intersecting the document IDs retrieved for each condition is omitted, which saves search time and increases search speed. Multiple processes can also readily be used to speed up the search further.
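For comparison, the single-index approach described above can be sketched as one lookup in the large index per sub-condition followed by a set intersection. This is an illustrative reading of the prior-art description, not code from any particular system; the function name `single_index_search` and the sample index are hypothetical.

```python
def single_index_search(words, index):
    """Prior-art style: query the one large index per word, then intersect."""
    # One pass over the large index for every sub-condition.
    id_sets = [set(index.get(word, {})) for word in words]
    # The intersection step that the two-level scheme is meant to avoid.
    return set.intersection(*id_sets) if id_sets else set()


# Hypothetical large index: word -> {document id: frequency}.
index = {
    "我": {10298: 1, 786: 2},   # "I"
    "中国": {786: 1},           # "China"
}
matches = single_index_search(["我", "中国"], index)
```

The result is the same set of documents, but every sub-condition costs a lookup in the large index and the identifier sets must be materialized and intersected.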
Step S560: the data search method may further sort the search results by the number of times the keyword occurs in the information and output the result.
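The ordering in step S560 can be sketched as a sort on keyword frequency. The text does not say how the counts for several keywords are combined, so summing them is an assumption made here; the function name `rank_by_frequency` and the sample data are likewise hypothetical.

```python
def rank_by_frequency(doc_ids, keywords, second_indexes):
    """Order documents by how often the keywords occur in them, highest first."""
    def score(doc_id):
        # Assumption: the score is the sum of the per-keyword frequencies.
        return sum(second_indexes[doc_id].get(word, 0) for word in keywords)

    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(doc_ids, key=score, reverse=True)


# Hypothetical second indexes: document id -> {word: frequency}.
second_indexes = {
    1: {"a": 1},
    2: {"a": 3, "b": 1},
    3: {"b": 2},
}
ranked = rank_by_frequency([1, 2, 3], ["a", "b"], second_indexes)
```

With the sample data the scores are 1, 4 and 2, so document 2 is listed first, then 3, then 1.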
To better meet users' data retrieval needs, merely outputting the information that matches the query condition may not be enough in some situations. The relevance between documents and keywords also needs to be evaluated and the results to be output need to be sorted, so as to provide a better user feedback mechanism. An inverted file is built with the word as its unit and is matched against the search terms; the relevance of each piece of information to the query is determined from the frequency and probability with which the user's query keywords occur in it, the information containing the search terms is sorted, and the search results are output.
Referring to FIG. 6, an embodiment of the search engine of the present invention may include the following units:
A search module 610, configured to acquire information. In a specific implementation, the search module 610 may use program code to discover and collect information on the network, crawl and analyze it by following web page links, and add it to a database, thereby acquiring the electronic information.
An index module 620, configured to generate the first index file for the information acquired by the search module, and to generate an independent second index file for each piece of information.
In a specific implementation, the index module 620 mainly interprets the information collected by the search module 610, extracts index entries from it, and generates corresponding descriptive information to represent it. It builds an index table of the information and forms a unified physical index database, so that unstructured information becomes structured.
In the embodiments of the present invention, in addition to generating one index file, the index module 620 generates an independent second index file for each single piece of information. This reduces the number of times the first index file is called and allows multiple tasks to be used for the search, which increases search speed.
A query module 630, which performs a query according to the input keywords and outputs the query result.
According to the user's query, the query module 630 quickly retrieves information from the first index file and the second index files, or from the first index file alone, evaluates the relevance between the information and the keywords, and sorts the results to be output, providing a better user feedback mechanism.
In a specific implementation, the query module 630 scans every word in a document to build an inverted file with the word as its unit and matches it against the search terms. It determines the relevance of each document to the query from the frequency and probability with which the user's query keywords occur in the document, sorts the documents containing the search terms, and outputs the search results.
An interface unit 640, configured to input keywords and display the query results.
Referring also to FIG. 7, in one embodiment the query module includes:
A query condition decomposition module 631, configured to decompose the input query condition, obtain the first sub-condition and the second sub-condition, and provide them to the advanced search module;
An advanced search module 632, configured to search the first index file to obtain the identifiers of the information that satisfies the first sub-condition and, according to the second sub-condition, to search the second index files corresponding to those identifiers to obtain the search result;
A general search module 633, configured to search the first index file according to the input query condition to obtain the search result;
A sorting module 634, configured to sort the search results obtained by the advanced search module 632 or the general search module 633.
In one specific implementation, the advanced search module 632 may include:
A first index file search module 6321, configured to search the first index file to obtain the identifiers of the information that satisfies the first sub-condition;
A second index file search module 6322, configured to use multiple tasks to search separately the second index files corresponding to the identifiers of the multiple pieces of information that satisfy the first sub-condition, or to search those second index files one after another.
The data search method, the index file generation method, and the search engine provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for searching data, characterized by comprising:
searching a first index file to obtain the identifiers of the information that satisfies a first sub-condition, wherein the first index file corresponds to a plurality of pieces of information to be searched; and
according to a second sub-condition, searching the second index files corresponding to the identifiers of the information that satisfies the first sub-condition, to obtain a search result, wherein each second index file corresponds to one information identifier.
2. The method for searching data according to claim 1, characterized in that, before the searching of the first index file, the method further comprises: decomposing an input query condition to obtain the first sub-condition and the second sub-condition.
3. The method for searching data according to claim 1, characterized in that, when a plurality of pieces of information satisfy the first sub-condition, the step of searching the second index files is: using multiple tasks to search separately the second index files corresponding to the identifiers of the plurality of pieces of information that satisfy the first sub-condition; or searching those second index files one after another.
4. The method for searching data according to claim 1, characterized in that, when there are a plurality of second sub-conditions, the step of searching the second index files is: when the second index file corresponding to the identifier of information that satisfies the first sub-condition satisfies the previous second sub-condition, searching that file according to the next second sub-condition.
5. The method for searching data according to claim 1, characterized in that the first index file or the second index file stores data using a B-tree, a binary search tree, a balanced tree, or a heap.
6. The method for searching data according to claim 1, characterized by further comprising: sorting the search results by the number of times the sub-condition occurs in the information, and outputting them.
7. A method for generating index files, characterized by comprising:
generating a first index file for all the information to be searched; and
generating an independent second index file for each piece of information, wherein the first index file and the second index files are associated through the identifiers of the information.
8. The method for generating index files according to claim 7, characterized in that the step of generating an independent second index file for each piece of information is: generating, for each piece of information, an index file in the form of a B-tree.
9. The method for generating index files according to claim 8, characterized in that the step of generating, for each piece of information, an index file in the form of a B-tree comprises:
determining whether the information contains a word and, if not, ending the generation of the second index file; if so, determining whether a node corresponding to the word exists in the B-tree of the second index file; if the node exists, increasing its word frequency by 1; if the node does not exist, adding a node for the word to the B-tree of the second index file and setting its word frequency to 1.
10. The method for generating index files according to claim 7, characterized in that the step of generating an independent second index file for each piece of information is: generating, for each piece of information, an index file in the form of a binary search tree, a balanced tree, or a heap.
11. A search engine, characterized by comprising:
a search module, configured to acquire information;
an index module, configured to generate a first index file for the information acquired by the search module, and to generate an independent second index file for each piece of information; and
an advanced search module, configured to search the first index file to obtain the identifiers of the information that satisfies a first sub-condition and, according to a second sub-condition, to search the second index files corresponding to those identifiers to obtain a search result.
12. The search engine according to claim 11, characterized by further comprising: a query condition decomposition module, configured to decompose an input query condition, obtain the first sub-condition and the second sub-condition, and provide them to the advanced search module.
13. The search engine according to claim 11, characterized by further comprising: a general search module, configured to search the first index file according to an input query condition to obtain a search result.
14. The search engine according to claim 13, characterized by further comprising: a sorting module, configured to sort the search results obtained by the advanced search module or the general search module.
15. The search engine according to claim 11, characterized in that the advanced search module comprises:
a first index file search module, configured to search the first index file to obtain the identifiers of the information that satisfies the first sub-condition; and
a second index file search module, configured to use multiple tasks to search separately the second index files corresponding to the identifiers of the multiple pieces of information that satisfy the first sub-condition, or to search those second index files one after another.
PCT/CN2007/000244 2006-01-25 2007-01-23 Method of data retrieval, method of generating index files and search engine WO2007085187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610002759.9 2006-01-25
CN2006100027599A CN1858737B (en) 2006-01-25 2006-01-25 Method and system for data searching

Publications (1)

Publication Number Publication Date
WO2007085187A1 true WO2007085187A1 (en) 2007-08-02

Family

ID=37297646

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/000244 WO2007085187A1 (en) 2006-01-25 2007-01-23 Method of data retrieval, method of generating index files and search engine

Country Status (2)

Country Link
CN (1) CN1858737B (en)
WO (1) WO2007085187A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981933A (en) * 2012-12-18 2013-03-20 武汉邮电科学研究院 Method and system for data increment backup of sensing layer of Internet of Things
CN109783508A (en) * 2018-12-29 2019-05-21 亚信科技(南京)有限公司 Data query method, apparatus, computer equipment and storage medium
CN111190893A (en) * 2018-11-15 2020-05-22 华为技术有限公司 Method and device for establishing feature index
CN112100399A (en) * 2020-09-09 2020-12-18 杭州凡闻科技有限公司 Knowledge graph model creating method based on knowledge system and graph retrieval method
CN113535788A (en) * 2021-07-12 2021-10-22 中国海洋大学 Retrieval method, system, equipment and medium for marine environment data

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833511B (en) * 2010-03-29 2012-06-06 瑞斯康达科技发展股份有限公司 Data management method, device and system
CN101930472A (en) * 2010-09-09 2010-12-29 南京中兴特种软件有限责任公司 Parallel query method for distributed database
CN102567415B (en) * 2010-12-31 2013-11-06 百度在线网络技术(北京)有限公司 Control method and device of database
CN102831174B (en) * 2012-07-27 2015-06-17 福建富士通信息软件有限公司 Method and system for rapidly checking structured information
CN104536962A (en) * 2014-11-11 2015-04-22 珠海天琴信息科技有限公司 Data query method and data query device used in embedded system
CN105653654B (en) * 2015-12-28 2021-11-30 上海瀚之友信息技术服务有限公司 Lottery qualification indexing system and method
CN106411530B (en) * 2016-10-25 2021-08-06 国家电网公司 Quick comparison and search method for weak password
CN107766414B (en) * 2017-09-06 2020-06-12 北京三快在线科技有限公司 Multi-document intersection acquisition method, device and equipment and readable storage medium
CN107704599A (en) * 2017-10-16 2018-02-16 杭州安恒信息技术有限公司 A kind of method and apparatus for realizing EDR data quick-searchings
CN111666370B (en) * 2020-07-28 2022-04-22 中国人民解放军国防科技大学 Semantic indexing method and device for multi-source heterogeneous space data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1302412A (en) * 1997-07-22 2001-07-04 微软公司 Apparatus and methods for information retrieval system that employs natural language processing of search results to improve overall precision
CN1430165A (en) * 2001-12-29 2003-07-16 财团法人资讯工业策进会 Key association system and method for searching engine
WO2005022402A1 (en) * 2003-08-29 2005-03-10 Vortaloptics, Inc. Method, device and software for querying and presenting search results

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981933A (en) * 2012-12-18 2013-03-20 武汉邮电科学研究院 Method and system for data increment backup of sensing layer of Internet of Things
CN102981933B (en) * 2012-12-18 2015-05-13 武汉邮电科学研究院 Method and system for data increment backup of sensing layer of Internet of Things
CN111190893A (en) * 2018-11-15 2020-05-22 华为技术有限公司 Method and device for establishing feature index
CN111190893B (en) * 2018-11-15 2023-05-16 华为技术有限公司 Method and device for establishing feature index
CN109783508A (en) * 2018-12-29 2019-05-21 亚信科技(南京)有限公司 Data query method, apparatus, computer equipment and storage medium
CN109783508B (en) * 2018-12-29 2021-04-09 亚信科技(南京)有限公司 Data query method and device, computer equipment and storage medium
CN112100399A (en) * 2020-09-09 2020-12-18 杭州凡闻科技有限公司 Knowledge graph model creating method based on knowledge system and graph retrieval method
CN112100399B (en) * 2020-09-09 2023-12-22 杭州凡闻科技有限公司 Knowledge system-based knowledge graph model creation method and graph retrieval method
CN113535788A (en) * 2021-07-12 2021-10-22 中国海洋大学 Retrieval method, system, equipment and medium for marine environment data
CN113535788B (en) * 2021-07-12 2024-03-05 中国海洋大学 Ocean environment data-oriented retrieval method, system, equipment and medium

Also Published As

Publication number Publication date
CN1858737B (en) 2010-06-02
CN1858737A (en) 2006-11-08

Similar Documents

Publication Publication Date Title
WO2007085187A1 (en) Method of data retrieval, method of generating index files and search engine
Li et al. Pfp: parallel fp-growth for query recommendation
US7930288B2 (en) Knowledge extraction for automatic ontology maintenance
CN104794242B (en) Searching method
US9275128B2 (en) Method and system for document indexing and data querying
US20100121790A1 (en) Method, apparatus and computer program product for categorizing web content
CN105183884A (en) Search engine system and method based on big data technique
US20110264997A1 (en) Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text
Cheng et al. A web services discovery approach based on mining underlying interface semantics
Elshater et al. godiscovery: Web service discovery made efficient
CN110162522A (en) A kind of distributed data search system and method
Adamu et al. A survey on big data indexing strategies
JP4207438B2 (en) XML document storage / retrieval apparatus, XML document storage / retrieval method used therefor, and program thereof
CN112883030A (en) Data collection method and device, computer equipment and storage medium
US20120130999A1 (en) Method and Apparatus for Searching Electronic Documents
CN103226601B (en) A kind of method and apparatus of picture searching
Carta et al. Iterative zero-shot llm prompting for knowledge graph construction
CN107229714B (en) Full-text search engine based on distributed database
Cheng et al. MISDA: web services discovery approach based on mining interface semantics
US8484221B2 (en) Adaptive routing of documents to searchable indexes
US10380195B1 (en) Grouping documents by content similarity
CN111625570B (en) List data resource retrieval method and device
Ren et al. A summary of research on web data acquisition methods based on distributed crawler
Huang et al. LiveIndex: A distributed online index system for temporal microblog data
CN113032450A (en) Data storage and retrieval method, system, storage medium and processing terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 07702173; Country of ref document: EP; Kind code of ref document: A1)