TW201131400A - Information indexing method and system thereof - Google Patents


Info

Publication number
TW201131400A
Authority
TW
Taiwan
Prior art keywords
document
word
index data
keyword
text
Prior art date
Application number
TW99106912A
Other languages
Chinese (zh)
Other versions
TWI485570B (en)
Inventor
Yi Luo
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to TW099106912A
Publication of TW201131400A
Application granted
Publication of TWI485570B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an information retrieval (indexing) method and system, applied to an information retrieval system in which every document has corresponding forward index data, to solve the problem of poor retrieval efficiency in existing information retrieval technology. The disclosed method includes: receiving a query and obtaining the keywords it contains through word segmentation; looking up, through the inverted index data of the information retrieval system, the documents that match the keywords and the forward index data corresponding to those documents; determining the abstract (summary) of each document from its forward index data; and outputting the abstract together with the document information as the retrieval result. The invention can increase the efficiency of information retrieval while preserving retrieval accuracy to some extent.

Description

VI. Description of the invention:
[Technical field to which the invention pertains]
This application relates to full-text data retrieval technology in the field of communications, and in particular to an information retrieval method and an information retrieval system.
[Prior Art]
With the rapid spread of Internet search engine technology and the rapid growth of Internet search companies, information retrieval systems (also called search engines) have become an indispensable tool for more and more people using the Internet.
When people use a search engine, the usual scenario is to enter a query and then obtain the needed search results from the search engine backend. A search result consists of three main elements: title, summary, and URL link (commonly known in the industry as TAU, from the initial letters of the English words Title, Abstract, and Url). Of the three, the summary (Abstract) carries the most information, takes up the most space on the final page, and, from the end user's point of view, largely determines whether a search result is judged correct, because it lets the user decide from the information provided whether the result is what they need. A high-performance, scalable, easily customized summary generation system with a good human-machine interface is therefore an indispensable component of every search engine (that is, information retrieval system).
The traditional summary generation method retrieves the full-text data in real time according to the query entered by the user. On the basis of the full-text retrieval results, it computes word frequency, word distance, and other parameters, applies algorithms such as text matching and weighted scoring, and extracts the text passage that best matches the query as the summary. The retrieval result, containing the title, summary, and URL link, is then returned to the retrieval client for final presentation.
In a traditional search engine, retrieval requires matching against the entire full-text data, and the summary is also generated from the full-text data. Because full-text data is usually large, retrieval takes a long time and retrieval efficiency is low.
[Summary of the Invention]
The embodiments of this application provide an information retrieval method and system to solve the problem of low retrieval efficiency in existing information retrieval technology.
The information retrieval method provided by this application is applied to an information retrieval system in which each document has corresponding forward index data, where the forward index data of each document uses each word in the document as the index key and records the positions of that word in the document. The method includes the following steps:
receiving a query, and obtaining the keywords contained in the query through word segmentation;
looking up, through the inverted index data of the information retrieval system, the documents that match the keywords, and the forward index data corresponding to those documents; and
determining the summary of each document from the forward index data corresponding to that document, and outputting the summary together with the document information as the retrieval result.
The information retrieval system provided by this application includes a storage module that stores the inverted index data of the system and the forward index data corresponding to each document, where the forward index data of each document uses each word in the document as the index key and records the positions of that word in the document. The system further includes:
an input module, for receiving a query input to the system;
a retrieval module, for segmenting the query to obtain the keywords it contains, looking up through the inverted index data the documents that match the keywords and the forward index data corresponding to those documents, and determining the summary of each document from its forward index data; and
an output module, for outputting the summary and the document information as the retrieval result.
In the above embodiments, retrieval is performed with the inverted index data, and once the documents matching the query have been found, their summaries are generated from their forward index data. This exploits the fact that a document's forward index data is smaller than its full-text data while still describing its content with reasonable accuracy. Compared with the results a traditional information retrieval system obtains through full-text retrieval, this improves the efficiency of query matching and of summary generation, while still ensuring, to a certain extent, the accuracy of the retrieval results and of the generated summaries.
[Embodiment]
The embodiments of this application are described in detail below with reference to the accompanying drawings.
The information retrieval method provided by the embodiments is applied to an information retrieval system that must not only store the inverted index data of the document collection (a document being the full-text data, here and below) but also provide independent forward index data for each document. Current general-purpose information retrieval systems usually contain the inverted index data of the document collection and also the independent forward index data of each document (for example, the full-text search system of the general-purpose open-source software PostgreSQL). For an information retrieval system that does not yet hold independent forward index data for each document, various technical approaches can be used to derive each document's forward index data from the inverted index data of the whole document collection. This process can be completed offline and does not affect the performance of the online retrieval service.
The inverted index data of a document collection is a document index built on words. For example, each word is one record in the database, with the word as the key, followed by information such as document IDs and positions. Suppose there are three documents, file1, file2, and file3, with the following contents:
file1 (word 1, word 2, word 3, word 4, ...)
file2 (word a, word b, word c, word d, word a, word c, word d, ...)
file3 (word 1, word a, word 3, word d, ...)
The inverted index data of the document collection formed by these documents includes:
word 1 (file1, file3), word 2 (file1), word 3 (file1, file3), word a (file2, file3), and so on.
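The relationship between the two kinds of index data can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the toy word lists, the 1-based word positions, and the names `DOCS`, `build_inverted_index`, and `derive_forward_index` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> words in order of appearance.
DOCS = {
    "file1": ["w1", "w2", "w3", "w4"],
    "file2": ["a", "b", "c", "d", "a", "c", "d"],
    "file3": ["w1", "a", "w3", "d"],
}


def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]}, with 1-based positions."""
    inverted = defaultdict(dict)
    for doc_id, words in docs.items():
        for pos, word in enumerate(words, start=1):
            inverted[word].setdefault(doc_id, []).append(pos)
    return dict(inverted)


def derive_forward_index(inverted):
    """Regroup the inverted index by document: doc_id -> {word: [positions]}."""
    forward = defaultdict(dict)
    for word, postings in inverted.items():
        for doc_id, positions in postings.items():
            forward[doc_id][word] = list(positions)
    return dict(forward)


inverted_index = build_inverted_index(DOCS)
forward_index = derive_forward_index(inverted_index)
```

Because `derive_forward_index` only regroups data that the inverted index already holds, it could plausibly run offline, as the text suggests, without touching the online retrieval path.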
The forward index data of a document is an index, built on the words contained in that document, of the positions of each word in the document. For example, for file2 above, if the positions at which the words occur in file2 are denoted pos1, pos2, ..., its forward index data is:
file2 (word a: pos1, pos5; word b: pos2; word c: pos3, pos6; word d: pos4, pos7; ...)
As can be seen, the forward index data of each document can be obtained from the documents and their inverted index data.
To facilitate data maintenance and management, the document collection and its inverted index data, as well as the independent forward index data of each document, can be organized as databases; they can of course also be organized in other forms, such as files. The embodiments in this application are described using the example of a document collection and per-document forward index data organized as databases.
FIG. 1 is a schematic flowchart of the information retrieval system in an embodiment of this application providing an information retrieval service to a user. The information retrieval system in the following flow includes a full-text database storing the full-text data of the documents, an inverted index library storing the inverted index data, and a forward index library storing the forward index data of each document. The full-text database and the forward index library, and the full-text database and the inverted index library, are associated through a defined index (such as the document ID), so that each document's forward index corresponds one-to-one with its document.
The information retrieval flow based on this system includes:
Step 101: the information retrieval system accepts the query submitted by the user.
The information retrieval system can provide the user with an information retrieval interface.
The interface lets the user enter a query and issue the command that submits the query to the information retrieval system. The query entered by the user may be a single word, a phrase composed of several words, or several words (or phrases) connected by relational operators (and, or, etc.).
Step 102: the information retrieval system segments the query submitted by the user to obtain all the keywords contained in it, finds the documents matching these keywords through the inverted index data in the inverted index library, finds the forward index data corresponding to these documents in the forward index library, generates a summary of each document from that document's forward index data, and uses the generated summaries together with information such as the title and Url of the corresponding documents as the retrieval result.
In this step, after determining the documents that match the keywords, the system can obtain the forward index data corresponding to each of them. Because the system may retrieve multiple documents, to make the results more usable and improve the user experience it can sort the retrieved documents in descending order of how well they match the keywords, which yields a corresponding document ID sequence.
Step 103: the information retrieval system outputs the retrieval result according to the generated retrieval result and the document ID sequence. The result can be presented on the result interface provided by the system and may include the document title, Url, and summary.
The document summary usually contains the keywords; preferably, it is the portion of the document's text that best matches the keywords.
In this step, the system can use the document ID sequence obtained in the previous step to fetch information such as the title and Url of the corresponding documents from the full-text database (normally, the full-text index library can use the document ID, title, Url, and so on as index data for the full-text content of a document). Of course, if the forward index library also contains the document title and Url, this information can be obtained directly from the forward index library. The results are then output or displayed in the order of the document ID sequence. The system can also specify the format of the output retrieval results and the length of the document summary.
As shown in FIG. 2, the information retrieval system may include the following functional modules: an input module 21, a retrieval module 22, and an output module 23, as well as a full-text database 24, an inverted index library 25, and a forward index library 26. The full-text database 24 stores the full-text data of each document, which can be indexed by document ID, document title, and Url; the inverted index library 25 stores the inverted index data; the forward index library 26 stores the forward index data of each document. The functions of the modules correspond to the respective steps of the flow above. The retrieval module 22 may further include a retrieval submodule 221, a summary generation submodule 222, and a result submission submodule 223.
The information retrieval flow is described in further detail below with reference to the information retrieval system shown in FIG. 2.
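Steps 101 to 103 above can be sketched as a small pipeline. The sketch rests on several assumptions that the text does not fix: word segmentation is replaced by a whitespace split that drops the operators `and` and `or`, the match degree is taken to be the total number of keyword occurrences, and the index contents, titles, and `example.com` URLs are invented for illustration.

```python
# Hypothetical index contents; in the described system they would come from
# the inverted index library and the full-text database.
INVERTED = {
    "computer": {100: [2, 50, 90], 200: [10, 70]},
    "security": {100: [25], 200: [15]},
}
METADATA = {
    100: {"title": "Security technology applied to computers",
          "url": "http://example.com/100"},
    200: {"title": "How to improve computer security",
          "url": "http://example.com/200"},
}


def segment(query):
    """Stand-in for word segmentation: split on whitespace, drop operators."""
    return [w for w in query.lower().split() if w not in {"and", "or"}]


def retrieve(query, inverted):
    """Steps 101-102: return the keywords and the matching document IDs,
    sorted by descending match degree (assumed: keyword occurrence count)."""
    keywords = segment(query)
    scores = {}
    for kw in keywords:
        for doc_id, positions in inverted.get(kw, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + len(positions)
    doc_id_sequence = sorted(scores, key=lambda d: (-scores[d], d))
    return keywords, doc_id_sequence


def output(doc_id_sequence, metadata, summaries):
    """Step 103: emit title, Url and summary in document ID sequence order."""
    return [
        {"id": d, "title": metadata[d]["title"], "url": metadata[d]["url"],
         "summary": summaries.get(d, "")}
        for d in doc_id_sequence
    ]
```

A call such as `retrieve("computer and security", INVERTED)` produces the keyword list and the ranked document ID sequence; the summaries passed to `output` would come from the summary generation stage described in the rest of the description.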
After receiving the query submitted by the user, the input module 21 passes it to the retrieval module 22. The retrieval module 22 segments the query to obtain the keywords, searches the inverted index library 25 for documents matching the keywords, generates a summary of each such document from its forward index data in the forward index library 26, and submits the retrieval result containing the document summary to the output module 23. If the document title and Url also need to be output, they can be obtained from the full-text database 24 or the inverted index library 25. The output module 23 outputs the retrieval result after receiving it.
For each document matching the keywords, the retrieval module 22 can scan every word in the corresponding forward index data in the forward index library 26 to build a forward sequence of all words and a forward index sequence containing only the keywords, and then use these two sequences to determine the start and end positions of the summary text segment in the forward sequence or the full-text data, thereby obtaining the document summary. Implementing the retrieval function of the retrieval module 22 requires some variables, which may include:
Document ID sequence: stores the IDs of the retrieved documents;
Array_A array: stores the forward sequence of all the words contained in a retrieved document. The forward sequence records, in order, the text of each word and the position at which it occurs in the document's full-text data. It can be represented as a list of word texts and positions, for example:

Array_A records (pos1:word1, pos2:word2, pos3:word3, pos4:word1, ...)
where pos is the character position, in the full-text data, of the first character of the corresponding word text, and word is the word text. That is, Array_A is indexed by word position and records the text of each word.

Map_A array: stores the forward index sequence of the keywords obtained by segmenting the query. This sequence records the positions at which each keyword occurs in the full-text data of the corresponding document. The keyword texts and position lists can be organized and stored in a red-black tree (a red-black tree is a particular type of binary tree, a structure used in computer science to organize blocks of data such as numbers; all data blocks are stored in its nodes). For example, if word1 and word2 in the Array_A above are keywords, then:

Map_A records (word1: pos1, pos4; word2: pos2; ...)
where pos is the character position, in the full-text data, of the first character of the corresponding keyword text, and word is the keyword text. That is, Map_A is indexed by word text (for words that match a keyword) and records the positions of each such word.

Res_Beg, Res_End: the start and end positions of the summary text segment;

Best_Path: the best summary path. It contains start and end position parameters that delimit a text segment in the forward index data or the full-text data; the text segment delimited by the best summary path matches the keywords better than the other candidate segments;
RL: the length of the summary text, usually expressed as a character count, which can be assigned at system initialization.
The retrieval process of the retrieval module 22 has two phases: a document retrieval phase, which retrieves the documents matching the keywords (for example, documents that contain the keywords, or that contain words with a meaning equivalent to the keywords); and a summary generation phase, which generates a summary for each retrieved document.
In the document retrieval phase, the retrieval submodule 221 segments the query submitted by the user to obtain all the keywords contained in it. It then performs a matching lookup in the inverted index library 25 to find the IDs of the documents matching the keywords, sorts the document IDs in descending order of match degree, and stores the sorted IDs as the document ID sequence. The word segmentation in this phase can be implemented with various techniques, such as existing Chinese word segmentation technology; the matching lookup in the inverted index library 25 can be implemented with existing full-text retrieval technology. It should be understood that the word segmentation and full-text retrieval techniques used in the embodiments do not limit the scope of protection of this application.
In the summary generation phase, the summary generation submodule 222 traverses, in the forward index library 26, the forward index data corresponding to each document ID recorded in the document ID sequence. For the forward index data of each document ID, it records the position and text of every word traversed in Array_A (usually an array data structure); if the word is the same as a keyword, it also records the text and position of the word in Map_A.
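The traversal that fills Array_A and Map_A can be sketched as follows. This is a minimal sketch under stated assumptions: the forward index data of one document is modelled as a plain dict from word text to a list of positions, Map_A is an ordinary dict rather than the red-black tree the description mentions, and the function name `scan_forward_index` is hypothetical.

```python
def scan_forward_index(forward_data, keywords):
    """Build Array_A and Map_A from one document's forward index data.

    forward_data: dict mapping word text -> list of positions (assumed shape).
    Returns (array_a, map_a) where array_a is a position-ordered list of
    (position, word_text) pairs and map_a maps keyword text -> positions.
    """
    array_a = []
    map_a = {}
    keyword_set = set(keywords)
    for word, positions in forward_data.items():
        # Every word goes into Array_A, once per occurrence.
        for pos in positions:
            array_a.append((pos, word))
        # Words equal to a keyword are additionally recorded in Map_A.
        if word in keyword_set:
            map_a[word] = sorted(positions)
    array_a.sort()  # order by position so the sequence follows the text
    return array_a, map_a
```

For the file2 example, with `a` and `c` as keywords, `array_a` lists the seven occurrences in text order and `map_a` keeps only the positions of `a` and `c`.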
Once the forward index data of a document has been traversed, if the total length of all the word texts recorded in Array_A does not exceed RL, the best summary path runs from the start position of the first word in Array_A to the end position of the last word. Otherwise, the word texts and position lists recorded in Map_A are traversed to find the positions Res_Beg and Res_End of the start and end words of the shortest text segment that satisfies the conditions; the span from Res_Beg to Res_End in Array_A is then the best summary path. After determining the best summary path, the summary generation submodule 222 locates the corresponding text segment in the forward index data or the full-text data, takes it as the summary text, and submits it to the result submission submodule 223. The conditions that the best summary path must satisfy may include:

the length of the text between Res_Beg and Res_End is RL, or does not exceed RL;

the number of keywords contained between Res_Beg and Res_End is the largest.
If several summary paths satisfy both conditions, the one whose summary text has the largest total number of repeated keyword occurrences is taken as the best summary path.
The result submission submodule 223 uses each document ID recorded in the document ID sequence to look up the corresponding document title, Url, and other information in the full-text database 24 or the inverted index library 25, and submits them together with the summary text to the output module 23.
The output module 23 can be configured with a summary formatting parameter Fmt_Arg (short for Format Argument). The output module 23 formats the summary text, document title, Url, and other information according to Fmt_Arg and then, following the order of the document IDs in the document ID sequence, displays the related information of each document (such as title, Url, and summary) in the specified format, preferably as HTML (Hypertext Markup Language), the page markup language common on the World Wide Web, so that users get prominent, easily recognized results and a better experience.
The flow in which the summary generation submodule 222 traverses a document's forward index data may be as shown in FIG. 3A, and the flow that produces the best summary path after the traversal may be as shown in FIG. 3B.
The summary generation submodule 222 traverses the forward index data corresponding to each document ID in the document ID sequence. As shown in FIG. 3A, after obtaining a document ID from the document ID sequence, it finds the corresponding forward index data in the forward index library 26 and performs the following steps:
Step 301: traverse the current forward index data word by word;
Step 302: determine whether a word was reached; if so, go to step 303; otherwise, go to step 306.
In this step, there are two cases in which no word is reached: the traversal has reached the end of the current forward index data, that is, the traversal of the current forward index data is complete; or the current forward index data is empty.
Step 303: record the text of the word and its position, as recorded in the current forward index data, into Array_A. The start and end positions of each word can be determined from the records in Array_A. For example, for the Chinese word meaning "computer", the start position is the position of its first character and the end position is the position of its last character; for English or other languages that use word separators, the start and end positions of a word can be determined from the separators;
Step 304: determine whether the word is the same as a keyword; if so, go to step 305; otherwise, return to step 301 and continue traversing the remaining words;
Step 305: record the text of the word and its position, as recorded in the current forward index data, into Map_A.
After the summary generation sub-module 222 finishes traversing the forward index data, it determines the best summary path from the Array_A and Map_A recorded during the traversal and assigns the best summary path parameters to Best_Path. The process may be as shown in FIG. 3B and includes the following steps:
Step 310: set the variable N to 0 and set Best_Path to null;
Step 311: according to Map_A, take the position value corresponding to one of the word texts and assign it to Res_Beg; then, according to RL, determine from Array_A the position value of a word text such that the text segment of length RL delimited from Res_Beg to the determined position value contains a keyword. If such a position value can be obtained, go to step 312; otherwise, go to step 316;
Step 312: assign the determined position value to Res_End, so that the length of the text between Res_Beg and Res_End does not exceed RL. Preferably, according to the records in Array_A, the position corresponding to Res_Beg is the start position of a word recorded in Array_A (that is, the position of the first character of the word text), and the position corresponding to Res_End is the end position of a word recorded in Array_A (that is, the position of the last character of the word text), so that the text segment between Res_Beg and Res_End is clear and complete;
Step 313: according to Map_A, determine the number of keywords contained in the text segment between Res_Beg and Res_End, and assign that number to the variable n;
Step 314: determine whether n is greater than N; if so, go to step 315; otherwise, return to step 311 to determine the next, different Res_Beg;
Step 315: assign n to N, reset n to zero, record the current Res_Beg and Res_End in Best_Path, and return to step 311 to determine the next, different Res_Beg;
Step 316: the current value of Best_Path is the best summary path; output Best_Path.
As the flow in FIG. 3B shows, the summary generation sub-module 222 performs multiple loop iterations, each time assigning a different position value to Res_Beg, and records the current Res_Beg and Res_End in Best_Path whenever the text segment they delimit contains more keywords than any segment seen so far. The text segment delimited by the Res_Beg and Res_End finally recorded in Best_Path therefore contains the most keywords, which yields the best summary path.
It should be noted that, when determining best summary paths, the best summary path of a document's forward index data can be determined immediately from the traversal result each time one forward index data has been traversed; alternatively, after all forward index data have been traversed, the best summary path of each can be determined from its own traversal result.
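The flows of FIG. 3A and FIG. 3B can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that the forward index data is a mapping from word to start positions, that positions are character offsets, and that a segment's length is counted inclusively from Res_Beg to Res_End. The function names `traverse` and `best_path` are hypothetical. The tie-breaking rule for equally good paths is omitted, so the first best path found is kept, and a Res_Beg that admits no Res_End is skipped rather than ending the loop.

```python
from typing import Dict, List, Optional, Set, Tuple


def traverse(forward_index: Dict[str, List[int]],
             keywords: Set[str]) -> Tuple[Dict[int, str], Dict[str, List[int]]]:
    """FIG. 3A: build Array_A (position -> word) and Map_A (keyword -> positions)."""
    array_a: Dict[int, str] = {}
    map_a: Dict[str, List[int]] = {}
    for word, positions in forward_index.items():       # steps 301-302
        for pos in positions:
            array_a[pos] = word                          # step 303
            if word in keywords:                         # step 304
                map_a.setdefault(word, []).append(pos)   # step 305
    return array_a, map_a


def best_path(array_a: Dict[int, str],
              map_a: Dict[str, List[int]],
              rl: int) -> Optional[Tuple[int, int]]:
    """FIG. 3B: return (Res_Beg, Res_End) of the segment of at most rl
    characters that contains the most keyword occurrences, or None."""
    def end_of(pos: int) -> int:
        # Position of the last character of the word starting at pos.
        return pos + len(array_a[pos]) - 1

    keyword_starts = sorted(p for ps in map_a.values() for p in ps)
    best, best_count = None, 0                           # step 310 (N = 0)
    for res_beg in keyword_starts:                       # step 311
        last_allowed = res_beg + rl - 1
        ends = [end_of(p) for p in array_a
                if p >= res_beg and end_of(p) <= last_allowed]
        if not ends:
            continue                                     # no usable Res_End
        res_end = max(ends)                              # step 312
        n = sum(1 for p in keyword_starts                # step 313
                if p >= res_beg and end_of(p) <= res_end)
        if n > best_count:                               # step 314
            best, best_count = (res_beg, res_end), n     # step 315
    return best                                          # step 316


# Original-language tokens are kept so that word lengths match character
# positions: 電腦 means "computer" and 安全 means "security".
forward_index = {"電腦": [2, 50, 90], "安全": [25]}
array_a, map_a = traverse(forward_index, {"電腦", "安全"})
print(best_path(array_a, map_a, rl=50))  # prints (2, 51)
```

With an inclusive end position, the best segment starts at 2 and ends at 51, the last character of the keyword that starts at position 50, and it contains three keyword occurrences. A different convention for counting the end position would shift the reported end value without changing which segment is chosen.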
The following takes an information retrieval system applied to the Internet as an example and further describes the implementation of the embodiment of the present application through a specific example.
When a new web page is published on the Internet, the forward index data of the page's content is stored in the forward index library of the information retrieval system, and a correspondence with the page's identifier is established. In this example, the newly published web pages and their corresponding forward index data are:
Web document 1: ID = 100, titled "Security technology applied to computers"; the corresponding forward index data is (computer: 2, 50, 90; security: 25; ......), with a length of 100 characters. This indicates that the word "computer" appears at the 2nd, 50th, and 90th character positions of the full text of web document 1, and the word "security" appears at the 25th character position;
Web document 2: ID = 200, titled "How to improve the security of computers"; the corresponding forward index data is (computer: 10, 70; security: 15; ......), with a length of 100 characters. This indicates that the word "computer" appears at the 10th and 70th character positions of the full text of web document 2, and the word "security" appears at the 15th character position.
The inverted index data includes: computer (ID 100, ID 200), security (ID 100, ID 200), ......
The information retrieval system specifies that search results are in HTML format and that the summary text is no longer than 50 characters.
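The index data of this example can be written down as plain mappings. The sketch below is an illustration under assumptions: the source lists the inverted index data but does not say how it is built, so deriving it from the forward index data is only one possible approach, and the names `forward_index_library` and `build_inverted_index` are hypothetical. Only the two words that the example spells out are included; the entries elided in the source stay elided.

```python
from typing import Dict, List

# Forward index data from the example: document ID -> word -> positions.
forward_index_library: Dict[int, Dict[str, List[int]]] = {
    100: {"computer": [2, 50, 90], "security": [25]},
    200: {"computer": [10, 70], "security": [15]},
}


def build_inverted_index(
        forward: Dict[int, Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Map each word to the ascending IDs of the documents that contain it."""
    inverted: Dict[str, List[int]] = {}
    for doc_id in sorted(forward):
        for word in forward[doc_id]:
            inverted.setdefault(word, []).append(doc_id)
    return inverted


inverted_index = build_inverted_index(forward_index_library)
print(inverted_index)
# prints {'computer': [100, 200], 'security': [100, 200]}
```

The forward index answers "where does each word occur in this document", while the inverted index answers "which documents contain this word"; the retrieval flow uses the latter to find candidate documents and the former to build their summaries.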
When the system receives the query "computer security" submitted by a user, it performs word segmentation on the query and obtains the keywords "computer" and "security". It then matches against the inverted index data, finds that the web documents containing these keywords are those with ID = 100 and ID = 200, and determines the corresponding forward index data for these documents. Because the keywords occur 4 times in the web document with ID = 100 and 3 times in the web document with ID = 200, the former is considered to match the query more closely. The system then traverses the two forward index data separately.
After traversing the forward index data of the web document with ID = 100, the following are obtained:

Array_A: (... 2: computer; ... 25: security; ... 50: computer; ... 90: computer; ...)
Map_A: (computer: 2, 50, 90; security: 25)
From the obtained Array_A and Map_A it can further be determined that the text segment from start position 2 to end position 52 contains 3 keywords, more than any other text segment of length 50, so the start and end positions [2, 52] of this segment are taken as the best summary text path.
Similarly, from the traversal result of the forward index data of the web document with ID = 200, the best summary text path is determined to be [1, 50].
The system then generates the summary text according to the determined best summary text paths and, in order of how closely each document matches the query, presents the title, Url, and summary of the web pages with ID = 100 and ID = 200 to the user as search results in HTML format.
The embodiment of the present application also provides an alternative to the above technical solution: when generating a summary, the summary path is obtained not by traversing the forward index data but by traversing the full-text data of the document, and the summary text is then extracted from the full text according to the obtained summary path. The traversal process and the way the summary path is determined are similar to those described above and are not repeated here.
Compared with a traditional information retrieval solution, the technical solution provided by the embodiment of the present application only needs to perform word segmentation on the query, whereas the traditional approach must perform word segmentation on both the query and the full-text data. The query efficiency of the solution provided by the embodiment of the present application is therefore higher than that of the traditional approach.
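Returning to the worked example, the ranking and HTML presentation step can be sketched as follows. This is an illustrative sketch only: the source does not specify the HTML layout, so the list markup is an assumption; the URLs and summary strings are hypothetical placeholders rather than values from the source; and the names `match_degree` and `render_results` are invented for illustration. The match degree is taken to be the total number of keyword occurrences, as in the example (4 for ID = 100, 3 for ID = 200).

```python
from html import escape
from typing import Dict, List


def match_degree(forward_index: Dict[str, List[int]],
                 keywords: List[str]) -> int:
    """Total number of keyword occurrences in one document's forward index data."""
    return sum(len(forward_index.get(k, [])) for k in keywords)


def render_results(docs: Dict[int, dict], keywords: List[str]) -> str:
    """Rank documents by match degree, highest first, and render an HTML list."""
    ranked = sorted(
        docs,
        key=lambda doc_id: match_degree(docs[doc_id]["forward_index"], keywords),
        reverse=True,
    )
    items = []
    for doc_id in ranked:
        d = docs[doc_id]
        # escape() guards against markup characters in titles and summaries.
        items.append('<li><a href="{}">{}</a><p>{}</p></li>'.format(
            escape(d["url"]), escape(d["title"]), escape(d["summary"])))
    return "<ol>\n" + "\n".join(items) + "\n</ol>"


docs = {
    200: {"title": "How to improve the security of computers",
          "url": "http://example.com/200",        # hypothetical URL
          "summary": "(summary text placeholder)",  # hypothetical text
          "forward_index": {"computer": [10, 70], "security": [15]}},
    100: {"title": "Security technology applied to computers",
          "url": "http://example.com/100",        # hypothetical URL
          "summary": "(summary text placeholder)",  # hypothetical text
          "forward_index": {"computer": [2, 50, 90], "security": [25]}},
}
print(render_results(docs, ["computer", "security"]))
```

Although the document with ID = 200 is listed first in the input, the document with ID = 100 is rendered first because its match degree is higher.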
The technical solution provided by the embodiment of the present application uses a document's forward index data to generate the document's summary, whereas the traditional approach uses the document's full-text data. On the one hand, the forward index data of a document contains less data than its full-text data, which improves efficiency. On the other hand, the forward index data makes it comparatively easy to count the frequency and number of occurrences of words, so the summary can be determined conveniently and accurately and can summarize the corresponding full-text data fairly accurately and comprehensively. Generating document summaries from the forward index data therefore helps ensure, to a certain extent, that the search results are reasonable and accurate.
In summary, because the embodiment of the present application makes full use of the existing full-text index structure of the information retrieval system and takes full account of the existing presentation form of search results, it can use more precise and more targeted data when generating summaries, which improves generation efficiency and users' satisfaction with the final results. The information retrieval system provided by the embodiment of the present application is highly cohesive and loosely coupled and is easy to integrate with various existing information retrieval systems; in addition, the system offers high performance, scalability, and easy customization.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention.
It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various modifications and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
[Brief description of the drawings]
FIG. 1 is a schematic flowchart of information retrieval in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of the information retrieval system in an embodiment of the present application;
FIG. 3A and FIG. 3B are schematic diagrams of the processing flow of the retrieval module of the information retrieval system in an embodiment of the present application.
[Description of main component symbols]
21: Input module
22: Retrieval module
23: Output module
24: Full-text database
25: Inverted index library
26: Forward index library
221: Retrieval sub-module
222: Summary generation sub-module
223: Result submission sub-module

Claims (1)

VII. Scope of patent application:
1. An information retrieval method, applied to an information retrieval system in which each document has corresponding forward index data, wherein the forward index data of each document is indexed by the words in the document and records the position of each word in the document, characterized in that the method comprises the following steps:
receiving a query, and obtaining the keyword contained in the query through word segmentation;
looking up, through the inverted index data of the information retrieval system, the document matching the keyword and the forward index data corresponding to the document; and
determining a summary of the document according to the forward index data corresponding to the document, and outputting the summary of the document and the document's information as the retrieval result.
2. The method of claim 1, wherein the summary determined according to the forward index data is specifically: among all text segments not exceeding a specified length that are determined according to the forward index data, the text segment in which the keyword occurs the most times.
3. The method of claim 2, wherein, if there are multiple text segments that do not exceed the specified length and in which the keyword occurs the most times, the one among them in which the keyword is repeated the most times is used as the summary.
4. The method of claim 2, wherein the process of determining the summary of the document according to the forward index data corresponding to the document is specifically:
for each of the documents, performing the following steps:
traversing the words in the forward index data of the document one by one, recording each traversed word in a first data structure indexed by word position, and, when a traversed word matches the keyword, recording the position of the word in a second data structure indexed by word;
determining the start position of the summary according to the records of the second data structure; determining the end position of the summary according to the specified summary length and the records of the first data structure; wherein the length of the text segment between the start position and the end position does not exceed the specified summary length, and the keyword occurs the most times in that text segment; and
generating the summary of the document according to the determined start and end positions of the summary and the forward index data.
5. The method of any one of claims 1 to 4, wherein outputting the summary of the document and the document's information as the retrieval result is specifically:
sorting the summaries and information of the documents, as retrieval results, in descending order of how closely each document matches the keyword, and outputting the sorted retrieval results in the specified data format.
6. An information retrieval system, comprising a storage module that stores the inverted index data of the system and the forward index data corresponding to each document, wherein the forward index data of each document is indexed by the words in the document and records the position of each word in the document, characterized in that the system further comprises:
an input module, configured to receive a query input to the system;
a retrieval module, configured to perform word segmentation on the query to obtain the keyword it contains, look up, through the inverted index data, the document matching the keyword and the forward index data corresponding to the document, and determine a summary of the document according to the forward index data corresponding to the document; and
an output module, configured to output the summary of the document and the document's information as the retrieval result.
7. The system of claim 6, wherein, when determining the summary of the document according to the forward index data corresponding to the document, the retrieval module takes as the summary of the document, among all text segments not exceeding a specified length that are determined according to the forward index data, the text segment in which the keyword occurs the most times.
8. The system of claim 7, wherein, if the retrieval module determines multiple text segments that do not exceed the specified length and in which the keyword occurs the most times, the one among them in which the keyword is repeated the most times is used as the summary.
9. The system of claim 6, wherein the retrieval module comprises:
a retrieval sub-module, configured to perform word segmentation on the query to obtain the keyword it contains, and to look up, through the inverted index data, the document matching the keyword and the forward index data corresponding to the document;
a summary generation sub-module, configured to: for the forward index data corresponding to each of the documents, traverse each word therein one by one, record each traversed word in a first data structure indexed by word position, and, when a traversed word matches the keyword, record the position of the word in a second data structure indexed by word; determine the start position of the summary according to the records of the second data structure, and determine the end position of the summary according to the specified summary length and the records of the first data structure, wherein the length of the text segment between the start position and the end position does not exceed the specified summary length and the keyword occurs the most times in that text segment; and then generate the summary of the document according to the determined start and end positions of the summary and the forward index data; and
a submission sub-module, configured to submit the generated summary of the document and the document's information to the output module.
10. The system of any one of claims 6 to 9, wherein, when outputting the summary of the document and the document's information as the retrieval result, the output module sorts the summaries and information of the documents, as retrieval results, in descending order of how closely each document matches the keyword, and outputs the sorted retrieval results in the specified data format.
TW099106912A 2010-03-10 2010-03-10 Information retrieval method and its system TWI485570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW099106912A TWI485570B (en) 2010-03-10 2010-03-10 Information retrieval method and its system

Publications (2)

Publication Number Publication Date
TW201131400A true TW201131400A (en) 2011-09-16
TWI485570B TWI485570B (en) 2015-05-21

Family

ID=50180363

Family Applications (1)

Application Number Title Priority Date Filing Date
TW099106912A TWI485570B (en) 2010-03-10 2010-03-10 Information retrieval method and its system

Country Status (1)

Country Link
TW (1) TWI485570B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI603211B (en) * 2012-10-09 2017-10-21 Alibaba Group Services Ltd Construction of inverted index system based on Lucene, data processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030050927A1 (en) * 2001-09-07 2003-03-13 Araha, Inc. System and method for location, understanding and assimilation of digital documents through abstract indicia
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
TWI329427B (en) * 2006-09-28 2010-08-21 Univ Nat Chiao Tung Data query method and data coding method thereof
