TW201131390A - Method and system for implementing a search service - Google Patents

Method and system for implementing a search service

Info

Publication number
TW201131390A
Authority
TW
Taiwan
Prior art keywords
index
data
preset
module
writing
Prior art date
Application number
TW99106626A
Other languages
Chinese (zh)
Inventor
Han-fei Yang
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to TW99106626A
Publication of TW201131390A

Abstract

The present application provides a method and system for implementing a search service, to avoid the sharing conflict in index writing and low efficiency in searching in the prior art. In an embodiment, data in a data source is divided into categories according to a preset data classifying manner, and a correspondence relationship is established between each category of data obtained from the division and a preset index; each category of data is written to an index corresponding to the category of data according to the correspondence relationship; and when a search instruction is received, an index is determined according to the received search instruction, and data is outputted according to the determined index. With the solution of the invention, the sharing conflict in index writing may be avoided and a higher search efficiency may be obtained.

Description

[Technical Field]

The present application relates to the field of computer technology, and more particularly to a method and system for implementing a search service.

[Prior Art]

With the development of information technology, the amount of information in various applications is increasing rapidly. To help people find the information they need within this large volume, a number of search services have emerged. These services perform a full-text search of a data source according to keywords or a description of the search target provided by the user, and return the data found to the user.

When implementing a search service, the data in the data source must be converted into an index and saved; this process is usually called index writing. An index is data organized according to certain rules. Because users usually supply keywords in text format when searching, the index is usually also in text format, so that full-text search by keyword can be performed. In some search engines, the content of the index is included in the web page snapshot provided by the search engine. The index contains various information about the data source: for text in the data source, the index can contain the text itself; for image files and other non-text files such as audio and video, the index can store fields giving the source of those files, for example their addresses on the Internet. Index writing is usually performed by an index server. When a user performs a search, a search server receives the search condition given by the user, determines from that condition which index holds the data the user needs, searches further within that index, and provides the information found there to the user.

To convert a large amount of data into indexes, multiple index servers are usually used for index writing. An index file consists of a series of index data items; one index data item is called a document, and in general one document corresponds to one row of records in the source data. Each index server converts the records it extracts from the source data into index entries, and multiple index servers may need to write to the same index file. For reasons of data integrity, an index file cannot be written by multiple index servers at the same time. Therefore, while one index server is writing, the other index servers are idle and must wait until that server has finished before they can write to the index file. This competition among multiple index servers to write to the same shared resource (the index file) is a write sharing conflict. The prior-art index writing method may therefore suffer from poor performance and sharing conflicts caused by write competition among multiple index servers for the same shared resource.

In addition, if an index that has already been formed is large, for example more than 1 GB, it takes longer to find the required content within the index, which reduces search efficiency. On the other hand, if the index capacity is too small, multiple indexes must be opened during a search, which also reduces search efficiency.

Current search services therefore suffer from sharing conflicts in index writing and from low search efficiency, and a new method of implementing a search service is needed.

[Summary of the Invention]

The main object of the present application is to provide a method and system for implementing a search service, so as to solve the prior-art problems of sharing conflicts in index writing and low search efficiency.

To solve the above problems, the present application provides the following technical solutions.

A method for implementing a search service includes:

dividing the data in a data source according to a preset data classification manner, and establishing a correspondence between each category of data obtained by the division and a preset index;

writing each category of data into the index corresponding to that category according to the correspondence; and

when a search instruction is received, determining an index according to the received search instruction, and outputting data according to the determined index.

Writing the data into the index corresponding to that category according to the correspondence includes:

allocating each category of data to one index writing device; and

the index writing device writing the data of each category into the index corresponding to that category according to the correspondence.

Among the preset indexes, the maximum capacity of a single index is set in advance, and the number of preset indexes is determined according to the capacity of the data source and the preset maximum capacity of a single index.

A system for implementing a search service includes:

a partitioning module, configured to divide the data in a data source according to a preset data classification manner, and to store the correspondence between each category of data obtained by the division and a preset index;

one or more index writing modules, configured to write each category of data into the index corresponding to that category according to the correspondence; and

an index storage module, configured to store the indexes.

The system further includes:

an allocation module, configured to allocate the data to the index writing modules under the rule that each category of data is allocated to one index writing module;

the index writing module being further configured to write the data of each category into the index corresponding to that category according to the correspondence.

The system further includes:

a source data storage module, configured to store the data of the data source; and

a search module, configured to receive a search instruction, determine an index in the index storage module according to the received search instruction, and output data according to the determined index.

According to the technical solution of the embodiments of the present application, the data in the data source is divided, each category of data obtained by the division is made to correspond to an index, and data is written to the indexes according to this correspondence. This avoids the sharing conflict in index writing, and information about a given category of data can be obtained by searching only the index corresponding to that category, which gives high search efficiency. Moreover, in the embodiments of the present application, the capacity of an index is chosen so that it is neither too large nor too small. This avoids the inefficiency of searching within an oversized index, and also avoids opening too many small indexes when retrieving data that spans many categories. Both effects help improve search efficiency and thus the quality of the search service.

[Embodiment]

The technical solutions of the embodiments of the present application are described below with reference to the accompanying drawings. The drawings are an aid to understanding the technical solutions of the present application, which in its various implementations is not limited to the form shown in the drawings.

The technical solution of the embodiments can be applied to the network structure shown in FIG. 1. As shown in FIG. 1, a user accesses the search server 11 through a terminal device 10 and sends a search instruction to the search server. The search instruction generally contains a keyword supplied by the user, which indicates a range or a feature of the data the user needs. For example, if the user wants to obtain all of his own mail from a mail server, the keyword can be his mail address; if the user wants to obtain product information, the keyword can be the model number of the product. There may be multiple terminal devices 10 in the network, as shown in the figure.

On receiving a search instruction, the search server 11 extracts the keyword from the instruction and determines the index according to the keyword. The index is data organized according to a rule, and that rule generally matches the user's search behavior. For example, for the large amount of mail stored by a mail server, the data in the mail, including the text of the mail and the link addresses of files in other formats contained in the mail, can be written to a number of preset indexes, each of which contains the mail of one or several mail addresses. A record is kept of which index the data of each mail address was written to. When the search server receives a search instruction for searching mail, the instruction generally contains the user's mail address, so the index into which that mail address was organized can be determined from the mail address and the above record.

The index can be stored in the index storage device 12, which is connected to the index server 13. The task of the index server 13 is to write the data in the data storage device 14 into the index storage device 12 according to certain rules.
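
The lookup performed by the search server in the mail example above can be sketched as follows. This is only an illustrative sketch: the dictionary layout, the function names and the sample addresses are assumptions made for the example and do not come from the application.

```python
# Hypothetical record kept at index-writing time: mail address -> index number.
ADDRESS_TO_INDEX = {
    "alice@example.com": 0,
    "bob@example.com": 1,
    "carol@example.com": 0,
}

# Hypothetical index contents: index number -> {mail address: [entries]}.
# Entries may be mail text or link addresses of non-text files.
INDEXES = {
    0: {
        "alice@example.com": ["mail text A1", "http://example.com/a.png"],
        "carol@example.com": ["mail text C1"],
    },
    1: {"bob@example.com": ["mail text B1"]},
}


def extract_keyword(search_instruction: dict) -> str:
    """Pull the keyword (here, a mail address) out of the search instruction."""
    return search_instruction["keyword"].strip().lower()


def search(search_instruction: dict) -> list:
    """Determine the index from the keyword, then look only inside that index."""
    keyword = extract_keyword(search_instruction)
    index_number = ADDRESS_TO_INDEX.get(keyword)
    if index_number is None:
        return []  # no index was ever written for this address
    return INDEXES[index_number].get(keyword, [])


print(search({"keyword": "Bob@example.com"}))  # ['mail text B1']
```

Only one index is opened per search, which is the property the embodiment relies on for search efficiency.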
The amount of data in the data storage device 14 may keep growing; for example, the amount of user mail stored on the network side may increase continuously over time. In this embodiment, so that search results can be provided to the user as quickly as possible, the capacity of a single index is limited to a certain range, neither too large nor too small. If a single index is too large, searching further within it takes a long time; if a single index is too small, a large number of indexes will inevitably be formed, and because opening an index is time-consuming, the number of indexes should not be too large. An upper limit on the capacity of a single index can therefore be set. According to the data characteristics of the specific application scenario and the performance of the search server 11, a portion of the data in the data storage device 14 is taken as the data source for the index storage device 12. The capacity of this data source is fixed; dividing that capacity by the upper limit on single-index capacity and rounding up gives a positive integer. This integer is the number of indexes, each with a capacity close to the given upper limit, obtained after the data in the data source has been organized. For the other data in the data storage device 14, the number of indexes can be determined in the same way.

A search service can be implemented according to the flow shown in FIG. 2, which includes the following steps:

Step 21: Divide the data in the data source.

Step 22: Establish a correspondence between each category of data obtained by the division and a preset index.

Step 23: The index writing device writes the data into the indexes.

After steps 21 to 23 are completed, the indexes have been formed, and they contain information about the data in the data source.
At this point, information can be provided to the user according to the content of the indexes. When a search instruction is received from a user, the index is determined according to the received search instruction, and data is output to the user according to the determined index. The output data may be text from the data source, and may also include the Internet addresses of other data in the data source, such as the address of an image or the address to which an audio or video control is linked.

In step 21, the data in the data source should be divided according to a definite rule, which can be derived from the preset data classification manner. The classification manner is determined by the nature of the data in the data source. For example, if the data source contains the mail of a number of users, the data can be classified by mail address, so the division is made by mail address.

The indexes in step 22 are preset. Specifically, the maximum capacity of a single index is set, and the number of preset indexes is then determined from the capacity of the data source and the maximum capacity of a single index. On the one hand, the larger an index, the longer it takes to find data in it; on the other hand, opening each index takes a relatively long time, so the number of indexes should be kept small, which makes each index larger. Balancing these two considerations, the maximum capacity of a single index can be determined according to the configured processing capability of the system. In some implementations, the maximum capacity of a single index is 1 GB. The method of establishing the correspondence in step 22 is described below.

Let the total capacity of the data in the data source be T and the maximum capacity of a single index be M. The number of indexes is then T ÷ M, rounded up.
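
The rounding-up calculation above can be sketched in a few lines. The function name and the choice of 1024³ bytes for the 1 GB limit are assumptions made for illustration; the application does not say how the limit is expressed.

```python
def index_count(total_bytes: int, max_index_bytes: int) -> int:
    """Return T / M rounded up, using integer arithmetic only."""
    if total_bytes < 0 or max_index_bytes <= 0:
        raise ValueError("capacities must be non-negative and the limit positive")
    # -(-a // b) is ceiling division for non-negative a and positive b.
    return -(-total_bytes // max_index_bytes)


GIB = 1024 ** 3  # assumed size of the 1 GB single-index limit

print(index_count(10 * GIB, GIB))      # 10
print(index_count(10 * GIB + 1, GIB))  # 11
```

Integer ceiling division avoids the rounding error that floating-point division could introduce for very large data sources.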
Once the number of indexes has been determined, the indexes are numbered consecutively with positive integers; the number can be stored in the index file or in the index file name. Each category of data obtained in step 21 is assigned a unique integer value, that is, the integer value assigned to each category differs from those assigned to all other categories. In this embodiment, the unique integer value is derived from a field that uniquely corresponds to each category of data. Such a field can likewise be derived from the classification manner; a mail address, for example, is unique. Taking mail addresses as an example, the mail of each user is extracted, and the mail address field of each user is mapped to a number by the same rule. The ASCII codes of the characters in the field can be used for this mapping, and a hash algorithm (or another hashing algorithm) is then applied to the resulting number to obtain an integer value. If the unique integer value computed for the current category of data is H and the number of indexes is N, then H % N is computed, and the index whose number equals the result is taken as the index corresponding to the current category of data. H % N denotes the remainder of H divided by N. Under this algorithm, multiple categories of data can correspond to one index. This method of assigning a unique integer value to each category is only an example; other methods can be used, provided that each category of data corresponds to exactly one index when the correspondence between data and indexes is established. After the correspondence between a category of data and an index has been established, a record is kept of which categories of data correspond to that index, and a check can be made as to whether the total amount of data of all the categories corresponding to that index exceeds the set maximum capacity of a single index.
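
The mapping from a category's unique field to an index number, as described above, might look like the following sketch. Several details are assumptions: the application does not name a hash algorithm, so SHA-256 is used here purely as a stand-in; the function names are hypothetical; and indexes are numbered from 0 in this sketch so that the remainder H % N can be used directly as the index number.

```python
import hashlib


def category_value(mail_address: str) -> int:
    """Map a mail address to an integer H via its ASCII codes and a hash."""
    ascii_bytes = mail_address.encode("ascii")      # ASCII code of each character
    digest = hashlib.sha256(ascii_bytes).digest()   # assumed hash; not named in the source
    return int.from_bytes(digest, "big")


def index_number(mail_address: str, index_total: int) -> int:
    """Return H % N, the number of the index this category corresponds to."""
    if index_total <= 0:
        raise ValueError("index_total must be positive")
    return category_value(mail_address) % index_total


for address in ("alice@example.com", "bob@example.com"):
    print(address, "->", index_number(address, 4))
```

Because the mapping depends only on the address, every record of one category lands in the same index, and several categories may share an index. A hash cannot strictly guarantee distinct integers for distinct addresses, although with a long digest a collision is extremely unlikely.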
If the maximum capacity is exceeded, the part of that category's data that exceeds the capacity of the index can be made to correspond to the next index, and the same check can then be applied to the next index in index-number order. For the excess data, a record should be kept that it corresponds to the next index. In this way, when a category of data is provided to a user, at most two indexes are searched, which helps improve search speed. In various implementations, the amount of data involved in each search is far smaller than the capacity of a single index. For example, if one index holds the mail index of everyone in a department, each person's mail index is much smaller than that index. An adjustment can therefore be made when writing the index: the mail data of one mail address is written to a single index, so that the index of all of a person's mail is held in the same index and each person searching their own mail searches only one index.

In step 23, to avoid the sharing conflict in index writing, each category of data can be allocated to only one index writing device, and the index writing device writes to the indexes according to the data allocated to it. The index writing device can be a computer with a write function, such as the index server 13 in FIG. 1. To ensure that each category of data is allocated to only one index writing device, a method similar to that of step 22 can be used: the index writing devices are numbered consecutively with positive integers, and for each index, the index number is divided by the number of index writing devices and the remainder is taken; the data corresponding to that index is allocated to the index writing device whose number equals the remainder. Other methods of allocating data to index writing devices can also be used, but each category of data should be allocated to only one index writing device.
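
The remainder rule for allocating indexes to index writing devices can be sketched as below. The function names are hypothetical, and the writing devices are numbered from 0 here (an assumption) so that every possible remainder names a device.

```python
def writer_for_index(index_number: int, writer_total: int) -> int:
    """Return the number of the writing device responsible for an index."""
    if writer_total <= 0:
        raise ValueError("writer_total must be positive")
    return index_number % writer_total


def allocate(index_numbers, writer_total: int) -> dict:
    """Group index numbers by the writing device that owns them."""
    allocation = {writer: [] for writer in range(writer_total)}
    for number in index_numbers:
        allocation[writer_for_index(number, writer_total)].append(number)
    return allocation


print(allocate(range(1, 8), 3))
# {0: [3, 6], 1: [1, 4, 7], 2: [2, 5]}
```

Each index has exactly one owner, so no two writing devices ever compete for the same index file.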
When an index writing device writes data to an index, the data can first be exported from the data source, with the text in the data and the link addresses of files in other formats saved in text files. According to the correspondence between categories of data and indexes, data corresponding to the same index is saved in the same file, and the identifier of the index writing device and the identifier of the index corresponding to the file are recorded in the file or in the file name. The index numbers assigned in step 22 can be used as index identifiers, and the index writing device numbers assigned in step 23 can be used as index writing device identifiers. These files can be stored on the same storage device as the data of the data source. Next, the list of exported data files is recorded in a file status table, which records the name of each file and the status of the file. The file status indicates whether or not the data in the file has been written to the index; the corresponding status values can be "processed" and "unprocessed". The file name can use the following format: {data prefix}_yyyy_mm_dd_hh_MM_ss_k.txt. Some description of the data can be placed in the data prefix; yyyy, mm, dd, hh, MM and ss denote the year, month, day, hour, minute and second at which the data was exported, and k is the number of the index writing device. A copy of the file status table can be kept on each index writing device. The index writing device queries the file status table, reads the files that match its own number from the storage device holding the files, and writes the data in each file to the corresponding index according to the index identifier recorded in the file or file name. The index writing device can read the files in chronological order. After the data has been written to the index, the status of the file can be recorded as "processed" in the file status table.
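
The file naming format and the file status table described above can be sketched as follows. The function names and the use of a plain dictionary as the status table are assumptions for illustration; the sketch also assumes the index identifier is recorded inside the file, since the name format shown carries only the writing device number.

```python
from datetime import datetime

SUFFIX = ".txt"


def export_file_name(prefix: str, exported_at: datetime, writer_number: int) -> str:
    """Build {data prefix}_yyyy_mm_dd_hh_MM_ss_k.txt for one exported file."""
    stamp = exported_at.strftime("%Y_%m_%d_%H_%M_%S")
    return f"{prefix}_{stamp}_{writer_number}{SUFFIX}"


def name_fields(file_name: str) -> list:
    """Split a name into [prefix, yyyy, mm, dd, hh, MM, ss, k]."""
    # Splitting from the right keeps any underscores inside the prefix intact.
    return file_name[: -len(SUFFIX)].rsplit("_", 7)


def pending_files(status_table: dict, writer_number: int) -> list:
    """Unprocessed files for one writing device, oldest export first."""
    mine = [
        name
        for name, status in status_table.items()
        if status == "unprocessed" and int(name_fields(name)[7]) == writer_number
    ]
    # The timestamp fields are zero-padded, so comparing them as strings
    # gives chronological order.
    return sorted(mine, key=lambda name: name_fields(name)[1:7])


status_table = {
    export_file_name("mail", datetime(2010, 3, 8, 9, 30, 0), 2): "unprocessed",
    export_file_name("mail", datetime(2010, 3, 8, 9, 0, 0), 2): "unprocessed",
    export_file_name("mail", datetime(2010, 3, 8, 8, 0, 0), 1): "unprocessed",
}

for name in pending_files(status_table, 2):
    print(name)                       # the 09:00 file first, then the 09:30 file
    status_table[name] = "processed"  # mark after writing to the index
```

After the loop, the two files belonging to writing device 2 are marked "processed" and the file for writing device 1 is still "unprocessed".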

Based on the method of the embodiments of the present application, the apparatus of the embodiments is described below. The following description covers a number of functional modules, which together constitute a system for implementing a search service. In an implementation, each functional module may reside in a separate device, or several functional modules may be parts of the same device. The method of the embodiments can be implemented in software, hardware, or a combination of the two, and the software can be stored on an optical disc, a semiconductor storage device, or another type of storage device.

As shown in FIG. 3, the search service system 30 implements a search service and includes a partitioning module 31, index writing modules 32 and an index storage module 33. There may be one or more index writing modules 32; the figure shows the case of several. The partitioning module 31 is configured to divide the data in the data source according to the preset data classification manner, and to store the correspondence between each category of data obtained by the division and the preset indexes under the rule that each category of data corresponds to only one index. The index writing module 32 has all the functions of the index writing device described above; specifically, it writes each category of data into the index corresponding to that category according to the correspondence stored by the partitioning module 31. The index storage module 33 is configured to store the indexes.

The search service system 30 may further include an allocation module, configured to allocate data to the index writing modules under the rule that each category of data is allocated to only one index writing module, so that the index writing module 32 can further write the data of each category into the index corresponding to that category according to the correspondence stored by the partitioning module 31. The search service system 30 may also include a source data storage module, configured to store the data of the data source. The search service system 30 may further include a search module, configured to receive a search instruction, determine an index in the index storage module 33 according to the received search instruction, and output data according to the determined index. The search instruction generally comes from a terminal device operated by a user.

One possible structure of the partitioning module 31 includes an index count determining unit, an index numbering unit, a dividing unit, a feature value allocation unit and an index correspondence unit. The index count determining unit determines the number of preset indexes from the capacity of the data source and the preset capacity of a single index. The index numbering unit numbers the preset indexes consecutively with positive integers. The dividing unit divides the data in the data source according to the preset data classification manner. The feature value allocation unit assigns a unique integer value to each category of data obtained by the dividing unit. The index correspondence unit, for each category of data obtained by the dividing unit, divides the integer value assigned to that category by the number of preset indexes, takes the remainder, and establishes a correspondence between that category and the index whose number equals the remainder.
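
To show how the units of the partitioning module fit together, the structure above can be sketched as one small class. This is a hedged illustration, not the application's implementation: the class and method names are hypothetical, SHA-256 stands in for the unnamed hash algorithm, records are assumed to be (mail address, body) pairs, and indexes are numbered from 0 so that a remainder is always a valid index number.

```python
import hashlib
from collections import defaultdict


class PartitioningModule:
    """Illustrative stand-in for partitioning module 31."""

    def __init__(self, source_bytes: int, max_index_bytes: int):
        # Index count determining unit: capacity divided by the limit, rounded up.
        self.index_total = max(1, -(-source_bytes // max_index_bytes))
        # Index numbering unit: consecutive numbers (from 0 in this sketch).
        self.index_numbers = list(range(self.index_total))
        # Stored correspondence: category key -> index number.
        self.correspondence = {}

    def divide(self, records) -> dict:
        """Dividing unit: group (mail address, body) records by address."""
        categories = defaultdict(list)
        for address, body in records:
            categories[address].append(body)
        return dict(categories)

    def feature_value(self, key: str) -> int:
        """Feature value allocation unit: an integer derived from the key."""
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")

    def correspond(self, categories: dict) -> dict:
        """Index correspondence unit: feature value modulo the index count."""
        for key in categories:
            self.correspondence[key] = self.feature_value(key) % self.index_total
        return self.correspondence


module = PartitioningModule(source_bytes=5, max_index_bytes=2)  # 3 indexes
categories = module.divide(
    [("a@example.com", "m1"), ("b@example.com", "m2"), ("a@example.com", "m3")]
)
print(module.correspond(categories))
```

The stored correspondence is what the index writing modules and the search module would later consult.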
If the partitioning module 31 adopts the above structure, one structure of the allocation module may include a device numbering unit and a data allocation unit. The device numbering unit is configured to number the index writing modules 32 consecutively with positive integers. The data allocation unit is configured, for each index, to divide the number of that index by the number of index writing modules 32, take the remainder, and allocate the data corresponding to that index to the index writing module 32 whose number is the remainder. The index corresponding unit in the partitioning module 31 can further be used to record the number of the index corresponding to each category of data. In this case, one structure of the data allocation unit may include an index determining subunit and a writing subunit: the index determining subunit determines the index corresponding to each allocated category of data from the index number recorded by the index corresponding unit, and the writing subunit writes each allocated category of data into the index determined by the index determining subunit.

According to the technical solution of the embodiments of the present application, the data in the data source is divided, each category of data obtained from the division is made to correspond to an index, and data is written into the indexes according to this correspondence. This avoids the sharing conflict in index writing, and because information about a given category of data is obtained by searching only the index corresponding to that category, search efficiency is higher.
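The allocation of indexes to index writing modules follows the same remainder rule, and a short Python sketch makes the conflict-free property visible. As before, mapping a remainder of 0 to the highest-numbered writer is an assumption made here to keep the numbering positive, and the function names are hypothetical.

```python
def writer_for_index(index_number, n_writers):
    """Map an index number to a writer number in 1..n_writers."""
    remainder = index_number % n_writers
    # Assumption: remainder 0 is assigned to writer number n_writers.
    return remainder if remainder != 0 else n_writers


def assign_indexes_to_writers(n_indexes, n_writers):
    """Return {writer number: [index numbers that writer is responsible for]}."""
    assignment = {w: [] for w in range(1, n_writers + 1)}
    for index_number in range(1, n_indexes + 1):
        assignment[writer_for_index(index_number, n_writers)].append(index_number)
    return assignment
```

With 5 indexes and 2 writers, writer 1 handles indexes 1, 3, and 5 and writer 2 handles indexes 2 and 4. Each index appears under exactly one writer, so no two writers compete to write to the same index.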
In addition, in the embodiments of the present application, the capacity of the index is chosen appropriately so that it is neither too large nor too small. This avoids the inefficiency of searching within an excessively large index, and also avoids opening too many small indexes when the requested data covers a wide range, which helps improve search efficiency and the quality of the search service.

For convenience of description, the parts of the device described above are divided by function into modules or units. Of course, when implementing the present application, the functions of the modules or units can be implemented in one or more pieces of software or hardware.

It is apparent that those skilled in the art can make various modifications and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic structural diagram of a network to which the technical solution of the embodiment of the present application is applied;
FIG. 2 is a flowchart of the method in the embodiment of the present application;
FIG. 3 is a schematic structural diagram of the device in the embodiment of the present application.

[Description of main component symbols]
10: Terminal device
11: Search server
12: Index storage device
13: Index server
14: Data storage device
30: Search service system
31: Partitioning module
32: Index writing module
33: Index storage module

Claims (1)

VII. Scope of the Patent Application

1. A method for implementing a search service, characterized by comprising the steps of:
dividing data in a data source according to a preset data classification manner, and establishing a correspondence between each category of data obtained from the division and a preset index;
for each category of data, writing the data into the index corresponding to that category according to the correspondence; and
when a search instruction is received, determining an index according to the received search instruction, and outputting data according to the determined index.

2. The method of claim 1, wherein the step of writing the data into the index corresponding to that category according to the correspondence comprises:
allocating each category of data to one index writing device; and
the index writing device writing, for each category of data, the data into the index corresponding to that category according to the correspondence.

3. The method of claim 1 or 2, wherein, among the preset indexes, the maximum capacity of a single index is preset, and the number of preset indexes is determined according to the capacity of the data source and the preset maximum capacity of a single index.

4. The method of claim 3, wherein the step of establishing a correspondence between each category of data obtained from the division and a preset index comprises the steps of:
numbering the preset indexes consecutively with positive integers; and
for each category of data obtained from the division, assigning a unique integer value to that category, dividing the integer value by the number of preset indexes and taking the remainder, and establishing a correspondence between that category of data and the index whose number is the remainder.

5. The method of claim 4, wherein the step of allocating the category of data to an index writing device comprises the steps of:
numbering the index writing devices consecutively with positive integers; and
for each index, dividing the number of the index by the number of index writing devices and taking the remainder, and allocating the data corresponding to that index to the index writing device whose number is the remainder.

6. The method of claim 5, wherein, after the step of allocating the data corresponding to that index to the index writing device whose number is the remainder, the method further comprises:
adding the number of that index to the data corresponding to that index;
obtaining the index number from each category of data;
determining the index according to the obtained number; and
for each category of data, writing the data into the determined index.

7. A system for implementing a search service, characterized by comprising:
a partitioning module, configured to divide data in a data source according to a preset data classification manner, and to save the correspondence between each category of data obtained from the division and the preset indexes;
one or more index writing modules provided in the system, configured to write, for each category of data, the data into the index corresponding to that category according to the correspondence; and
an index storage module, configured to store the indexes.

8. The system of claim 7, further comprising an allocation module, configured to allocate the data to the index writing modules under the rule that each category of data is allocated to one index writing module;
wherein the index writing module is further configured to write, for each category of data, the data into the index corresponding to that category according to the correspondence.

9. The system of claim 7 or 8, further comprising:
a source data storage module, configured to store the data in the data source; and
a search module, configured to receive a search instruction, determine an index in the index storage module according to the received search instruction, and output data according to the determined index.

10. The system of claim 7 or 8, wherein the partitioning module comprises:
an index number determining unit, configured to determine the number of preset indexes according to the capacity of the data source and the preset capacity of a single index;
an index numbering unit, configured to number the preset indexes consecutively with positive integers;
a dividing unit, configured to divide the data in the data source according to the preset data classification manner;
a feature value allocating unit, configured to assign a unique integer value to each category of data obtained by the dividing unit; and
an index corresponding unit, configured, for each category of data obtained by the dividing unit, to divide the integer value assigned to that category by the number of preset indexes and take the remainder, and to establish a correspondence between that category of data and the index whose number is the remainder.

11. The system of claim 10, wherein the allocation module comprises:
a device numbering unit, configured to number the index writing modules consecutively with positive integers; and
a data allocation unit, configured, for each index, to divide the number of that index by the number of index writing modules and take the remainder, and to allocate the data corresponding to that index to the index writing module whose number is the remainder.

12. The system of claim 11, wherein the index corresponding unit is further configured to record the number of the index corresponding to each category of data;
and the data allocation unit comprises:
an index determining subunit, configured to determine the index corresponding to each allocated category of data according to the index number recorded by the index corresponding unit for that category; and
a writing subunit, configured to write each allocated category of data into the index determined by the index determining subunit.
TW99106626A 2010-03-08 2010-03-08 Method and system for implementing a search service TW201131390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99106626A TW201131390A (en) 2010-03-08 2010-03-08 Method and system for implementing a search service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW99106626A TW201131390A (en) 2010-03-08 2010-03-08 Method and system for implementing a search service

Publications (1)

Publication Number Publication Date
TW201131390A true TW201131390A (en) 2011-09-16

Family

ID=50180353

Family Applications (1)

Application Number Title Priority Date Filing Date
TW99106626A TW201131390A (en) 2010-03-08 2010-03-08 Method and system for implementing a search service

Country Status (1)

Country Link
TW (1) TW201131390A (en)
