JP5657498B2 - File search system

Info

Publication number: JP5657498B2
Application number: JP2011217881A
Authority: JP
Inventors: 晃治中山; 康裕桐畑; 晋平西田
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2011-09-30
Filing date: 2011-09-30
Publication date: 2015-01-21
Anticipated expiration: 2031-09-30
Also published as: JP2013077233A

Description

The present invention relates to a technique for efficiently generating, updating, and managing search indexes for large collections of files.

With the recent diversification of applications and falling storage costs, the amount of data held in storage has grown explosively. The volume of document data handled within companies has become enormous as a result, and search systems that make effective use of this data are increasingly important.

When the number of documents to be searched is enormous, search performance is usually improved by generating a search index (index data) in advance. Other commonly used methods include installing the same search index on multiple search servers to distribute the load, and dividing the search index across multiple search servers to distribute the search processing.

Against this technical background, various techniques have also been proposed for generating search indexes. For example, Patent Document 1 discloses a method for reducing, as far as possible, the imbalance in size among divided search indexes.

JP 2011-70257 A

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf

Given the current IT landscape, however, the amount of data to be searched is expected to keep growing. The numbers of search servers and split indexes can also be expected to become very large. The inventors therefore consider that a mechanism capable of generating split indexes at high speed will be needed.

To address the high-speed generation of split indexes among the problems described above, the inventors propose a file search system having: a first processing function unit that generates a list of the file paths for which the split index assigned to each search server is to be generated; a second processing function unit that generates split indexes based on the list; a third processing function unit that places the generated split indexes on the search servers; and a fourth processing function unit that runs the processing of the first to third processing function units as a pipeline.

According to the present invention, split indexes can be generated at high speed. Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments.

A diagram showing the conceptual configuration of the search system according to the embodiment.
A diagram showing an example functional configuration of the search server.
A diagram showing an example functional configuration of the distributed processing server.
A diagram showing an example functional configuration of the management server.
A diagram showing an example data structure of the index ID table.
A diagram showing an example data structure of the search server management table.
A diagram showing an example data structure of the file management table.
A diagram showing the initialization flow of the system.
A diagram showing the initialization flow of the index ID table.
A diagram explaining an example of the index ID table after initialization.
A diagram showing the flow by which the scanner module generates an index list.
A diagram showing the flow by which the index generation module generates split indexes.
A diagram showing the flow for placing split indexes on the search servers.
A diagram showing the processing flow when a search server is added.
A diagram showing the processing flow when a search server is deleted.

The following embodiment is divided into several sections that describe the processing functions needed to realize the search system according to the embodiment. Where the embodiment refers to a number of elements (including counts, numerical values, quantities, and ranges), the invention is not limited to that specific number, which may be exceeded or undercut, except where explicitly stated or where the number is clearly limited in principle. Likewise, the constituent elements of the embodiment (including element steps) are not necessarily indispensable, except where explicitly stated or where they are clearly essential in principle.

In the following embodiment, each configuration, function, processing unit, processing means, and the like may be realized in part or in whole as hardware, for example as an integrated circuit. Each configuration and function may also be realized as software, with a processor interpreting and executing a program that implements the function. Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a storage medium such as an IC card, an SD card, or a DVD.

The control lines and information lines shown are those considered necessary for the explanation and do not represent every control line and information line required in a product. In practice, almost all components may be considered to be connected to one another.

Embodiments of the present invention are described in detail below with reference to the drawings. Throughout the drawings, members having the same function are given the same or related reference numerals, and their description is not repeated. In the following embodiment, the description of identical or similar parts is in principle not repeated unless particularly necessary.

[Overall configuration of the search system]
FIG. 1 shows a configuration example of the search system according to this embodiment. The search system consists of search clients 100, search servers 101, file servers 102, distributed processing servers 103, and a management server 104, connected to one another through a network 105. The network 105 can be realized with a commonly known network such as a local area network (LAN) or a wide area network (WAN), and may be wired or wireless. The search system need not be built within a single region or country; it may span multiple regions or countries.

[Search client configuration]
The search client 100 is a computer on which an environment capable of running a Web browser is installed. It is not limited to a desktop machine and includes portable computers, personal digital assistants, and mobile phone terminals. The search client 100 has a function for sending a search query to a search server 101 using HTTP (Hypertext Transfer Protocol) or the like, a function for obtaining search results from the search server 101, and a function for displaying the obtained results to the user. Multiple search clients 100 exist in the search system.

[Search server configuration]
FIG. 2 shows an example of the internal configuration of the search server 101. The search server 101 receives a search query from a search client 100, executes the search, and returns the search results. Multiple search servers 101 exist in the search system, each with its own local storage 201. The local storage 201 holds the split indexes 202 used for searching, which are generated from the files stored on the file servers 102.

An index management module 203 and a search module 204 are installed on the search server 101. The index management module 203 is a program for managing and updating the split indexes 202. The search module 204 is a program that executes searches using the split indexes. Both modules are installed on every search server 101.

A split index 202 is a search index generated by the index generation module 403 on the management server 104 and the distributed processing servers 103 from the files stored on the file servers 102. As described later, the split indexes 202 are produced by dividing the index per index ID according to the consistent hash method. Each index ID is associated with a split index 202, and this association determines the search server 101 on which the split index 202 is placed. The number of split indexes 202 placed on each search server 101 (the number of index divisions) is decided in advance by the administrator.

The index management module 203 places and manages the split indexes 202 on the search server 101. When a split index 202 is newly generated, the index management module 203 downloads it and saves it to the local storage 201 of the search server 101.

When a split index 202 already exists on the search server 101 and is to be updated, the index management module 203 merges the newly generated split index into the existing split index 202 to produce the latest split index.

When adding a search server 101 increases the number of split indexes held by the system as a whole, the index management module 203 on each search server 101 can further divide the existing split indexes 202 stored on that server. The index management module 203 of the newly added search server 101 can then collect the pieces newly divided off on the other search servers 101 and combine them into a single split index 202.

The index management module 203 on a search server 101 that is to be removed can divide the split indexes 202 held on that server again according to index ID and distribute the pieces to the split indexes 202 of the other search servers 101.

The search module 204 is a search engine that uses the split indexes 202 placed on the search server 101 to produce results for a search query received from a search client 100 and returns those results to the search client 100. The search module 204 can also cooperate with the search modules 204 installed on the other search servers to execute a search in a distributed manner.

[File server configuration]
The file server 102 stores the large volume of document data created within a company or similar organization. Multiple file servers 102 exist in the search system. Each file server 102 is connected to the distributed processing servers 103 and the management server 104 through a protocol such as NFS (Network File System) or CIFS (Common Internet File System). The modules on the distributed processing servers 103 and the management server 104 can therefore access the files on the file servers 102 and obtain file information.

[Distributed processing server configuration]
FIG. 3 shows an example of the internal configuration of the distributed processing server 103. Multiple distributed processing servers 103 exist in the search system. Together they form a server group that processes a single processing instruction in a distributed manner by cooperating with one another.

A distributed file system 302 and a distributed processing module 303 are installed on the distributed processing server 103, which is also provided with local storage 301. The distributed file system 302 is a module that uses the local storage 301 to make a single common file system available to all distributed processing servers 103. The distributed processing module 303 is a module that, on receiving an instruction from the index generation module 403 of the management server 104, cooperates with the other distributed processing servers 103 to generate split indexes 202 in a distributed manner.

[Management server configuration]
FIG. 4 shows an example of the internal configuration of the management server 104. The management server 104 manages the servers that make up the search system, such as the search servers 101, file servers 102, and distributed processing servers 103. The local storage 401 of the management server 104 holds the components that control split index generation: a scanner module 402, an index generation module 403, a pipeline control module 404, a system management module 405, an index ID table 406, a search server management table 407, and a file management table 408. These modules may reside somewhere other than the management server 104. For example, all or some of them may run directly on the distributed processing servers 103.

The scanner module 402 has two functions. It scans the files and directories on the file servers 102 to obtain a list of file and folder path names together with their attribute information. It then determines whether each file or folder is newly created, updated, or deleted, and generates an index list describing the file paths to be indexed.

The functions of the scanner module 402 can be realized through the following processing. First, a list of file and directory paths on the file server 102 and their attribute information is obtained, for example with the Linux find command. Next, a hash value is calculated from the obtained file attribute information. The calculated hash value is then compared with the hash value 702 (FIG. 7) of the file attribute information stored earlier in the file management table 408 (described later), and whether the file is to be indexed is decided by whether the two values match.

If the hash values are the same, the scanner module 402 determines that the file or directory has not been updated and excludes it from indexing. If the hash values differ, the scanner module 402 determines that the file or directory has been updated and marks it for indexing.

If a file or folder path 701 exists in the file management table 408 but is not returned by the find command, the scanner module 402 writes an entry to the index list indicating that the file at that path has been deleted.
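The comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the status labels "new", "updated", and "deleted", the attribute keys, and the use of MD5 for the attribute hash are all assumptions made for the example.

```python
import hashlib


def attr_hash(attrs: dict) -> str:
    """Hash a file's attribute information into a hex digest (MD5 is an assumption)."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def classify(scanned: dict, managed: dict) -> dict:
    """Return {path: status} for every path that needs index processing.

    scanned: {path: attrs} obtained by the current scan.
    managed: {path: stored_hash} as held in the file management table.
    """
    statuses = {}
    for path, attrs in scanned.items():
        if path not in managed:
            statuses[path] = "new"
        elif managed[path] != attr_hash(attrs):
            statuses[path] = "updated"
        # Matching hashes mean no update, so the path is left out.
    for path in managed:
        if path not in scanned:
            statuses[path] = "deleted"
    return statuses
```

Paths whose stored and recomputed hashes match never appear in the result, which corresponds to excluding unchanged files from indexing.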

The index list is a text file describing the file paths to be index-processed and their processing status. It is a temporary file that the scanner module 402 generates by extracting file paths and processing statuses from the file management table 408, and it is used by the index generation module 403 described later.

Rather than scanning each file server 102 from the root of its file system to the deepest level in one pass, the scanner module 402 outputs an index list for every folder level, or for every arbitrary number of folder levels, and runs as a pipeline with the index generation module 403 and the index management module 203. The index generation module 403 and the index management module 203 can therefore start generating and updating indexes before the scanner module 402 has finished scanning the file server 102, which speeds up index generation.
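One way to picture this overlap is a three-stage pipeline connected by queues, where each folder's index list is handed on as soon as it is ready. The sketch below assumes Python threads and `queue.Queue`; the names `stage`, `run_pipeline`, `scan`, `generate`, and `place` are hypothetical stand-ins for the scanner, index generation, and index management roles, not names from the source.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(worker, inbox, outbox):
    """Apply worker to each item from inbox and forward results until _DONE arrives."""
    while True:
        item = inbox.get()
        if item is _DONE:
            if outbox is not None:
                outbox.put(_DONE)
            return
        result = worker(item)
        if outbox is not None:
            outbox.put(result)


def run_pipeline(folders, scan, generate, place):
    """Feed folders one at a time through scan -> generate -> place."""
    q_scan, q_gen, q_place = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(scan, q_scan, q_gen)),
        threading.Thread(target=stage, args=(generate, q_gen, q_place)),
        threading.Thread(target=stage, args=(place, q_place, None)),
    ]
    for thread in threads:
        thread.start()
    for folder in folders:
        q_scan.put(folder)  # later stages start while scanning continues
    q_scan.put(_DONE)
    for thread in threads:
        thread.join()
```

Because each stage consumes items as they arrive, generation for the first folder can begin while later folders are still being scanned.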

The index generation module 403 causes the distributed processing servers 103 to generate indexes in a distributed manner based on the index list output by the scanner module 402. Using the consistent hash method, it calculates a hash value for each file path and derives the corresponding index ID from that hash value. It then generates a split index for each index ID.

The processing of the index generation module 403 is divided into units called tasks and distributed across the distributed processing servers 103. On the distributed processing servers 103, each task is executed as a first distributed process followed by a second distributed process. This processing can also be realized with MapReduce, a well-known technique for large-scale distributed processing, in which case the first distributed process is implemented as the Map step and the second as the Reduce step. The detailed operation is described later.
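The source defers the details of the two distributed processes, so the following is only one plausible division of labour, shown in a single process for clarity: the Map step tags each file path with its index ID, and the Reduce step groups the paths per index ID so that one split index can be built from each group. The function names and the `index_id_of` callback are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(file_paths, index_id_of):
    """First step (assumed): emit (index ID, file path) pairs."""
    return [(index_id_of(path), path) for path in file_paths]


def reduce_phase(pairs):
    """Second step (assumed): collect the paths belonging to each index ID."""
    groups = defaultdict(list)
    for index_id, path in pairs:
        groups[index_id].append(path)
    return dict(groups)
```

In a real MapReduce framework the grouping by key is performed by the shuffle between the two steps; here it is folded into `reduce_phase` to keep the sketch self-contained.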

The pipeline control module 404 controls the concurrent execution of the scanner module 402, the index generation module 403, and the index management module 203 in order to speed up index generation. The detailed pipeline control of each module is described later.

The system management module 405 manages the servers in the search system and initializes the various tables. It also provides a user interface through which the administrator enters the parameters for system initialization.

FIG. 5 shows an example of the index ID table 406. The index ID table 406 stores virtual index IDs 501 and index IDs 502 and is used to obtain an index ID from a file path. It serves as the means of implementing the consistent hash method.

The consistent hash method is explained below. The hash value of each index ID is computed and placed on a circle graduated with the integers 0 to 2^128-1, which divides the circle into ranges. (The value 2^128 follows from the MD5 hash; MD5 is only an example, and any hash algorithm may be used.) Obtaining the hash value of an index ID means applying a hash function such as MD5 to the index ID treated as a character string.

To obtain an index ID from a file path, the same hash function (MD5 in this example) is applied to the file path and the resulting hash value is placed on the circle. Moving counterclockwise from that position, the index ID corresponding to the first hash value encountered becomes the index ID associated with the file path. This is the basic concept of consistent hashing. With this simple form of the method, however, the number of files assigned to each index ID depends on the spacing of the points that divide the circle.
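The basic lookup can be sketched as below. The sketch assumes that moving counterclockwise means moving toward smaller values on the circle, wrapping around from 0 to the top of the range; the text does not fix that orientation, and the function names are hypothetical.

```python
import bisect
import hashlib


def md5_int(text: str) -> int:
    """Map a string to an integer in [0, 2**128 - 1] via MD5."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


def build_ring(index_ids):
    """Place each index ID on the circle at the hash of its string form."""
    return sorted((md5_int(str(index_id)), index_id) for index_id in index_ids)


def owner_of(ring, value: int):
    """Return the index ID at the nearest point at or below value, wrapping around."""
    points = [point for point, _ in ring]
    pos = bisect.bisect_right(points, value) - 1
    return ring[pos][1]  # pos == -1 selects the largest point (wrap-around)


def lookup(ring, file_path: str):
    """Return the index ID associated with a file path."""
    return owner_of(ring, md5_int(file_path))
```

A call such as `lookup(build_ring([0, 1, 2, 3]), "/share/report.txt")` always returns the same index ID for the same path, and adding or removing an index ID only changes the owner of the paths in the affected range.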

Consequently, if the circle is divided using only the hash values of the index IDs, adding or deleting an index ID skews the number of files assigned to each index ID. The index size then becomes uneven across the split indexes, which degrades search performance. The index sizes therefore need to be leveled.

Leveling requires shortening the spacing between the points on the circle that correspond to index IDs. For this purpose, virtual index IDs are generated; they correspond to the virtual nodes of the consistent hash method. A virtual index ID is a hash value associated with an index ID. Generating n virtual index IDs per index ID levels the sizes of the split indexes in the system. The generation and use of virtual index IDs are described later.

FIG. 6 shows an example of the search server management table 407. The table stores an index ID 601, the name 602 of the search server on which the split index associated with that index ID is placed, the path 603 where the split index is saved, and the path 604 where the deletion index list is saved. The deletion index list 604 is a temporary file generated by the index generation module 403. It is a text file listing, one per line, the file paths to be deleted from a split index 202 already placed on a search server 101.

FIG. 7 shows an example of the file management table 408. The table stores and manages the list of file and folder path names 701 on the file servers 102, together with their attribute information and the hash value 702 generated from that attribute information. The hash value 702 stored in this table is compared with the hash value generated from the file attribute information obtained when the scanner module 402 runs a scan, in order to check the update state (processing status) 703 of the file.

[Search server management table initialization flow]
FIG. 8 shows the initialization flow of the search server management table 407. The example assumes two search servers 101 with two split indexes 202 placed on each, so that the number of index divisions in the search system as a whole is 4 (= 2 × 2). The two search servers are assumed to be named "Search1" and "Search2".

First, to initialize the search server management table 407, the administrator sets the number of index divisions from the number of search servers 101 and the number of split indexes 202 to be placed on each search server 101 (S801).

As stated above, this example places two split indexes 202 on each of two search servers 101, so the total number of index divisions is 4. When this information is entered into the system management module 405, the module determines the index ID to allocate to each split index 202 (S802). In this specification, index IDs are ascending numbers starting from 0, so the system management module 405 allocates the index IDs "0", "1", "2", and "3" in that order.

Next, the system management module 405 associates each index ID with a search server 101 (S803) and stores the result in the search server management table 407 (S804). In this embodiment the system management module 405 makes the association automatically, but the administrator may also set it manually.

In this embodiment, for example, the search server management table 407 has four entries: "index ID = 0, destination search server name = Search1", "index ID = 1, destination search server name = Search1", "index ID = 2, destination search server name = Search2", and "index ID = 3, destination search server name = Search2". Immediately after initialization, the split index save path 603 and the deletion index list save path 604 are blank. This completes the initialization of the search server management table 407.
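The initialization steps S801 to S804 can be sketched as follows, under the assumption that index IDs are handed out to the servers in order, which matches the four entries listed above. The function name and the dictionary keys are hypothetical.

```python
def init_search_server_table(server_names, indexes_per_server):
    """Assign index IDs 0, 1, 2, ... to the servers in order, leaving path columns blank."""
    table = []
    index_id = 0
    for name in server_names:
        for _ in range(indexes_per_server):
            table.append({
                "index_id": index_id,
                "server": name,
                "index_path": None,        # split index save path, blank after initialization
                "delete_list_path": None,  # deletion index list save path, blank after initialization
            })
            index_id += 1
    return table
```

Calling `init_search_server_table(["Search1", "Search2"], 2)` yields index IDs 0 and 1 on Search1 and index IDs 2 and 3 on Search2.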

[Index ID table initialization flow]
FIG. 9 shows the initialization flow of the index ID table 406. The index ID table 406 is initialized at the same timing as the search server management table 407.

First, the administrator sets the number of index divisions based on the number of search servers 101 and the number of split indexes placed on each search server 101 (S901), and determines the index IDs (S902).

Here too, there are four index IDs: "0", "1", "2", and "3". The number of virtual index IDs is assumed to be two per index ID. The number of virtual index IDs is an arbitrary fixed value chosen so that the number of files ultimately associated with each index ID is leveled.

Next, the system management module 405 generates arbitrary virtual index IDs for each index ID (S903). For example, index ID "0" is given the virtual index IDs "0-0" and "0-1", index ID "1" is given "1-0" and "1-1", index ID "2" is given "2-0" and "2-1", and index ID "3" is given "3-0" and "3-1".

Subsequently, the system management module 405 obtains a hash value from the character string of each virtual index ID (S904). It then stores the obtained hash value in the virtual index ID 501 column of the index ID table 406, and stores the index ID to which that virtual index ID belongs in the index ID 502 column of the same entry (S905).
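Steps S903 to S905 can be sketched as below. The hash function is not specified in this description, so the use of MD5 truncated to 64 bits is purely an assumption, as are the function names.

```python
import hashlib


def ring_hash(text):
    # Assumption: MD5 truncated to 64 bits stands in for the unspecified hash.
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def build_index_id_table(index_ids, virtual_per_index=2):
    """Return rows of (hash of virtual index ID, index ID), sorted by hash."""
    rows = []
    for index_id in index_ids:
        for v in range(virtual_per_index):
            virtual_id = f"{index_id}-{v}"      # e.g. "0-0", "0-1"
            rows.append((ring_hash(virtual_id), index_id))
    rows.sort()
    return rows


index_id_table = build_index_id_table([0, 1, 2, 3])
for item_no, (point, index_id) in enumerate(index_id_table, start=1):
    print(item_no, point, index_id)
```

Sorting the rows by hash value corresponds to arranging the points around the consistent hash circle.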

FIG. 10 shows an example of the index ID table 406 after initialization. Given a file path, this table makes it possible to determine which index ID the file path is associated with. For example, suppose the hash value of the file path "/FileServer1/test.txt" is "29999999999". This hash value falls between the points of item number 3 and item number 4 and hits the point of the item number 3 entry (when turning to the left around the consistent hash circle). Since the index ID of item number 3 is "3", the index ID of the file path "/FileServer1/test.txt" is "3".

This table is an implementation of the consistent hashing method. Obtaining the index ID from the file path based on this table and generating a split index for each index ID levels out the size of each split index, or the number of files associated with it.
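The lookup described above can be sketched as a search for the nearest point at or below the file path's hash, wrapping around to the largest point when the hash is smaller than every point. The table values below are hypothetical and are not the values of FIG. 10; they are chosen only so that the hash 29999999999 falls between item numbers 3 and 4.

```python
from bisect import bisect_right


def lookup_index_id(table, file_hash):
    """table: list of (point_hash, index_id) sorted ascending by point_hash."""
    points = [h for h, _ in table]
    pos = bisect_right(points, file_hash) - 1   # nearest point at or below the hash
    return table[pos][1]                        # pos == -1 wraps to the largest point


# Hypothetical table contents (item numbers 1 to 4).
example_table = [
    (10000000000, 1),
    (20000000000, 0),
    (25000000000, 3),
    (30000000000, 2),
]

print(lookup_index_id(example_table, 29999999999))
```

Relying on Python's negative indexing for the wrap-around keeps the sketch short; a production implementation would likely make that case explicit.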

[Index list generation flow]
FIG. 11 shows the index list generation flow performed by the scanner module 402. First, the pipeline control module 404 instructs the scanner module 402 to start generating the index list for the first level of the folder tree (S1101). As described above, the index list need not be generated one level at a time; it may be generated for any number of levels at a time.

Next, the scanner module 402 accesses the file management table 408 and checks whether entries for files of the designated level exist (S1102). If entries exist for file paths of the designated level, the scanner module 402 sets their processing status column to "-1", which indicates deletion (S1103). If no such entries exist, the scanner module 402 skips S1103.

The scanner module 402 then executes the Find command with an option that specifies the level depth for the file search (S1104). On an actual Linux OS, this can be done by setting maxdepth = 1 on the Find command (for a depth of one level).

After obtaining the file and folder paths of the designated level together with their attribute information, the scanner module 402 computes a hash value from each item's attribute information (S1105).
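A minimal sketch of S1104 and S1105 follows. It uses Python's os.scandir instead of the Find command, and it assumes that the attribute information consists of the size and modification time and that SHA-1 is the hash; none of these choices is stated in this description.

```python
import hashlib
import os
import tempfile


def scan_one_level(folder):
    """Return {path: attribute_hash} for the direct children of folder."""
    result = {}
    with os.scandir(folder) as entries:
        for entry in entries:
            info = entry.stat()
            # Assumption: size and modification time are the hashed attributes.
            attributes = f"{info.st_size}:{info.st_mtime_ns}"
            result[entry.path] = hashlib.sha1(attributes.encode("ascii")).hexdigest()
    return result


# Usage: only the first level is listed; "deep.txt" in the subfolder is not.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "a.txt"), "w").close()
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "sub", "deep.txt"), "w").close()
    listing = scan_one_level(root)
    names = sorted(os.path.basename(p) for p in listing)
    print(names)
```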

Subsequently, the scanner module 402 uses each file path obtained by Find as a key and queries the file management table 408 for the presence of that file path (S1106).

If the file path does not exist in the file management table 408 (negative result in S1106), the file is newly created. In this case, the scanner module 402 creates a new entry in the file management table 408 keyed by the file path 701, stores the file hash 702, and sets the processing status 703 to "1", which indicates new creation (S1107).

If, on the other hand, the file path 701 exists in the file management table 408 (affirmative result in S1106), the file is already registered in the file management table 408. In this case, the scanner module 402 checks the hash value (S1108). Specifically, it retrieves the file hash 702 of the entry whose file path 701 matches from the file management table 408 and compares it with the hash value obtained from the results of the Find command.

If the hash values match (affirmative result in S1108), the file has not been updated. In this case, the scanner module 402 sets the processing status of the entry with the matching file path to "0" (S1109).

If the hash values do not match (negative result in S1108), the file has been updated. In this case, the scanner module 402 overwrites the file hash 702 with the new hash value and overwrites the processing status 703 with "2", which indicates a file update (S1110).

The above processing fixes the processing to apply to each file of the designated level ("0" = no processing, "1" = new index generation, "2" = index update, "-1" = deletion from the index).
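The status decisions of S1103 to S1110 can be sketched as below. The file management table is modeled as a dictionary restricted to the entries of the designated level; the function name and data layout are hypothetical.

```python
NO_CHANGE, NEW, UPDATED, DELETED = 0, 1, 2, -1


def update_statuses(file_table, scanned):
    """file_table: {path: {"hash": str, "status": int}}; scanned: {path: hash}."""
    # S1103: provisionally mark every known entry of this level as deleted.
    for entry in file_table.values():
        entry["status"] = DELETED
    for path, new_hash in scanned.items():
        entry = file_table.get(path)
        if entry is None:                       # S1107: not registered yet
            file_table[path] = {"hash": new_hash, "status": NEW}
        elif entry["hash"] == new_hash:         # S1109: unchanged
            entry["status"] = NO_CHANGE
        else:                                   # S1110: content changed
            entry["hash"] = new_hash
            entry["status"] = UPDATED
    return file_table


file_table = {
    "/a": {"hash": "h1", "status": 0},
    "/b": {"hash": "h2", "status": 0},
    "/c": {"hash": "h3", "status": 0},
}
scanned = {"/a": "h1", "/b": "h2x", "/d": "h4"}
update_statuses(file_table, scanned)
print({path: entry["status"] for path, entry in file_table.items()})
```

Entries that the scan never touches keep the provisional "-1", which is how files that have disappeared are detected.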

Next, the scanner module 402 accesses the file management table 408, retrieves the entries of the designated folder level whose processing status 703 is "1", "2", or "-1", writes them to an index list, and saves the index list on the distributed file system 302 (S1111). In other words, only files that have changed in some way are extracted. The index list is a text file that records the file path and processing status of each file to be indexed.
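A sketch of S1111 under an assumed file format: one line per changed file, with the file path and the processing status separated by a tab. The description only states that the index list is a text file containing these two fields, so the separator and ordering are assumptions.

```python
def build_index_list(file_table):
    """Return the index list text for all entries whose status is not 0."""
    lines = []
    for path in sorted(file_table):
        status = file_table[path]["status"]
        if status != 0:                         # keep only 1, 2 and -1
            lines.append(f"{path}\t{status}")
    return "\n".join(lines) + ("\n" if lines else "")


statuses = {
    "/a": {"status": 0},
    "/b": {"status": 2},
    "/c": {"status": -1},
    "/d": {"status": 1},
}
index_list_text = build_index_list(statuses)
print(index_list_text, end="")
```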

The scanner module 402 then notifies the pipeline control module 404 of the index list's storage destination path and of the completion of generation (S1112).

Thereafter, the scanner module 402 retrieves the entries of the directories designated by the pipeline control module 404 from the file management table 408 and generates index lists while increasing the folder depth to 2, 3, and so on.

[Split index generation flow]
FIG. 12 shows the generation flow of the split indexes 202 performed by the index generation module 403. The index generation module 403 generates the split indexes 202 based on the index list supplied by the scanner module 402. For each index list, the processing of the index generation module 403 is divided into multiple processing units called tasks, which are processed in a distributed manner on the multiple distributed processing servers 103. Task generation and the processing on the distributed processing servers are described below.

First, the scanner module 402 notifies the pipeline control module 404 that index list generation has finished (S1201). The pipeline control module 404 then checks whether index generation can be started on the distributed processing servers 103 (S1202).

If index generation can be started on the distributed processing servers 103 (affirmative result in S1202), the pipeline control module 404 notifies the index generation module 403 of the start of split index generation and of the index list's storage destination path (S1203). If index generation cannot be started (negative result in S1202), the pipeline control module 404 waits for a fixed period (S12021) and then returns to the determination in S1202.

On receiving this notification, the index generation module 403 retrieves the index list from the distributed file system 302 (S1204). As the first distributed processing, the index generation module 403 performs steps S1205 to S1207 below.

First, the index generation module 403 divides the index list into an arbitrary number of parts (S1205). This number is determined from the number of distributed processing servers 103 and their processing performance. Because the index list is a text file that records the file path and processing status of each file to be indexed, it is divided simply by cutting it at arbitrary lines according to the number of divisions, which yields multiple index lists.
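Dividing the index list by lines (S1205) might look like the following sketch. Cutting into contiguous chunks of near-equal length is one possible choice; the description only requires that the list be cut at arbitrary lines.

```python
def split_index_list(lines, parts):
    """Cut the list into `parts` contiguous chunks of near-equal length."""
    base, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


chunks = split_index_list(list("abcdefg"), 3)
print(chunks)
```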

Each of the divided index lists is processed as one of multiple tasks on the distributed processing servers 103. Each task in the first distributed processing reads the file paths recorded in its divided index list (S1206) and queries the index ID table to obtain the index ID for each file path (S1207).

When all of the first task processing is complete, the index generation module 403 groups the results by index ID on the distributed processing servers 103 and generates index lists keyed by index ID (S1208).
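The first distributed processing and the grouping step resemble a map phase followed by a shuffle. A minimal single-process sketch, in which the resolver is a hypothetical stand-in for the index ID table lookup:

```python
from collections import defaultdict


def first_task(chunk, resolve_index_id):
    """S1206 to S1207: map each (path, status) row to (index_id, path, status)."""
    return [(resolve_index_id(path), path, status) for path, status in chunk]


def group_by_index_id(task_outputs):
    """S1208: merge all task outputs into one list per index ID."""
    grouped = defaultdict(list)
    for output in task_outputs:
        for index_id, path, status in output:
            grouped[index_id].append((path, status))
    return dict(grouped)


# Hypothetical resolver: a fixed mapping instead of a real hash lookup.
resolver = {"/a": 0, "/b": 3, "/c": 0}.__getitem__
chunks = [[("/a", 1), ("/b", 2)], [("/c", -1)]]
grouped = group_by_index_id(first_task(chunk, resolver) for chunk in chunks)
print(grouped)
```

In the actual system each call to first_task would run as a separate task on a distributed processing server.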

Next, as the second distributed processing, the index generation module 403 performs steps S1209 to S1212 below.

First, the index generation module 403 starts processing the index lists keyed by index ID (there is one list per index ID) as multiple tasks on the distributed processing servers 103.

Each task in the second distributed processing reads the file paths and processing statuses from its index list keyed by index ID (S1209).

Next, the task checks the processing status (S1210). If the processing status is "1" (= new file) or "2" (= file update), the task downloads the file from the file server 102 and then generates a split index (S1211). The split index generated at this point is created temporarily on the local storage 301 of the distributed processing server 103.

If, on the other hand, the processing status is "-1" (= deletion from the index), the task outputs the file path to a deletion index list (S1212). The deletion index list is a text file that lists, one per line, the file paths to be deleted from the split index already placed on the search server 101.
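The branching of S1210 to S1212 can be sketched as follows. The split index is modeled as a plain dictionary from file path to content, which is only a stand-in for a real search index, and fetch_file is a hypothetical callback representing the download from the file server 102.

```python
def second_task(rows, fetch_file):
    """rows: (path, status) pairs for one index ID."""
    split_index = {}        # stand-in for a real search index: path -> content
    deletion_list = []
    for path, status in rows:
        if status in (1, 2):            # new or updated: download and index (S1211)
            split_index[path] = fetch_file(path)
        elif status == -1:              # removed: record for later deletion (S1212)
            deletion_list.append(path)
    return split_index, deletion_list


files = {"/a": "alpha", "/b": "beta"}
result = second_task([("/a", 1), ("/b", 2), ("/c", -1)], files.__getitem__)
print(result)
```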

The index generation module 403 then uploads the split index and the deletion index list generated by the second task processing, as a set, to the distributed file system 302 on the distributed processing servers 103 (S1213).

After that, the index generation module 403 stores the upload locations in the split index storage destination path 603 and the deletion index list storage destination path 604 of the search server management table 407 (S1214), and notifies the pipeline control module 404 that split index generation is complete (S1215).

As described above, the first and second distributed processing are executed on the distributed processing servers 103, with the tasks running concurrently and in parallel. This increases the generation speed of the split indexes. The degree of distribution can be adjusted further by executing the second task processing using the virtual index IDs of the consistent hashing method.

Furthermore, the scanner module 402 and the index generation module 403 operate asynchronously. Therefore, when the scanner module 402 has finished generating several index lists, the processing of S1201 to S1213 can be multiplexed, which further increases the generation speed of the split indexes.

[Flow for placing split indexes on the search servers]
FIG. 13 shows the flow in which the index management module 203 places the split indexes generated by the index generation module 403 on the search server 101. At this point, the split indexes are stored on the distributed file system 302, not on the search server 101.

The flow shown in FIG. 13 starts when the pipeline control module 404 receives the notification from the index generation module 403 that generation of the split indexes 202 is complete (S1301). On receiving this notification, the pipeline control module 404 queries the search server management table 407 and obtains the placement destination search server name 602 using the index ID as the key (S1302).

Next, the pipeline control module 404 asks the index management module 203 on the identified search server 101 whether index processing is possible (S1303). If index processing is possible (affirmative result in S1303), the pipeline control module 404 instructs the index management module 203 to start index processing (S1304). If the index management module 203 is executing other processing, the pipeline control module 404 waits for a fixed period (S1305).

Next, the index management module 203 checks whether a split index already exists (S1306). If a split index 202 already exists on the same search server 101 (affirmative result in S1306), the index management module 203 downloads the split index 202 and the deletion index list corresponding to the index ID from the distributed file system 302 (S1307).

The index management module 203 deletes, from the existing split index 202 on the search server 101, the index entries named in the deletion index list (S1308). It then merges the downloaded split index 202 into the existing split index 202 to produce the latest split index 202 (S1309).
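The delete-then-merge sequence of S1308 and S1309 can be sketched with the split index again modeled as a dictionary from file path to indexed content. This is an assumption made for illustration; a real search index would use its own delete and merge operations.

```python
def apply_update(existing_index, downloaded_index, deletion_list):
    """Return the latest split index after S1308 (delete) and S1309 (merge)."""
    latest = dict(existing_index)       # work on a copy of the existing index
    for path in deletion_list:
        latest.pop(path, None)          # S1308: ignore paths already absent
    latest.update(downloaded_index)     # S1309: new and updated entries win
    return latest


existing = {"/a": "old", "/c": "gone"}
downloaded = {"/a": "new", "/d": "added"}
latest = apply_update(existing, downloaded, ["/c"])
print(latest)
```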

If, on the other hand, no split index 202 exists on the same search server 101 (negative result in S1306), the index management module 203 downloads the split index 202 corresponding to the index ID from the distributed file system 302 (S1310).

Subsequently, the index management module 203 requests the search module 204 to mount the split index 202 (S1311). The split index is thereby mounted on the search module 204, and searches can be executed.

Finally, the index management module 203 notifies the pipeline control module 404 that placement of the split index is complete, and the processing ends (S1312).

[Search server addition flow]
FIG. 14 shows the processing flow executed when a search server 101 is added to the search system.

This processing flow starts when the administrator inputs the addition of a search server 101 into the system management module 405 (S1401).

On accepting the addition of the search server 101, the system management module 405 assigns new index IDs to the newly added search server 101 (S1402). For example, when one search server 101 is added to a search system that has two search servers 101, and two split indexes 202 are placed on each search server 101, the newly added search server 101 is assigned index IDs 4 and 5.
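One way to derive the new index IDs in S1402 is to continue the ascending sequence from the largest existing ID. This is an assumption consistent with the example above, not a rule stated in the description.

```python
def allocate_new_index_ids(existing_ids, indexes_per_server):
    """Continue the ascending ID sequence for a newly added search server."""
    start = max(existing_ids) + 1 if existing_ids else 0
    return list(range(start, start + indexes_per_server))


new_ids = allocate_new_index_ids([0, 1, 2, 3], 2)
print(new_ids)
```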

Next, the system management module 405 creates entries for the newly generated index IDs 601 in the search server management table 407 and sets the placement destination search server name 602 in those entries (S1403). That is, it initializes the search server management table 407 for the new entries.

After that, the system management module 405 stores the hash values of the virtual index IDs 501 associated with the newly generated index IDs 502 in the index ID table 406 (S1404). That is, it initializes the index ID table for the new entries.
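The benefit of adding only new points to the table can be checked with a small experiment: after the new virtual index IDs are added, every file path either keeps its previous index ID or moves to one of the new index IDs. The sketch below assumes MD5 truncated to 64 bits as the hash and generates hypothetical file paths; neither comes from the description.

```python
import hashlib
from bisect import bisect_right


def ring_hash(text):
    # Assumption: MD5 truncated to 64 bits stands in for the unspecified hash.
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def build_table(index_ids, virtual_per_index=2):
    return sorted((ring_hash(f"{i}-{v}"), i)
                  for i in index_ids for v in range(virtual_per_index))


def lookup(table, path):
    points = [h for h, _ in table]
    return table[bisect_right(points, ring_hash(path)) - 1][1]


old_table = build_table([0, 1, 2, 3])
new_table = sorted(old_table + build_table([4, 5]))   # S1404: add new points only

paths = [f"/FileServer1/doc{n}.txt" for n in range(1000)]   # hypothetical paths
unexpected = []
for p in paths:
    before, after = lookup(old_table, p), lookup(new_table, p)
    if after != before and after not in (4, 5):
        unexpected.append(p)
print(len(unexpected))
```

No path moves between two pre-existing index IDs, which is why only entries destined for the new search server need to be extracted during relocation.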

Once the new virtual index IDs have been added to the index ID table 406, the pipeline control module 404 instructs the index management modules 203 on all search servers 101 targeted for relocation to start relocating their split indexes (S1405). In other words, the existing search servers 101 involved in the relocation are ordered to relocate their split indexes.

The index management module 203 of each search server 101 that receives the relocation instruction reads the file paths from the existing split index 202 one by one, starting from the beginning (S1406).

Next, the index management module 203 calculates the hash value of the file path, queries the index ID table 406, and obtains the index ID (S1407).

Next, the index management module 203 determines whether the obtained index ID is a newly added index ID (S1408). If the index ID is not new (negative result in S1408), the index management module 203 does nothing for that file path. If the index ID is new (affirmative result in S1408), the index management module 203 extracts the entry from the split index and generates, or adds the entry to, the split index associated with the new index ID (S1409).

The operations of S1406 to S1409 are performed for every file path registered in the split index 202. The new split indexes are generated temporarily on the local storage 201 of the search server 101.
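The loop of S1406 to S1409 can be sketched as follows, with the split index modeled as a dictionary from file path to indexed content and the resolver as a hypothetical stand-in for the hash calculation and index ID table lookup.

```python
def extract_for_new_ids(split_index, resolve_index_id, new_index_ids):
    """Move entries that now resolve to a new index ID out of split_index."""
    moved = {}
    for path in list(split_index):              # S1406: visit every file path
        index_id = resolve_index_id(path)       # S1407: look up the index ID
        if index_id in new_index_ids:           # S1408: only new IDs are moved
            moved.setdefault(index_id, {})[path] = split_index.pop(path)  # S1409
    return moved


index = {"/a": "A", "/b": "B", "/c": "C"}
resolver = {"/a": 0, "/b": 4, "/c": 5}.__getitem__    # hypothetical lookup
moved = extract_for_new_ids(index, resolver, {4, 5})
print(moved, index)
```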

The newly generated split indexes are then uploaded to the distributed file system 302 (S1410), and the pipeline control module 404 is notified that splitting is complete (S1411).

In the order in which it receives completion notifications from the index management modules 203 on the individual search servers 101, the pipeline control module 404 instructs the index management module 203 of the newly added search server 101 to start index placement processing (S1412).

The index management module 203 of the newly added search server 101 downloads the split indexes from the distributed file system 302 and repeats the split index merge processing (S1413). In this way, the split indexes 202 can be generated on the newly added search server 101.

[Search server deletion flow]
FIG. 15 shows the processing flow when a search server 101 is removed from the search system. This processing flow starts when the administrator inputs the removal of the search server 101 into the system management module 405 (S1501).

On accepting the removal of the search server 101, the system management module 405 obtains, from the search server management table 407, the index IDs 601 of the entries whose placement destination search server name 602 is the search server 101 to be removed, and calculates and obtains the virtual index IDs associated with those index IDs (S1502).

After that, the system management module 405 deletes the virtual index IDs obtained in S1502 from the index ID table 406 (S1503).

Next, the system management module 405 instructs, via the pipeline control module 404, the index management module 203 of the search server 101 to be removed to delete its index (S1504).

On receiving the instruction, the index management module 203 reads the file paths registered in the split index 202 in order from the beginning to the end (S1505).

Next, the index management module 203 calculates the hash value of each obtained file path and queries the index ID table for the index ID corresponding to the calculated hash value (S1506).

After that, the index management module 203 extracts the entry for the file path from the split index 202 and either generates a new split index associated with the obtained index ID or adds the index data to that split index. In this way, the index management module 203 generates the split indexes to be merged at the relocation destinations (S1507). When this splitting is finished, the local storage 201 of the search server 101 targeted for removal holds multiple split indexes, one per index ID.
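The repartitioning of S1505 to S1507 can be sketched as below. As in the other sketches, the split index is modeled as a dictionary and the resolver is a hypothetical stand-in for the lookup against the index ID table after the removed server's virtual index IDs have been deleted.

```python
def partition_by_index_id(split_index, resolve_index_id):
    """Regroup every entry of a split index under its newly resolved index ID."""
    parts = {}
    for path, data in split_index.items():      # S1505: first to last entry
        index_id = resolve_index_id(path)       # S1506: look up the new owner
        parts.setdefault(index_id, {})[path] = data   # S1507: create or append
    return parts


index_on_removed_server = {"/x": "X", "/y": "Y", "/z": "Z"}
resolver = {"/x": 0, "/y": 1, "/z": 0}.__getitem__    # hypothetical lookup
parts = partition_by_index_id(index_on_removed_server, resolver)
print(parts)
```

Each resulting part corresponds to one split index that is later uploaded and merged at its relocation destination.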

Next, the index management module 203 uploads the split indexes for each index ID generated in S1507 to the distributed file system 302 (S1508).

Subsequently, the index management module 203 notifies the system management module 405 of (1) the completion of split index generation for each index ID and (2) the storage location information on the distributed file system 302 (S1509).

On receiving this notification, the system management module 405 instructs the pipeline control module 404 to order the index management modules 203 on all search servers 101 targeted for relocation to merge the indexes (S1510).

Each index management module 203 that receives the instruction downloads the split indexes, merges them, and generates the latest split index 202 (S1511).

After the above is complete, the index management module 203 removes the targeted search server 101 from the system (S1512).

[Summary]
According to this embodiment, setting virtual nodes (virtual index IDs) in the consistent hash space onto which the hash values corresponding to the search index are mapped makes it possible to level the sizes of the split indexes and suppress skew at the same time. This improves search performance.

Also, according to this embodiment, the addition or deletion of split indexes that accompanies the physical addition or removal of a search server 101 can be handled flexibly by relocating the virtual nodes (virtual index IDs). As a result, the management of the multiple split indexes 202 associated with each search server 101 can be simplified.

Also, according to this embodiment, the pipelined generation of split indexes can be distributed across multiple distributed processing servers 103, which increases the generation speed of the split indexes. Furthermore, generating the split indexes on the distributed processing servers 103 separately for each index ID reduces unnecessary network traffic and disk I/O between the distributed processing servers during split index generation. This makes split index generation more efficient and faster.

Also, according to this embodiment, the index list, which supplies the file paths to be indexed, is generated for each arbitrary number of levels of the folder tree in the file server, which makes split index generation more efficient and faster.

Description of symbols
100 ... Search client
101 ... Search server
102 ... File server
103 ... Distributed processing server
104 ... Management server
105 ... Network
201 ... Local storage
202 ... Split index
203 ... Index management module
204 ... Search module
301 ... Local storage
302 ... Distributed file system
303 ... Distributed processing module
401 ... Local storage
402 ... Scanner module
403 ... Index generation module
404 ... Pipeline control module
405 ... System management module
406 ... Index ID table
407 ... Search server management table
408 ... File management table

Claims

A file search system that searches a large-scale file system, comprising:
a first processing function unit that scans the files and directories of the file system to generate a list of file paths;
a second processing function unit that executes: a process of assigning an index ID to each of the split indexes allocated to each search server; a process of associating each index ID with the search server on which it is to be placed; a process of associating n virtual index IDs with each index ID, based on the number of search servers and the number of split indexes to be placed on each search server; a process of obtaining a hash value (first hash value) in a consistent hash space from the character string of each virtual index ID; a process of generating a table indicating the correspondence between the hash value (first hash value) and the index ID; a process of obtaining a hash value (second hash value) in the consistent hash space from a file path read from the list; a process of determining the index ID to be associated with each file path by referring to the table on the basis of the hash value (second hash value); and a process of generating a split index for each index ID; and
a third processing function unit that places the generated split indexes on the search servers.
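
As an illustrative, non-limiting sketch, the table generation and index ID determination performed by the second processing function unit might look as follows in Python. Several details are assumptions that the claim does not fix: the use of MD5 truncated to 32 bits as the consistent hash space, the `"<index ID>#<k>"` format of the virtual index ID strings, and the rule that a file path belongs to the first table entry whose hash value is greater than or equal to its own, wrapping around at the end of the space. The function names are hypothetical.

```python
import bisect
import hashlib


def ring_hash(text: str) -> int:
    """Map a string to a point in a 32-bit hash space (MD5 is an assumption)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def build_table(index_ids, n_virtual):
    """Return a sorted list of (first hash value, index ID) pairs."""
    table = []
    for index_id in index_ids:
        for k in range(n_virtual):
            virtual_id = f"{index_id}#{k}"  # hypothetical virtual index ID string
            table.append((ring_hash(virtual_id), index_id))
    table.sort()
    return table


def lookup_by_hash(table, second_hash):
    """Return the index ID of the first entry at or after second_hash (with wrap-around)."""
    keys = [first_hash for first_hash, _ in table]
    pos = bisect.bisect_left(keys, second_hash)
    return table[pos % len(table)][1]


def lookup_index_id(table, file_path):
    """Determine the index ID for a file path from its second hash value."""
    return lookup_by_hash(table, ring_hash(file_path))
```

For example, `lookup_index_id(build_table(["idx-0", "idx-1"], 8), "/share/a.txt")` returns one of the two index IDs, and the same path always maps to the same index ID as long as the table is unchanged.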

The file search system according to claim 1,
wherein the second processing function unit divides the list into an arbitrary number of lists and distributes the split index generation processing for each divided list across a plurality of distributed processing systems.

The file search system according to claim 2,
wherein the second processing function unit divides the list for each index ID.
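
As a rough illustration only, dividing the list for each index ID and distributing the generation work could be sketched as below. A local thread pool stands in for the plurality of distributed processing systems, `toy_index_id` is a plain CRC32-modulo stand-in for the table lookup of claim 1 (it is not consistent hashing), and `build_split_index` merely sorts the paths instead of building a real search index. All names and the three index IDs are hypothetical.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

INDEX_IDS = ("idx-0", "idx-1", "idx-2")  # hypothetical index IDs


def toy_index_id(path):
    """Stand-in for the claim 1 table lookup (CRC32 modulo, an assumption)."""
    return INDEX_IDS[zlib.crc32(path.encode("utf-8")) % len(INDEX_IDS)]


def split_by_index_id(file_paths):
    """Divide the file path list into one split list per index ID."""
    split_lists = defaultdict(list)
    for path in file_paths:
        split_lists[toy_index_id(path)].append(path)
    return dict(split_lists)


def build_split_index(item):
    """Stand-in for index generation: returns (index ID, sorted file paths)."""
    index_id, paths = item
    return index_id, sorted(paths)


def generate_all(file_paths, workers=3):
    """Process each split list on its own worker and collect the results."""
    split_lists = split_by_index_id(file_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(build_split_index, split_lists.items()))
```

Because every worker receives only the paths of a single index ID, no worker needs data held by another worker to finish its split index, which is the property the description credits with reducing traffic between distributed processing servers.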

The file search system according to claim 1,
wherein the first processing function unit generates a list of the file paths for each arbitrary number of folder hierarchy levels in the file system and, each time a list is generated, gives the generated list to the second processing function unit.
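
One possible reading of this level-by-level list generation is sketched below: the folder tree is walked breadth-first and a list of file paths is handed over every `levels_per_list` levels, so that downstream processing can begin before the whole tree has been scanned. The function name, the parameter name, and the choice to skip symbolic links are assumptions made for this sketch.

```python
import os


def scan_by_levels(root, levels_per_list):
    """Yield sorted lists of file paths, one list per `levels_per_list` folder levels."""
    if levels_per_list < 1:
        raise ValueError("levels_per_list must be at least 1")
    frontier = [root]  # folders at the level about to be scanned
    while frontier:
        paths = []
        for _ in range(levels_per_list):
            next_frontier = []
            for folder in frontier:
                with os.scandir(folder) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            next_frontier.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            paths.append(entry.path)
            frontier = next_frontier
            if not frontier:
                break  # no deeper folders remain
        if paths:
            yield sorted(paths)
```

Because the function is a generator, each yielded list can be passed to the second processing function unit as soon as it is complete.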

The file search system according to claim 1, further comprising:
a fourth processing function unit that, when a search server is added, updates the correspondence between the index ID and the hash value (first hash value) of the virtual index ID set in the consistent hash space onto which the hash value uniquely calculated from a file path is mapped; and
a fifth processing function unit that executes relocation of the split indexes using the updated correspondence.
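
The effect of updating the correspondence when a search server is added can be illustrated with the following sketch. It assumes, beyond what the claim states, that the added search server receives a new index ID whose virtual index IDs are inserted into the table, that MD5 truncated to 32 bits serves as the hash space, and that a file path belongs to the first table entry at or after its hash value. Under those assumptions, the only file paths that need to be relocated are those that now fall to the new index ID. All names are hypothetical.

```python
import bisect
import hashlib


def ring_hash(text):
    """Map a string to a point in a 32-bit hash space (MD5 is an assumption)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def build_table(index_ids, n_virtual):
    """Sorted (first hash value, index ID) pairs, n_virtual entries per index ID."""
    return sorted((ring_hash(f"{i}#{k}"), i) for i in index_ids for k in range(n_virtual))


def lookup(table, file_path):
    """Index ID of the first table entry at or after the path's hash (with wrap-around)."""
    keys = [first_hash for first_hash, _ in table]
    pos = bisect.bisect_left(keys, ring_hash(file_path))
    return table[pos % len(table)][1]


def relocation_plan(old_table, new_table, file_paths):
    """Return {file path: (old index ID, new index ID)} for paths whose index ID changed."""
    plan = {}
    for path in file_paths:
        old_id, new_id = lookup(old_table, path), lookup(new_table, path)
        if old_id != new_id:
            plan[path] = (old_id, new_id)
    return plan
```

Calling `relocation_plan(build_table(ids, 50), build_table(ids + ["idx-3"], 50), paths)` lists only the paths that move to `idx-3`; entries that stay with their existing index ID do not appear in the plan and need no re-indexing.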

The file search system according to claim 1, further comprising
a sixth processing function unit that, when a search server is deleted, calculates the corresponding index ID from each file path registered in the split indexes allocated to the search server to be deleted, and generates a split index for merging for each relocation-destination search server corresponding to each index ID.
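
A minimal sketch of the grouping step for search server deletion is given below. It assumes that `lookup_index_id` already reflects the table after the deleted server's index IDs have been removed, and that `index_to_server` maps each remaining index ID to its search server; both are supplied by the caller, and the function and parameter names are hypothetical. The sketch only collects the file paths from which each merge split index would be generated and does not build an index.

```python
from collections import defaultdict


def plan_merge(paths_on_deleted_server, lookup_index_id, index_to_server):
    """Group file paths by relocation-destination server, then by index ID."""
    merge_inputs = defaultdict(lambda: defaultdict(list))
    for path in paths_on_deleted_server:
        index_id = lookup_index_id(path)     # recalculated from the file path
        server = index_to_server[index_id]   # relocation-destination search server
        merge_inputs[server][index_id].append(path)
    return {server: dict(by_index) for server, by_index in merge_inputs.items()}
```

Each inner list corresponds to one merge split index, which would then be generated and merged into the existing split index with the same index ID on the destination search server.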