JP2014186482A - Full-text search system

Info

Publication number: JP2014186482A
Application number: JP2013060248A
Authority: JP
Inventors: Masato Harada (原田 匡人); Ryo Nishimura (西村 涼)
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2013-03-22
Filing date: 2013-03-22
Publication date: 2014-10-02
Anticipated expiration: 2033-03-22
Also published as: JP6050165B2

Abstract

PROBLEM TO BE SOLVED: To provide a full-text search system capable of reducing retrieval time by reducing the amount of N-gram index data to be read. SOLUTION: The full-text search system segments each document into registration substrings and, for each registration substring, registers a character position list, which stores character position information representing the positions of the substring in each document, and a document ID list, which stores each document ID in association with the start position of the corresponding character position information in the character position list. The system segments a retrieval term into retrieval substrings, compares the document ID lists corresponding to the retrieval substrings with each other, and extracts all document IDs that they have in common. By acquiring the start position of the character position information for each retrieval substring from the document ID list, the system acquires only the relevant character position information from the character position list. For each extracted document ID, when the retrieval substrings appear adjacently in the same order as in the retrieval term before segmentation, the system determines that the document with that document ID contains the retrieval term.

Description

The present invention relates to a full-text search apparatus that searches a previously registered group of documents for documents containing a designated character string.

One method by which a full-text search apparatus quickly finds, in a large-scale document database, the documents that contain a designated search character string (hereinafter referred to as a "search term") is to scan the registered documents and create, in advance, an index of the character strings that appear in each document.

As an indexing method, Non-Patent Document 1 discloses a method in which the document IDs and appearance positions of a single character or two concatenated characters, depending on the character type (hereinafter referred to as an "N-gram term"), are held as an index (hereinafter referred to as an "N-gram index"), and it is determined whether the positional relationship in the search term equals the positional relationship in the N-gram index (hereinafter referred to as "adjacency determination").

Because the N-gram index is several times the data size of the document group, it is usually placed on a hard disk. At search time, the search term is divided into N-gram terms, and the N-gram index for each N-gram term is read from the hard disk to perform adjacency determination. To enable more complex searches, information is sometimes added to the N-gram index. Patent Document 1 discloses a method of adding, to the N-gram index, information on the structure in which characters appear, and thereby realizes structure-aware search over XML documents.

Patent Document 1: JP 2010-108191 A
Patent Document 2: JP-A-8-194718

Non-Patent Document 1: "A method of high-speed full-text search for Japanese documents" (The IEICE Transactions D-1, Vol. J75-D-1, No. 9, pages 836-846, September 1992)

Characters that appear frequently in the document group (hereinafter referred to as "frequent characters") have N-gram indexes of large data size. When a search term contains a frequent character, the large N-gram index of that character must be read from the hard disk, so the search takes time. Likewise, when information is added to the N-gram index to support complex searches, as in Patent Document 1, the amount of data read from the hard disk increases accordingly, and the search again takes time.

To address this problem of slow searches, Patent Document 2 discloses a method in which N-gram indexes of small data size are created for each frequent character with one additional character appended, and search time is shortened by reading these small N-gram indexes from the hard disk at search time.

Patent Document 2 shortens search time by using the one character that follows a frequent character in the search term to reduce the size of the N-gram index read at search time. However, the method is effective only for the one character that follows the frequent character; it does not help when the frequent character appears at the end of the search term, or when the search term consists only of frequent words.

In addition, an N-gram index must be created for every kind of character that appears immediately after a frequent character in the document group, so the N-gram index as a whole becomes larger and takes longer to create.

An object of the present invention is to provide a full-text search apparatus that shortens search time by dividing the N-gram index into two parts, document IDs and character position information, and by using the characters in the search term other than the frequent characters to reduce the amount of the frequent characters' N-gram index data that is read from the hard disk.

To achieve the above object, the present invention provides the following configuration.
An aspect of the full-text search apparatus according to the present invention comprises: means for dividing each of a plurality of documents into registration partial character strings of a predetermined number of characters and registering, for each registration partial character string, a character position list storing character position information that indicates the positions of that registration partial character string in each document, and a document ID list storing the start position of the character position information in the character position list in association with the document ID of each document;
means for dividing an acquired search term into search partial character strings having the same number of characters as the registration partial character strings, comparing with each other the document ID lists stored for the registration partial character strings identical to the respective search partial character strings, and thereby extracting all document IDs included in common in all of the compared document ID lists;
means for acquiring, for each search partial character string, from the document ID list, the start position of the character position information associated with each of the extracted common document IDs;
means for acquiring, for each search partial character string, only the corresponding character position information from the character position list on the basis of the acquired start positions; and
means for determining, for each of the extracted common document IDs and on the basis of the character position information acquired for each search partial character string, whether the search partial character strings appear adjacently in the same order as in the search term before division, and determining, when they do, that the document with that document ID contains the search term.

The full-text search apparatus of the present invention has the following effect: by minimizing the N-gram index data read from the hard disk, search processing can be sped up.

FIG. 1 is a diagram showing the system configuration in an example to which the present invention is applied.
FIG. 2 is a diagram showing the data structure of the N-gram index in an example to which the present invention is applied.
FIG. 3 is a diagram showing the document registration processing in an example to which the present invention is applied.
FIG. 4 is a diagram showing the document search processing in an example to which the present invention is applied.
FIG. 5 is a diagram showing the concrete data flow of the document search processing in an example to which the present invention is applied.

Embodiments of the present invention are described below with reference to the drawings, which show an example of the present invention.
FIG. 1 is a configuration diagram of an entire system that includes the full-text search apparatus according to the present invention.
Reference numeral 110 denotes a search client, 120 a communication line such as a LAN, and 130 a search server. The search client 110 and the search server 130 are connected by the communication line 120.

The search server 130 includes a full-text search apparatus 131 that performs registration processing and search processing for the plurality of documents to be searched, a search memory 132 such as a RAM used during searches, and a hard disk 133, a typical example of an auxiliary storage device, on which the N-gram index 136 created from the registered documents is registered. The N-gram index registered on the hard disk 133 consists of a document ID list 134 and a character position list 135. The search server 130 is an appropriate computer on which a predetermined program is installed; it functions as the full-text search apparatus 131 when the CPU loads the program into memory and executes it.
In the following description of the drawings, the reference numerals of FIG. 1 may be referred to.

FIG. 2 is a diagram showing an example of the data structure of the N-gram index registered on the hard disk 133 of FIG. 1. As shown in the figure, the N-gram index 136 consists of N-gram terms 210 and 211, which are registration partial character strings obtained by dividing each of the registered documents into units of a predetermined number of characters, together with document ID lists 134A and 134B and character position lists 135A and 135B created for the respective N-gram terms 210 and 211. FIG. 2 shows an example in which documents are divided into one-character N-gram terms; of the resulting N-gram terms, only two, "あ" (a) 210 and "め" (me) 211, are shown as representatives.

In the document ID lists 134A and 134B for the N-gram terms 210 and 211, the document ID of each document is registered together with an offset indicating the start position of that document's character position information in the character position list 135A or 135B. Entries in the document ID lists 134A and 134B are registered in ascending order of document ID.

In the character position lists 135A and 135B for the N-gram terms 210 and 211, character position information is registered for each document. It consists of an appearance count, indicating how many times the N-gram term 210 or 211 appears in the document, followed by that many character positions, each indicating where the N-gram term appears in the document. The start position of this character position information within the character position list 135A or 135B is what is registered as the offset in the document ID list 134A or 134B. Entries in the character position lists 135A and 135B are registered in ascending order of document ID, and the character positions within each entry are in ascending order.

For example, when the document "あめがふる" (ame ga furu) with document ID "001" is registered, the document ID list 134A corresponding to the N-gram term "あ" 210, the first character, contains document ID "001" and the offset "1" into the character position list 135A. At that offset, the character position list 135A holds the appearance count "3" 210a of "あ" in the document with document ID "001" and the character positions "1", "3", and "12" 210b. The document ID list 134A also contains document ID "900" and the offset "4000" into the character position list 135A. At that offset, the character position list 135A holds the appearance count "1" 210c of "あ" in the document with document ID "900" and the character position "2" 210d.

Similarly, when the document "あめがふる" with document ID "001" is registered, the document ID list 134B corresponding to the N-gram term "め" 211, the second character, contains document ID "001" and the offset "1" into the character position list 135B. At that offset, the character position list 135B holds the appearance count "1" 211a of "め" in the document with document ID "001" and the character position "2" 211b. The document ID list 134B also contains document ID "900" and the offset "10000" into the character position list 135B. At that offset, the character position list 135B holds the appearance count "1" 211c of "め" in the document with document ID "900" and the character position "3" 211d.
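The two-list layout of FIG. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the storage format of the patent: the class name `TermIndex`, its method names, the flat in-memory lists, and the 0-based offsets are all hypothetical (the figure uses offsets such as 1 and 4000). The position values for "あ" are the ones given in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class TermIndex:
    """Index for one N-gram term (hypothetical name, for illustration only)."""
    # Document ID list: (doc_id, offset) pairs in ascending doc_id order.
    doc_ids: list = field(default_factory=list)
    # Character position list: per document, [count, pos_1, ..., pos_count].
    positions: list = field(default_factory=list)

    def add_document(self, doc_id, char_positions):
        """Append one document's entry; doc_id must exceed all earlier IDs."""
        if self.doc_ids and doc_id <= self.doc_ids[-1][0]:
            raise ValueError("document IDs must be added in ascending order")
        offset = len(self.positions)      # start of this entry (0-based here)
        self.doc_ids.append((doc_id, offset))
        self.positions.append(len(char_positions))
        self.positions.extend(sorted(char_positions))

    def positions_for(self, offset):
        """Read only the entry that starts at `offset`."""
        count = self.positions[offset]
        return self.positions[offset + 1 : offset + 1 + count]


# Entries for the term "あ" from the example: document 1 has three
# occurrences (positions 1, 3, 12) and document 900 has one (position 2).
a_index = TermIndex()
a_index.add_document(1, [1, 3, 12])
a_index.add_document(900, [2])
```

Keeping the offset next to each document ID is what later allows a search to jump to a single document's positions without scanning the rest of the character position list.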

FIG. 3 is a flowchart showing the registration processing in the full-text search apparatus 131 shown in FIG. 1.
Document registration starts when the search client 110 sends a registration request and the group of documents to be registered to the search server 130 (step 310). The search server 130 assigns a document ID to each document to be registered.

The search server 130 divides each document to be registered, as sent from the search client 110, entirely into N-gram terms, which are registration partial character strings of one or two characters. When dividing, it stores each N-gram term paired with its character position in the document (step 320).

Next, the N-gram terms and character positions created in step 320 are grouped so that identical N-gram terms are collected together (step 330).
For each grouped N-gram term, the stored character positions are appended to the end of the character position list 135 on the hard disk 133. The character positions are appended in ascending order. When appending, the end position of the list before the append is recorded as the offset for that N-gram term (step 340).

Next, for each N-gram term whose character position list 135 was extended in step 340, the document ID and the offset recorded in step 340 are appended to the end of the document ID list 134 of the same N-gram term. The document ID used here is the one assigned internally by the search server 130, which ensures that document IDs always appear in ascending order in the document ID list 134 (step 350).

Registration is complete once document IDs have been added to the N-gram index for every N-gram term that appears in the documents to be registered (step 360).
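Steps 310 to 360 can be sketched as below. The sketch makes several assumptions that are not in the source: a Python dict stands in for the hard disk, terms are one character long, character positions are 1-based, offsets are 0-based list indexes, and the function name `register_documents` and the sample documents are hypothetical.

```python
from collections import defaultdict


def register_documents(index, documents, next_doc_id=1):
    """Register documents into `index` and return the next free document ID.

    `index` maps each 1-character term to a dict with two lists:
      "doc_ids":   [(doc_id, offset), ...] in ascending doc_id order
      "positions": [count, pos, ..., count, pos, ...]
    """
    for text in documents:
        doc_id = next_doc_id                  # step 310: assign a document ID
        next_doc_id += 1

        # Steps 320-330: split into terms, pair each with its 1-based
        # character position, and group the positions by term.
        grouped = defaultdict(list)
        for pos, term in enumerate(text, start=1):
            grouped[term].append(pos)

        for term, char_positions in grouped.items():
            entry = index.setdefault(term, {"doc_ids": [], "positions": []})
            # Step 340: remember the end of the list before appending.
            offset = len(entry["positions"])
            entry["positions"].append(len(char_positions))
            entry["positions"].extend(char_positions)   # already ascending
            # Step 350: append (doc_id, offset); IDs only grow, so the
            # document ID list stays in ascending order.
            entry["doc_ids"].append((doc_id, offset))
    return next_doc_id


index = {}
next_id = register_documents(index, ["あめがふる", "あさのあめ"])
```

Because every list is only ever appended to, the ascending order that the search side relies on falls out of the registration order without any sorting step.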

FIG. 4 is a flowchart showing the search processing performed by the full-text search apparatus when a search character string of three or more characters is designated as the search term (hereinafter referred to as "character string search"). The character string search processing is described below with reference to FIG. 4.

A character string search starts when the search client 110 sends a search term and a search execution request to the search server 130 (step 410).

The search server 130 divides the search term received from the search client 110 into N-gram terms, which are search partial character strings of one or two characters (step 420). The number of characters in each search partial character string is the same as the number of characters in the registration partial character strings produced in step 320 of the registration processing shown in FIG. 3.

Next, one of the N-gram terms is selected, and the document ID list of that N-gram term (hereinafter referred to as the "reference document ID list") is read from the hard disk 133 into the search memory 132 (step 430).

Then another of the N-gram terms is selected, and its document ID list (hereinafter referred to as the "comparison document ID list") is read from the hard disk 133 into the search memory 132 (step 440).

A document whose ID appears in the reference document ID list but not in the comparison document ID list cannot contain the search term designated in step 410, and this is certain without comparing character positions. The document IDs are therefore narrowed down so that only those appearing in both the reference document ID list and the comparison document ID list remain. The narrowed-down result is placed in the search memory as the new reference document ID list (step 450).

If N-gram terms that have not yet been compared remain, processing returns to step 440, where the next comparison document ID list is read and compared with the new reference document ID list to narrow down the document IDs further. If no N-gram terms remain to be compared, processing proceeds to step 470 (step 460). The reference document ID list finally obtained in this way contains all of, and only, the document IDs common to the document ID lists of the N-gram terms used for the search.
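The narrowing loop of steps 430 to 460 can be sketched as a repeated intersection of ascending lists. The function names and the sample ID lists are hypothetical, the lists hold only document IDs (offsets are left out for brevity), and the early exit on an empty result is an addition of this sketch that the source does not describe.

```python
def intersect_sorted(base, other):
    """Return the IDs present in both ascending lists (two-pointer merge)."""
    result = []
    i = j = 0
    while i < len(base) and j < len(other):
        if base[i] == other[j]:
            result.append(base[i])
            i += 1
            j += 1
        elif base[i] < other[j]:
            i += 1
        else:
            j += 1
    return result


def common_document_ids(doc_id_lists):
    """Narrow a reference list against each comparison list in turn."""
    reference = list(doc_id_lists[0])                         # step 430
    for comparison in doc_id_lists[1:]:                       # steps 440, 460
        reference = intersect_sorted(reference, comparison)   # step 450
        if not reference:          # no document can match any more
            break
    return reference


ids_a = [1, 5, 42, 900]       # hypothetical document ID list for "あ"
ids_me = [1, 7, 900, 1200]    # hypothetical document ID list for "め"
print(common_document_ids([ids_a, ids_me]))  # [1, 900]
```

The merge works in a single pass because registration keeps every document ID list in ascending order.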

For the document IDs in the reference document ID list in memory, the character position information of each N-gram term is read from the character position list 135 on the hard disk 133 into the search memory 132 (step 470). Because the reference document ID list carries the offsets recorded in step 340, which give the start position of each piece of character position information, only the character position information of documents that may be hits needs to be read from the hard disk 133. The ratio of the data size of the document ID list 134 to that of the character position list 135 is 1 : (average number of appearances of the N-gram term per document), so the character position information is the larger of the two. Reading less of it from the hard disk 133 than conventional methods do speeds up search processing.
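The offset-driven read of step 470 can be illustrated with a seekable byte stream standing in for the hard disk. Everything about the encoding is an assumption of this sketch: 4-byte little-endian integers, byte offsets, and the helper names `write_position_list` and `read_positions`. The entry for document 5 is invented to show an entry that is skipped.

```python
import io
import struct

INT = struct.Struct("<I")   # assumed encoding: 4-byte little-endian unsigned int


def write_position_list(entries):
    """Serialize [(doc_id, [positions])]; return (bytes, {doc_id: offset})."""
    buf = io.BytesIO()
    offsets = {}
    for doc_id, positions in entries:
        offsets[doc_id] = buf.tell()          # byte offset of this entry
        buf.write(INT.pack(len(positions)))   # appearance count
        for pos in positions:
            buf.write(INT.pack(pos))          # character positions
    return buf.getvalue(), offsets


def read_positions(stream, offset):
    """Seek to one entry and read only its count and positions."""
    stream.seek(offset)
    (count,) = INT.unpack(stream.read(INT.size))
    data = stream.read(INT.size * count)
    return [value for (value,) in INT.iter_unpack(data)]


# Character position list for "あ"; document 5 is a hypothetical non-hit.
data, offsets = write_position_list([(1, [1, 3, 12]), (5, [7]), (900, [2])])
stream = io.BytesIO(data)

# Only the candidate documents 1 and 900 are read; document 5 is skipped.
hits = {doc_id: read_positions(stream, offsets[doc_id]) for doc_id in (1, 900)}
```

With a frequent character, most entries in the character position list belong to documents that were already eliminated by the document ID comparison, so seeking past them is where the saving comes from.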

With reference to the character position information read in step 470, it is checked for each document ID whether the N-gram terms appear adjacently in the same order as in the search term before division (hereinafter referred to as "adjacency matching"). If they do, the document ID is recorded as a search hit (step 480); that is, the document with that document ID is determined to contain the search term.
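Adjacency matching for a single document can be sketched as follows. The sketch assumes one-character N-gram terms, so the k-th term is expected exactly k positions after the first; the function name `terms_adjacent` is hypothetical, and terms of two characters would need a different step.

```python
def terms_adjacent(position_lists):
    """Return True if the terms occur consecutively, in order, in one document.

    position_lists[k] holds the character positions of the k-th
    1-character search term within the document.
    """
    if not position_lists:
        return False
    later = [set(p) for p in position_lists[1:]]
    # Try every occurrence of the first term as a possible start position.
    return any(
        all(start + k in positions for k, positions in enumerate(later, start=1))
        for start in position_lists[0]
    )


# Document 001 of the example: "あ" at 1, 3, 12 and "め" at 2.
print(terms_adjacent([[1, 3, 12], [2]]))   # True
# A document where the two characters are far apart does not match.
print(terms_adjacent([[1], [5]]))          # False
```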

When the adjacency matching of step 480 has been completed for all document IDs in the reference document ID list, the list of hit documents is returned to the search client 110 as the search result, and the search processing is complete (step 490).

The character string search processing of FIG. 4 is explained below using the N-gram index data structure of FIG. 2 and the concrete search example of FIG. 5.

In this example, the N-gram index in which a plurality of documents have been registered is in the state shown in FIG. 2, and a search is performed with "あめ" (ame) 510 as the search term.

First, "あめ" 510, input from the search client 110, is split into the one-character N-gram terms "あ" and "め" 520 (steps 410 and 420).

Next, "あ" is selected as the first N-gram term, and its document ID list 134A is read from the hard disk 133 into the search memory 132 as the reference document ID list. The reference document ID list 531 in the search memory 132 is the document ID list 134A of "あ" as read (step 430).

Next, "め" is selected as the remaining N-gram term, and its document ID list 134B is read from the hard disk 133 into the search memory 132 as the comparison document ID list. The comparison document ID list 532 in the memory 530 is the document ID list 134B of "め" as read (step 440).

Next, the reference document ID list 531 and the comparison document ID list 532 in the search memory 132 are compared, and a new reference document ID list 533 is created from the document IDs 001 and 900, which appear in both. In this example, it is assumed that no common document ID appears in the omitted portions of the document ID lists 531 and 532 (step 450).

At this point, no N-gram term remains whose document ID list has not been read. The new reference document ID list 533 contains only the document IDs common to the two original document ID lists 134A and 134B corresponding to the search N-gram terms "あ" and "め". Then, using the document IDs in the new reference document ID list 533 and their offsets into the character position lists, only the character position information for document IDs "001" and "900" is read from the character position lists 135A and 135B of "あ" and "め" in FIG. 2, from the hard disk 133 into the search memory 132. The character positions read for the N-gram term "あ" form the character position list 534, and those read for "め" form the character position list 535 (steps 460 and 470). Because only part of the character position lists 135A and 135B of FIG. 2, rather than the whole, is read from the hard disk 133, less data is read than in the prior art, and search processing is sped up.

Next, the character position lists 534 and 535 show that in document ID 001, "あ" appears at the first character 210b and "め" at the second character 211b, so document ID 001 contains the search condition "あめ" 510. Likewise, in document ID 900, "あ" appears at the second character 210d and "め" at the third character 211d, so document ID 900 also contains the search condition "あめ" 510 (step 480).

Finally, document IDs 001 and 900 are returned to the search client 110 as the search result, and the search processing ends (step 490).
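The walkthrough of FIG. 5 can be reproduced end to end with the short sketch below. The index contents follow the values given for FIG. 2, but the representation is an assumption: an in-memory dict replaces the hard disk, offsets are 0-based list indexes rather than the figure's 1, 4000, and 10000, and Python sets replace the merge of sorted lists for brevity. The names `INDEX`, `read_entry`, and `search` are hypothetical.

```python
# Index contents after FIG. 2 (only documents 001 and 900 are shown there).
INDEX = {
    "あ": {"doc_ids": [(1, 0), (900, 4)], "positions": [3, 1, 3, 12, 1, 2]},
    "め": {"doc_ids": [(1, 0), (900, 2)], "positions": [1, 2, 1, 3]},
}


def read_entry(entry, doc_id):
    """Step 470: follow the offset and read one document's positions only."""
    offset = dict(entry["doc_ids"])[doc_id]
    count = entry["positions"][offset]
    return set(entry["positions"][offset + 1 : offset + 1 + count])


def search(index, search_term):
    terms = list(search_term)                 # step 420: 1-character terms
    if not terms or any(t not in index for t in terms):
        return []
    # Steps 430-460: keep only document IDs present in every term's list.
    common = {doc_id for doc_id, _ in index[terms[0]]["doc_ids"]}
    for t in terms[1:]:
        common &= {doc_id for doc_id, _ in index[t]["doc_ids"]}
    hits = []
    for doc_id in sorted(common):
        per_term = [read_entry(index[t], doc_id) for t in terms]
        # Step 480: the k-th term must sit k positions after the first.
        if any(all(start + k in per_term[k] for k in range(len(terms)))
               for start in per_term[0]):
            hits.append(doc_id)
    return hits                               # step 490


print(search(INDEX, "あめ"))  # [1, 900]
```

The result matches the walkthrough: document 001 has "あ" at position 1 followed by "め" at position 2, and document 900 has them at positions 2 and 3.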

110: Search client
120: Communication line
130: Search server
131: Full-text search apparatus
132: Search memory
133: Hard disk
134: Document ID list
135: Character position list
136: N-gram index
210: N-gram term "あ" for registration
211: N-gram term "め" for registration
134A: Document ID list of "あ"
134B: Document ID list of "め"
135A: Character position list of "あ"
135B: Character position list of "め"
510: Search term
520: N-gram terms for search, obtained by dividing the search term
531: Document ID list of "あ"
532: Document ID list of "め"
533: Reference document ID list
534: Character position list of "あ"
535: Character position list of "め"

Claims

A full-text search apparatus comprising: means for dividing each of a plurality of documents into registration partial character strings of a predetermined number of characters and registering, for each registration partial character string, a character position list storing character position information that indicates the positions of that registration partial character string in each document, and a document ID list storing the start position of the character position information in the character position list in association with the document ID of each document;
means for dividing an acquired search term into search partial character strings having the same number of characters as the registration partial character strings, comparing with each other the document ID lists stored for the registration partial character strings identical to the respective search partial character strings, and thereby extracting all document IDs included in common in all of the compared document ID lists;
means for acquiring, for each search partial character string, from the document ID list, the start position of the character position information associated with each of the extracted common document IDs;
means for acquiring, for each search partial character string, only the corresponding character position information from the character position list on the basis of the acquired start positions; and
means for determining, for each of the extracted common document IDs and on the basis of the character position information acquired for each search partial character string, whether the search partial character strings appear adjacently in the same order as in the search term before division, and determining, when they do, that the document with that document ID contains the search term.