JP2012014646A - Document retrieval device, document retrieval method, and program

Info

Publication number: JP2012014646A
Application number: JP2010153273A
Authority: JP
Inventors: Motohiro Akaishizawa; 元博赤石沢
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-07-05
Filing date: 2010-07-05
Publication date: 2012-01-19
Anticipated expiration: 2030-07-05
Also published as: JP5560971B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval device capable of performing retrieval at high speed without causing retrieval omissions. SOLUTION: The document retrieval device includes: an index creation part 6 which uses, as a key, a character string of N characters obtained by dividing a document, or a character string of N+1 or more characters that is used frequently in retrieval, and creates a mixed N-gram index holding an identifier of each document including the key and the appearance position of the key in that document; and a retrieval part 3 which outputs a list of documents including the input retrieval key by referring to the mixed N-gram index. When no key matching the retrieval key exists in the mixed N-gram index, the retrieval part 3 searches the mixed N-gram index with each of the character strings obtained by dividing the retrieval key into strings of N characters; when a key matching each character string exists, it determines, based on the appearance positions of the keys in the documents including them, whether or not each document includes the retrieval key, and outputs a list of the documents determined to include the retrieval key.

Description

The present invention relates to a document search apparatus, a document search method, and a program.

Full-text search techniques are known that search a large number of document files stored in a storage device for documents containing an input keyword. A document search apparatus that performs full-text search generally creates an index of the target document files in advance and, at search time, matches the input search keyword against the keywords in this index.

Conventional document search apparatuses fall roughly into two types according to how the index is created. One is the morphological analysis method, which uses a dictionary to divide a document into words and uses the words as index keys. The other is the N-gram method, which divides a document into strings of N characters and uses them as index keys.

The morphological analysis method has the advantage of fast search, but index creation is slow. In addition, if a word is segmented incorrectly when the index is created, search omissions occur. Avoiding incorrect segmentation requires maintaining the dictionary used for index creation.

The N-gram method is widely used because index creation is fast and search omissions are few. However, the search key input at search time must be divided into N-character keys and the index search results for the individual keys must be merged, so the search is slow. Because performance degrades when N is either too small or too large, the bi-gram method, which divides a document into two-character strings, is generally used.
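As a rough illustration of the bi-gram division described above, the following Python sketch extracts every two-character string of a text together with its start position. The function name `ngrams_with_positions` is hypothetical, and the sample sentence is the one the embodiment later uses for file1.txt.

```python
def ngrams_with_positions(text: str, n: int = 2) -> list[tuple[str, int]]:
    """Return each n-character substring of `text` with its start offset."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


# The sample sentence has 9 characters, so the bi-gram split yields 8 keys.
grams = ngrams_with_positions("今日は良い天気です")
print(grams[:3])  # first three (key, position) pairs
```

Each position contributes one key, which is why index creation needs no dictionary, and why a long search key later has to be reassembled from several two-character lookups.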

Patent Document 1 describes an information search method for full-text search of text data. In this method, the text to be searched is divided into words; a character string index with word information is created from the divided text, each entry being N characters long and carrying word information that indicates word boundaries; and a search term is looked up in that index by character string search, word search, or both.

Patent Document 1: JP 2001-034623 A

The method described in Patent Document 1 tries to reduce search noise by examining word boundaries, but it may produce more search omissions than the ordinary N-gram method.

Accordingly, an object of the present invention is to provide a document search apparatus that can search at high speed without causing search omissions.

A document search apparatus according to the present invention includes: an index creation unit that creates a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds the identifier of each document including the key and the appearance position of the key in that document; and a search unit that outputs a list of documents including an input search key by referring to the mixed N-gram index. The search unit first searches the mixed N-gram index with the search key itself. If a key matching the search key exists, it outputs a list of the documents including that key. If no key matching the search key exists, it divides the search key into character strings of N characters and searches the mixed N-gram index with each of them; if a key matching each character string exists, it determines, based on the appearance positions of the keys in the documents including them, whether each document includes the search key, and outputs a list of the documents determined to include the search key.

According to the present invention, it is possible to obtain a document search apparatus that can search at high speed without causing search omissions.

FIG. 1 is a block diagram showing the configuration of a document search apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram explaining the operation of the document search apparatus according to the embodiment of the present invention.
FIG. 3 is a flowchart of the operation of the document search apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the document search apparatus according to the embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the document search apparatus according to the embodiment of the present invention.

Next, embodiments for carrying out the present invention will be described in detail with reference to the drawings.
Embodiment
FIG. 1 is a block diagram showing the configuration of a document search apparatus 10 according to an embodiment of the present invention.
The document search apparatus 10 includes an input unit 1, an output unit 2, a search unit 3, a search key index storage unit 4, a mixed bi-gram index (mixed N-gram index) storage unit 5, and an index creation unit 6. Among N-gram methods, the document search apparatus 10 uses the bi-gram method, which divides a document into two-character strings.

The input unit 1, the output unit 2, the search unit 3, and the index creation unit 6 represent modules of operations that a computer processor performs according to a program. The search key index storage unit 4 and the mixed bi-gram index storage unit 5 are implemented by a memory or a hard disk.

The input unit 1 is implemented by a keyboard, a mouse, or the like with which the user inputs a search key when executing a search.

The output unit 2 is implemented by a display device, such as a display, that shows the list of document files and the like hit as search results.

The search unit 3 refers to the mixed bi-gram index storage unit 5 and outputs, as its result, a list of the document files that include the search key input via the input unit 1. The search unit 3 first looks up the search key itself in the mixed bi-gram index storage unit 5. If there are document files with a keyword matching the search key, it passes the hit document files to the output unit 2 as a list.

If no document file has a keyword matching the search key and the search key is longer than two characters, the search unit 3 divides the search key into two-character strings, looks each of them up in the mixed bi-gram index storage unit 5, obtains the document files that include the search key based on the file IDs and character appearance positions of the hit records, and passes the obtained document files to the output unit 2 as a list. In addition, when the search key is longer than two characters, the search unit 3 registers the search key in the search key index storage unit 4.

The search key index storage unit 4 stores an index that is keyed by the search keys used in searches and holds, for each key, the number of searches and the date and time of the last search.

The mixed bi-gram index storage unit 5 stores an index whose keys are the two-character strings taken from each character position of a document, together with the search keys that have been used in searches.

The index creation unit 6 scans a document file, creates the mixed bi-gram index, and stores it in the mixed bi-gram index storage unit 5. That is, in addition to the two-character keys of an ordinary bi-gram index, the index creation unit 6 refers to the search key index storage unit 4 and adds keys longer than two characters.
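The two indexes can be pictured as simple in-memory mappings. The sketch below is one possible Python representation, assuming a posting is a (file ID, appearance position) pair. The class and variable names and the count and date values are hypothetical; the postings follow from the sample texts and example values used later in the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SearchKeyRecord:
    """One row of the search key index (storage unit 4)."""
    search_count: int
    last_searched: datetime


# Search key index: search key -> usage statistics (values are made up).
search_key_index: dict[str, SearchKeyRecord] = {
    "良い天気": SearchKeyRecord(search_count=5, last_searched=datetime(2010, 7, 1)),
}

# Mixed bi-gram index (storage unit 5): key -> list of (file ID, position).
mixed_index: dict[str, list[tuple[int, int]]] = {
    "今日": [(1, 0), (2, 0)],
    "良い": [(1, 3), (2, 3)],
    "天気": [(1, 5)],
    "良い天気": [(1, 3)],  # frequently searched key longer than two characters
}

# Keys of both lengths live in the same mapping, so one lookup serves both.
long_keys = [k for k in mixed_index if len(k) > 2]
```

Keeping the long keys in the same mapping as the two-character keys is what lets the search unit try the whole search key first with a single lookup.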

Next, the operation of the document search apparatus 10 will be described.
In a typical search apparatus, when a folder, a time, and so on are specified, the document files in the folders under the specified folder are checked periodically, and if a file has been updated, the update is reflected in the index. When document files on the Web are the search target, the index is updated by following the URLs in html files instead of traversing folders. Here, for simplicity, the index is updated for a file when that file is specified.

First, the operation of the index creation unit 6 will be described with reference to FIGS. 2 and 3.
The index creation unit 6 creates the mixed bi-gram index from document files such as file1.txt and file2.txt. As shown in FIG. 2, the content of file1.txt is 「今日は良い天気です。・・・」 ("The weather is nice today. ...") and the content of file2.txt is 「今日は良い日です。・・・」 ("Today is a good day. ...").

First, the index creation unit 6 receives file1.txt and its ID (1) (FIG. 3: step S1).

The correspondence between file names and IDs is held in the "file ID correspondence table" shown in FIG. 2. The file ID correspondence table can be held, for example, in the mixed bi-gram index storage unit 5. The file name itself may be stored in the mixed bi-gram index storage unit 5 instead of the file ID. However, keeping it as a character string increases the storage requirement and lowers processing efficiency, so here the file name is converted to an ID.
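A file ID correspondence table of this kind might be sketched in Python as a pair of dictionaries. The class name `FileIdTable` and the choice of sequential IDs starting at 1 are assumptions made for this illustration; the description only states that file names are mapped to IDs.

```python
class FileIdTable:
    """Two-way mapping between file names and compact integer IDs."""

    def __init__(self) -> None:
        self._id_by_name: dict[str, int] = {}
        self._name_by_id: dict[int, str] = {}

    def id_for(self, name: str) -> int:
        """Return the ID of `name`, assigning the next free ID if it is new."""
        if name not in self._id_by_name:
            new_id = len(self._id_by_name) + 1  # assumed numbering scheme
            self._id_by_name[name] = new_id
            self._name_by_id[new_id] = name
        return self._id_by_name[name]

    def name_for(self, file_id: int) -> str:
        """Return the file name registered for `file_id`."""
        return self._name_by_id[file_id]


table = FileIdTable()
first_id = table.id_for("file1.txt")
second_id = table.id_for("file2.txt")
```

Storing the small integer in every posting instead of the file name is the storage and efficiency saving the paragraph above refers to.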

Next, to handle the case where the same file has been updated, any entries with the ID of file1.txt that already exist in the "file ID = appearance position" field of the mixed bi-gram index are deleted (step S2). That is, every pair of the form 1=x (x is arbitrary) is deleted. If the "file ID = appearance position" field of a key becomes empty, the key itself is also deleted.
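Step S2 could be sketched as follows, assuming the index is a Python dictionary that maps each key to a list of (file ID, appearance position) pairs. The function name `remove_file` and this in-memory layout are illustrative assumptions.

```python
def remove_file(index: dict[str, list[tuple[int, int]]], file_id: int) -> None:
    """Delete every posting of `file_id`; drop keys whose posting list empties."""
    for key in list(index):  # copy the keys because entries may be deleted
        remaining = [(fid, pos) for fid, pos in index[key] if fid != file_id]
        if remaining:
            index[key] = remaining
        else:
            del index[key]


index = {"良い": [(1, 3), (2, 3)], "天気": [(1, 5)]}
remove_file(index, 1)  # file1.txt (ID 1) is about to be re-indexed
```

After the call only the posting of file 2 under 「良い」 remains, and the key 「天気」 is gone because file 1 was its only document.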

Next, the document in the file is read into a buffer, and index registration is performed while shifting one character at a time from the beginning (steps S3 to S7).

First, the two-character key at the beginning of file1.txt is registered. That is, the record with key 「今日」 ("today") and file ID = appearance position 1=0 is registered in the mixed bi-gram index storage unit 5 (step S4).

Next, keys longer than two characters are looked up in the search key index storage unit 4. That is, 「今日は」, 「今日は良」, 「今日は良い」, and so on are looked up in the search key index storage unit 4. The maximum number of characters to look up, for example eight, is determined in advance.

If a key longer than two characters exists in the search key index and satisfies conditions such as having been searched at least a certain number of times (for example, five) and having a last search date and time within a certain period before the present (for example, ten days), the key is registered in the mixed bi-gram index storage unit 5.
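The registration condition can be expressed as a small predicate. The sketch below uses the example thresholds from the text (five searches, ten days) as defaults; the names `SearchKeyRecord` and `is_promotable`, and the choice to treat both bounds as inclusive, are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SearchKeyRecord:
    search_count: int        # number of searches for this key
    last_searched: datetime  # date and time of the last search


def is_promotable(record: SearchKeyRecord, now: datetime,
                  min_count: int = 5,
                  max_age: timedelta = timedelta(days=10)) -> bool:
    """True if the key was searched often enough and recently enough."""
    return (record.search_count >= min_count
            and now - record.last_searched <= max_age)


now = datetime(2010, 7, 5)
recent_and_frequent = SearchKeyRecord(5, datetime(2010, 7, 1))
print(is_promotable(recent_and_frequent, now))
```

Combining a count threshold with a recency window keeps keys that were popular only in the past from occupying the mixed index indefinitely.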

For example, when the pointer has advanced three characters from the beginning of the document and the text at the pointer is 「良い天気です。」, the key 「良い天気」 ("nice weather") in the search key index shown in FIG. 2 is hit, so the data with key 「良い天気」 and file ID = appearance position 1=3 is registered in the mixed bi-gram index storage unit 5 (step S5).

The pointer is advanced by one character (step S6), and registration is complete when the pointer reaches the end of the file (step S7: NO).
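The registration loop of steps S3 to S7 might look like the following Python sketch. It assumes the index is a dictionary of (file ID, appearance position) lists and that `frequent_keys` already contains only those search keys that satisfy the frequency condition; the function and parameter names are hypothetical, and the deletion of old entries in step S2 is left out.

```python
def register_file(index: dict[str, list[tuple[int, int]]], file_id: int,
                  text: str, frequent_keys: set[str],
                  n: int = 2, max_len: int = 8) -> None:
    """Add the n-character keys and the frequent longer keys of `text`."""
    for pos in range(len(text)):                   # S3, S6, S7: move pointer
        if pos + n <= len(text):                   # S4: ordinary n-char key
            index.setdefault(text[pos:pos + n], []).append((file_id, pos))
        for length in range(n + 1, max_len + 1):   # S5: longer candidates
            if pos + length > len(text):
                break
            candidate = text[pos:pos + length]
            if candidate in frequent_keys:
                index.setdefault(candidate, []).append((file_id, pos))


index: dict[str, list[tuple[int, int]]] = {}
frequent = {"良い天気"}
register_file(index, 1, "今日は良い天気です。", frequent)
register_file(index, 2, "今日は良い日です。", frequent)
print(index["良い天気"])
```

With these two sample files the long key 「良い天気」 receives a single posting for file 1 at position 3, matching the example in the description.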

Next, the operation of the search unit 3 will be described with reference to FIG. 4.
First, the search unit 3 obtains the search key that the user input via the input unit 1 (step S11).

Next, the search unit 3 searches the mixed bi-gram index storage unit 5 with the input search key (step S12).

If a key is hit by the input search key (step S13: YES), the file IDs registered in the "file ID = appearance position" field of that key are converted to file names and output to the output unit 2 (step S15).

If no key is hit by the input search key (step S13: NO), the search key is divided into two-character strings, the mixed bi-gram index storage unit 5 is searched with each of them, and the list of file IDs registered under the hit keys is obtained (step S14). The file IDs are then converted to file names and output to the output unit 2 (step S15).

Furthermore, if the search key is three or more characters long (step S16: YES), the search key is registered in the search key index storage unit 4 (step S17).
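The search flow of steps S11 to S15 can be sketched in Python as below. The dictionary layout, the function names, and the `file_names` mapping are assumptions. The description does not say how a key whose length is not a multiple of two is divided, so this sketch adds a final piece taken from the last two characters, which is purely an assumption. Registration of the key in the search key index (steps S16 and S17) is left out.

```python
def split_key(key: str, n: int = 2) -> list[tuple[str, int]]:
    """Split `key` into n-character pieces, each with its offset in the key.

    Assumption: if the length is not a multiple of n, a final piece is taken
    from the last n characters so that every character is covered.
    """
    offsets = list(range(0, len(key) - n + 1, n))
    if offsets and offsets[-1] + n < len(key):
        offsets.append(len(key) - n)
    return [(key[o:o + n], o) for o in offsets]


def search(index: dict[str, list[tuple[int, int]]],
           file_names: dict[int, str], key: str, n: int = 2) -> list[str]:
    if key in index:                               # S12, S13: YES
        file_ids = {fid for fid, _ in index[key]}
    else:                                          # S13: NO, then S14
        pieces = split_key(key, n)
        # Candidate start positions of the whole key, from the first piece.
        starts = set(index.get(pieces[0][0], [])) if pieces else set()
        for piece, offset in pieces[1:]:
            postings = set(index.get(piece, []))
            starts = {(fid, pos) for fid, pos in starts
                      if (fid, pos + offset) in postings}
        file_ids = {fid for fid, _ in starts}
    return [file_names[fid] for fid in sorted(file_ids)]   # S15


index = {"今日": [(1, 0), (2, 0)], "良い": [(1, 3), (2, 3)],
         "い日": [(2, 4)], "天気": [(1, 5)]}
file_names = {1: "file1.txt", 2: "file2.txt"}
print(search(index, file_names, "良い天気"))
```

Once 「良い天気」 itself has been added to the index as a key, the same call returns from the first branch without touching the two-character postings.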

Next, the update process of the search key index storage unit 4 will be described with reference to FIG. 5.
First, the search unit 3 obtains the search key that the user input via the input unit 1 (step S21).

Next, the search key index storage unit 4 is searched with the obtained search key (step S22).

If a key in the search key index is hit by the obtained search key (step S23: YES), 1 is added to the "number of searches" of that key's record and the current date and time is registered in its "search date and time" (step S24).

If no key is hit by the obtained search key (step S23: NO), a record for that key is added to the search key index storage unit 4 and the current date and time is registered in its "search date and time" (step S25).
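The update of steps S21 to S25 amounts to an insert-or-increment on the search key index. The following Python sketch assumes a dictionary keyed by search key; the names are hypothetical, and the initial count of 1 for a new record is an assumption, since the description only states that the current date and time is registered.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SearchKeyRecord:
    search_count: int
    last_searched: datetime


def record_search(key_index: dict[str, SearchKeyRecord], key: str,
                  now: datetime, n: int = 2) -> None:
    """Update the usage statistics of `key` after a search."""
    if len(key) <= n:        # only keys longer than n characters are tracked
        return
    record = key_index.get(key)                    # S22
    if record is not None:                         # S23: YES, then S24
        record.search_count += 1
        record.last_searched = now
    else:                                          # S23: NO, then S25
        key_index[key] = SearchKeyRecord(search_count=1, last_searched=now)


key_index: dict[str, SearchKeyRecord] = {}
record_search(key_index, "良い天気", datetime(2010, 7, 1))
record_search(key_index, "良い天気", datetime(2010, 7, 5))
record_search(key_index, "今日", datetime(2010, 7, 5))  # two characters: ignored
```

The index creation unit can later read these counts and dates to decide which long keys to add to the mixed bi-gram index.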

As described above, according to the present embodiment, the mixed bi-gram index holds, in addition to the keys of an ordinary bi-gram index, frequently used search keys themselves as index keys. In an environment where the same search key is searched many times, this improves search speed without causing search omissions. It is particularly effective in environments where the same character strings are often used as search keys, such as searching in-house documents.

The effect of the present invention will be described using a specific example in which 「良い天気」 ("nice weather") is input as the search key. The contents of the mixed bi-gram index are assumed to be as shown in FIG. 2.

First, if the four-character search key 「良い天気」 is not registered in the mixed bi-gram index storage unit 5, the search key is divided into two-character strings and the mixed bi-gram index storage unit 5 is searched with each of them (an ordinary N-gram search).

As a result, two records are hit: key 「良い」 ("nice") with file ID = appearance position 1=3, 2=3, and key 「天気」 ("weather") with file ID = appearance position 1=5. Among these, entries are sought that share a file ID and whose appearance positions follow one another in order. In this case, the appearance position of 「天気」 must be two characters after that of 「良い」. The entry 1=3 of key 「良い」 and the entry 1=5 of key 「天気」 satisfy this condition, so it can be determined that the file with file ID 1 (file1.txt) includes 「良い天気」.
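The position check in this example can be reproduced with a few lines of Python. The function name `files_with_adjacent_keys` and the representation of postings as (file ID, appearance position) pairs are assumptions made for this illustration.

```python
def files_with_adjacent_keys(first: list[tuple[int, int]],
                             second: list[tuple[int, int]],
                             gap: int) -> list[int]:
    """File IDs in which `second` occurs exactly `gap` characters after `first`."""
    second_set = set(second)
    return sorted({fid for fid, pos in first if (fid, pos + gap) in second_set})


good = [(1, 3), (2, 3)]   # postings of 「良い」: 1=3, 2=3
weather = [(1, 5)]        # postings of 「天気」: 1=5
hits = files_with_adjacent_keys(good, weather, gap=2)
print(hits)
```

File 2 also contains 「良い」 at position 3, but it has no 「天気」 at position 5, so only file ID 1 is returned.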

As the index grows, thousands of records may be hit for each key, and the search becomes slow. In contrast, according to the present embodiment, when the search key 「良い天気」 is used frequently it is registered in the mixed bi-gram index, so the ordinary N-gram search processing described above can be skipped and the search is faster.

Part or all of the above embodiment can also be described as in the following supplementary notes, but is not limited to them.
(Supplementary note 1) A document search apparatus comprising:
an index creation unit that creates a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a search unit that outputs a list of documents including an input search key by referring to the mixed N-gram index,
wherein the search unit first searches the mixed N-gram index with the search key itself; when a key matching the search key exists, outputs a list of documents including that key; and when no key matching the search key exists, divides the search key into character strings of N characters, searches the mixed N-gram index with each of the divided character strings, and, when a key matching each character string exists, determines whether each document includes the search key based on the appearance positions of the keys in the documents including them, and outputs a list of documents determined to include the search key.

(Supplementary note 2) The document search apparatus according to supplementary note 1, wherein
the search unit creates a search key index that uses, as keys, the character strings used as keywords in searches and holds the number of times each keyword has been used, and
the index creation unit refers to the search key index and registers keys of N+1 or more characters whose number of uses exceeds a certain condition as keys of the mixed N-gram index.

(Supplementary note 3) A document search method comprising:
a step of creating a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a step of outputting a list of documents including an input search key by referring to the mixed N-gram index,
wherein, in the step of outputting the list of documents, the mixed N-gram index is first searched with the search key itself; when a key matching the search key exists, a list of documents including that key is output; and when no key matching the search key exists, the search key is divided into character strings of N characters, the mixed N-gram index is searched with each of the divided character strings, and, when a key matching each character string exists, whether each document includes the search key is determined based on the appearance positions of the keys in the documents including them, and a list of documents determined to include the search key is output.

(Supplementary note 4) A program that causes a computer to function as:
an index creation unit that creates a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a search unit that outputs a list of documents including an input search key by referring to the mixed N-gram index,
wherein the search unit first searches the mixed N-gram index with the search key itself; when a key matching the search key exists, outputs a list of documents including that key; and when no key matching the search key exists, divides the search key into character strings of N characters, searches the mixed N-gram index with each of the divided character strings, and, when a key matching each character string exists, determines whether each document includes the search key based on the appearance positions of the keys in the documents including them, and outputs a list of documents determined to include the search key.

1 input unit, 2 output unit, 3 search unit, 4 search key index storage unit, 5 mixed bi-gram index storage unit, 6 index creation unit, 10 document search apparatus

Claims

1. A document search apparatus comprising:
an index creation unit that creates a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a search unit that outputs a list of documents including an input search key by referring to the mixed N-gram index,
wherein the search unit first searches the mixed N-gram index with the search key itself; when a key matching the search key exists, outputs a list of documents including that key; and when no key matching the search key exists, divides the search key into character strings of N characters, searches the mixed N-gram index with each of the divided character strings, and, when a key matching each character string exists, determines whether each document includes the search key based on the appearance positions of the keys in the documents including them, and outputs a list of documents determined to include the search key.

2. The document search apparatus according to claim 1, wherein
the search unit creates a search key index that uses, as keys, the character strings used as keywords in searches and holds the number of times each keyword has been used, and
the index creation unit refers to the search key index and registers keys of N+1 or more characters whose number of uses exceeds a certain condition as keys of the mixed N-gram index.

3. A document search method comprising:
a step of creating a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a step of outputting a list of documents including an input search key by referring to the mixed N-gram index,
wherein, in the step of outputting the list of documents, the mixed N-gram index is first searched with the search key itself; when a key matching the search key exists, a list of documents including that key is output; and when no key matching the search key exists, the search key is divided into character strings of N characters, the mixed N-gram index is searched with each of the divided character strings, and, when a key matching each character string exists, whether each document includes the search key is determined based on the appearance positions of the keys in the documents including them, and a list of documents determined to include the search key is output.

4. A program that causes a computer to function as:
an index creation unit that creates a mixed N-gram index which uses, as a key, either a character string of N characters (N is a natural number) generated by dividing a document or a character string of N+1 or more characters whose frequency of use as a keyword in searches exceeds a certain condition, and which holds an identifier of each document including the key and the appearance position of the key in that document; and
a search unit that outputs a list of documents including an input search key by referring to the mixed N-gram index,
wherein the search unit first searches the mixed N-gram index with the search key itself; when a key matching the search key exists, outputs a list of documents including that key; and when no key matching the search key exists, divides the search key into character strings of N characters, searches the mixed N-gram index with each of the divided character strings, and, when a key matching each character string exists, determines whether each document includes the search key based on the appearance positions of the keys in the documents including them, and outputs a list of documents determined to include the search key.