JPH07319895A

JPH07319895A - Device and method for retrieving document

Info

Publication number: JPH07319895A
Application number: JP6106406A
Authority: JP
Inventors: Isamu Iwai; 勇岩井; Kenichi Nogami; 謙一野上; Yukio Nakamoto; 幸夫中本; Toshihiro Ozaki; 敏宏尾崎; Toshie Yuzawa; 敏惠湯澤
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1994-05-20
Filing date: 1994-05-20
Publication date: 1995-12-08

Abstract

PURPOSE: To retrieve a target document at high speed even when the search key is a compound of words and characters or a sentence, and even when the volume of documents to be searched is large. CONSTITUTION: When a keyword for a document is input from an input unit 220, a keyword extraction unit 224 decomposes the keyword into a sequence of keyword IDs. A keyword matching unit 226 searches the index in an index storage buffer to find the document IDs that contain all of the obtained keyword IDs. An adjacency information matching unit 227 then refers to the adjacency information index in an adjacency information storage buffer 273 and, among the adjacency information of the document IDs found, retrieves those document IDs whose keyword ID sequence contains the same sequence as the keyword IDs of the input keyword; these are returned as the search answer. Because not all documents need to be scanned, retrieval speed can be improved considerably.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document retrieval device that retrieves documents containing a search keyword, and in particular to a document retrieval method for the case where a search keyword consisting of two or more words is input.

[0002]

[Prior Art] Some conventional database search devices adopt a full-text search method, in which a target document can be retrieved by any character string occurring in the documents to be searched. To search a large number of documents at high speed, such a device must build an index in a preprocessing step. The index is produced by extracting every word and character from all of the documents to be searched, and it records which documents contain each extracted word or character. When the user's search key is a word or character present in the index built during preprocessing, the target document can be retrieved quickly by referring to the index. However, users sometimes enter a search key that is a compound of words and characters, or a sentence. In that case the database search device splits the input search key into words and characters and retrieves, by referring to the index, the documents that contain the resulting words and characters. Some devices additionally determine by character string matching whether the same words or characters as the search key actually appear in the original documents, and return only the documents that contain them.

[0003] Thus, in the conventional method, when the search key entered by the user is a compound of words and characters or a sentence, the target document is found by using character string matching to check whether the search key appears in the original documents. The larger the documents to be searched, the more text the device must scan, and the time this takes reduces the search speed considerably.

[0004]

[Problem to Be Solved by the Invention] In a conventional database search device that adopts the full-text search method, when the search key entered by the user is a compound of words and characters or a sentence, the target document is found by using character string matching or the like to check whether the search key appears in the original documents. The larger the documents to be searched, the more text the device must scan, and the time this takes reduces the search speed considerably.

[0005] Accordingly, an object of the present invention is to eliminate the above drawback and to provide a document retrieval device and a document retrieval method that can retrieve a target document at high speed even when the search key is a compound of words and characters or a sentence and the number of documents to be searched is large.

[0006]

[Means for Solving the Problem] The present invention is a document retrieval device that holds index information associating each search key with the identification numbers of the documents to be searched that contain that search key, and that has a function whereby, when a search keyword of two or more words is input, the search keyword is decomposed into word units, each word unit is replaced with an identification number, and the documents containing all of the identification numbers that make up the search keyword are retrieved from the index information. The device comprises: adjacency information creation means for decomposing, for each document to be searched, the entire content of the document into word units, replacing the word units with identification numbers, and creating adjacency information in which those numbers are arranged in word order; search means for searching the adjacency information corresponding to the documents retrieved from the index information for a sequence identical to the sequence of identification numbers obtained by decomposing the input search keyword; and output means for outputting the documents found by the search means to contain that sequence as the search answer documents.

[0007]

[Operation] In the document retrieval device of the present invention, the adjacency information creation means decomposes, for each document to be searched, the entire content of the document into word units, replaces the word units with identification numbers, and creates adjacency information in which those numbers are arranged in word order. The search means searches the adjacency information corresponding to the documents retrieved from the index information for a sequence identical to the sequence of identification numbers obtained by decomposing the input search keyword. The output means outputs the documents found by the search means to contain that sequence as the search answer documents. As a result, the search means no longer needs to scan the entire content of the documents to be searched, so the search speed increases and search efficiency improves.

[0008]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a schematic block diagram showing an embodiment of a document retrieval device of the present invention, to which the document retrieval method of the present invention is applied, and of a document creation device of the present invention equipped with that document retrieval device. The apparatus comprises: a control device 1, made up of a CPU, memory and the like, which controls document retrieval and document creation; an input device 2, such as a keyboard and mouse, with which the user enters search keys and the instructions and information for search operations; a display device 3, such as an LCD, which displays the search keys and search operation instructions entered with the input device 2, the search results, and the content of retrieved documents; an external storage device 4, which stores the documents to be searched (including saved documents produced by document creation), the adjacency information and the like; and a division dictionary 5 (held in memory), which is used to split documents into word units, that is, words and characters, when the adjacency information is created.

[0009] FIG. 2 is a block diagram showing a detailed example of the control device shown in FIG. 1. The control device 1 comprises processing units including an initialization unit 201, a document capture unit 202, a word extraction unit 203, a word frequency counting unit 204, a keyword sorting unit 205, an adjacency information conversion unit 206, an optimal data conversion unit 207, an adjacency information saving unit 208, a keyword list saving unit 209, an input unit 220, an index reading unit 221, a keyword list reading unit 222, an adjacency information reading unit 223, a keyword extraction unit 224, a keyword adjacency information conversion unit 225, a keyword matching unit 226, an adjacency information matching unit 227, a search answer unit 228, and an output unit 229. It also comprises buffers including a document number storage buffer 250, a document storage buffer 251, a divided word storage buffer 252, a frequency storage unit 253, a frequency sort storage buffer 254, an optimal data storage buffer 255, an index storage buffer 271, a keyword list storage buffer 272, an adjacency information storage buffer 273, a search key character string buffer 274, a keyword storage buffer 275, an adjacency information conversion buffer 276, a document ID storage buffer 277, and a search answer buffer 278.

[0010] Next, the processing performed by each functional block of the control device 1 is described. The initialization unit 201 initializes each buffer. The document capture unit 202 reads document numbers and original document data from the external storage device 4 and stores them in the document number storage buffer 250 and the document storage buffer 251, respectively. The word extraction unit 203 uses the division dictionary 5 to split the documents held in the document storage buffer 251 into keywords such as words and characters. The word frequency counting unit 204 counts the number of occurrences of each keyword held in the divided word storage buffer 252. The keyword sorting unit 205 calculates the frequency of occurrence from the keywords and their occurrence counts held in the frequency storage unit 253, and sorts the keywords in descending order of frequency. The adjacency information conversion unit 206 converts the data held in the divided word storage buffer 252 into keyword IDs. The optimal data conversion unit 207 represents each keyword ID with an optimal number of bits. The adjacency information saving unit 208 stores in the external storage device 4 the adjacency information held in the optimal data storage buffer 255 and the document numbers held in the document number storage buffer 250. The keyword list saving unit 209 stores in the external storage device 4 the keyword list held in the frequency sort storage buffer 254. The input unit 220 manages the search operations that the user performs with the input device 2. The index reading unit 221 reads into the index storage buffer 271 the index, stored in the external storage device 4, that indicates which documents each keyword appears in. The keyword list reading unit 222 reads from the external storage device 4 into the keyword list storage buffer 272 the keyword list, in which the keywords extracted from the documents to be searched are assigned keyword IDs "0, 1, 2, ..." in descending order of their frequency of occurrence in those documents. The adjacency information reading unit 223 reads the adjacency information index stored in the external storage device 4 into the adjacency information storage buffer 273. The keyword extraction unit 224 splits the user's search key, held in the search key character string buffer 274, into keyword units, namely the units in which the keywords extracted from the documents to be searched have been indexed, and stores the resulting keywords in the keyword storage buffer 275. The keyword adjacency information conversion unit 225 converts the keywords held in the keyword storage buffer 275 into keyword IDs, in the order in which they occur in the search key held in the search key character string buffer 274, using the keyword list held in the keyword list storage buffer 272, which assigns a keyword ID to each keyword. The keyword matching unit 226 matches the keywords making up the user's search key, held in the keyword storage buffer 275, against the index held in the index storage buffer 271, which records the document IDs containing each keyword, and extracts the document IDs that contain all of the keywords in the keyword storage buffer 275. The adjacency information matching unit 227 checks whether the keyword ID sequence held in the adjacency information conversion buffer 276 occurs within the per-document-ID index held in the adjacency information storage buffer 273, and extracts every document ID that matches. The extracted document IDs are stored in the search answer buffer 278. The search answer unit 228 obtains the result stored in the search answer buffer 278. The output unit 229 displays the search result on the display device 3.

[0011] Next, the flow of operation of this embodiment is described with reference to the flowcharts of FIGS. 3, 4 and 5. First, the operation for creating the keyword list is described with reference to the flowchart of FIG. 3. The initialization unit 201 of the control device 1 starts in step 301 and clears each buffer. Next, the document capture unit 202 starts in step 302, reads the data of an original document stored in the external storage device 4, and stores it in the document storage buffer 251. The word extraction unit 203 then starts in step 303 and uses the division dictionary 5 to split the document held in the document storage buffer 251 into keyword units such as words and characters. As the example in FIG. 6 shows, the division dictionary consists of the keywords into which text can be split. The document split into keyword units by the word extraction unit 203 is stored in the divided word storage buffer 252 in keyword-unit form, as shown in FIG. 8. In step 304 the document capture unit 202 determines whether there are more documents to read; if so, processing returns to step 302, and if not, it proceeds to step 305.
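
The dictionary-driven splitting of step 303 might be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the division dictionary is matched, so the sketch assumes greedy longest-match with a fall-back to single characters. The function name `split_into_keywords` and the sample dictionary contents are hypothetical.

```python
def split_into_keywords(text, dictionary):
    """Split text into keywords by greedy longest match (an assumption).

    Characters that start no dictionary entry become one-character keywords.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    keywords = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                keywords.append(candidate)
                pos += size
                break
    return keywords


# Hypothetical contents standing in for division dictionary 5.
DIVISION_DICTIONARY = {"文書", "作成", "検索", "の"}
```

With this dictionary, a string such as "文書の作成" is split into the three keyword units used in the worked example later in the description.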

[0012] When all documents have been split into keywords, the word frequency counting unit 204 starts in step 305, counts the number of occurrences of each keyword held in the divided word storage buffer 252, and stores each keyword and its occurrence count in the frequency storage unit 253, as shown in FIG. 9. Next, the keyword sorting unit 205 starts. In step 306 it calculates, from the occurrence counts and the number of keywords held in the frequency storage unit 253, how often each keyword appears in the documents. In step 307 it sorts the keywords in descending order of frequency and numbers the sorted keywords sequentially from the first as 0, 1, 2, 3, ..., as shown in FIG. 7 (this number is called the "keyword ID"). In step 308 it stores the result in the frequency sort storage buffer 254, as shown in FIG. 10.

[0013] For example, if the keyword "の" (the Japanese particle "no") has the highest frequency of occurrence among all the keywords found, it is assigned keyword ID 0. Conversely, a keyword with a low frequency of occurrence is assigned a large keyword ID such as 3,000. The keyword IDs and keywords held in the frequency sort storage buffer 254 are together called the keyword list. When the keyword IDs have been assigned, the keyword list saving unit 209 starts in step 309, stores the keyword list held in the frequency sort storage buffer 254 in the external storage device 4, and processing ends.
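
The frequency counting and ID assignment of steps 305 to 308 can be sketched as below. This is a minimal sketch, assuming documents have already been split into keyword lists; the function name `build_keyword_list` and the tie-breaking rule for equally frequent keywords are assumptions, since the text does not specify them.

```python
from collections import Counter


def build_keyword_list(tokenized_documents):
    """Return {keyword: keyword_id}; ID 0 goes to the most frequent keyword."""
    counts = Counter()
    for keywords in tokenized_documents:
        counts.update(keywords)
    # Descending count; ties are broken by code point order so that the
    # result is deterministic (this tie-break is an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {keyword: keyword_id
            for keyword_id, (keyword, _count) in enumerate(ranked)}
```

Giving the smallest IDs to the most frequent keywords means that the most common entries in the adjacency information are also the ones with the shortest encodings.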

[0014] Next, a specific example of creating the adjacency information is described with reference to the flowchart of FIG. 4. First, the initialization unit 201 starts in step 401 and clears each buffer. Next, the document capture unit 202 starts in step 402, reads an original document and its document number from the external storage device 4, and stores them in the document storage buffer 251 and the document number storage buffer 250, respectively. In step 403 the keyword list reading unit 222 starts and loads the keyword list stored in the external storage device 4 into the keyword list storage buffer 272. In step 404 the word extraction unit 203 starts and uses the division dictionary 5 to split the document held in the document storage buffer 251 into keyword units such as words and characters. The document thus split is stored in keyword-unit form in the divided word storage buffer 252. Next, the adjacency information conversion unit 206 starts in step 405, compares the keyword-split document held in the divided word storage buffer 252 with the keyword list held in the keyword list storage buffer 272, and converts each keyword into the keyword ID assigned to the identical keyword in the keyword list. At this point the optimal data conversion unit 207 also starts in step 405, converts each keyword ID into the bit representation best suited to it, and stores the converted keyword IDs in the optimal data storage buffer 255.
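
The keyword-to-ID conversion of step 405 amounts to a lookup that preserves word order. The sketch below assumes each document is already a list of keywords and that every keyword appears in the keyword list (true when the list was built from the same documents); the names `to_adjacency_info` and `build_adjacency_index` are hypothetical.

```python
def to_adjacency_info(keywords, keyword_list):
    """Replace each keyword with its keyword ID, keeping the word order."""
    # Raises KeyError for a keyword missing from the keyword list.
    return [keyword_list[keyword] for keyword in keywords]


def build_adjacency_index(tokenized_documents, keyword_list):
    """Map each document ID to the keyword ID sequence of that document."""
    return {doc_id: to_adjacency_info(keywords, keyword_list)
            for doc_id, keywords in tokenized_documents.items()}
```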

[0015] The optimal bit representation uses one bit of each byte as a flag, so that a keyword ID is expressed in one byte or in two bytes as required. In the example shown in FIG. 13, a leading bit of 1 in a byte indicates that the value continues into the next byte. For example, keyword ID 10₍₁₀₎ is expressed as 0a₍₁₆₎, and keyword ID 400₍₁₀₎ is expressed as 8310₍₁₆₎. When all keywords held in the divided word storage buffer 252 have been converted into keyword IDs, the adjacency information saving unit 208 starts in step 406 and stores in the external storage device 4 the adjacency information held in the optimal data storage buffer 255 together with the document number, held in the document number storage buffer 250, of the original document used to create that adjacency information (this number is hereinafter called the "document ID"). What the adjacency information saving unit 208 stores in the external storage device 4, as shown in FIG. 11, is called the adjacency information index.
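
The two worked values are consistent with a scheme that stores seven data bits per byte, most significant group first, and sets the top bit of every byte except the last: 400 = 3 × 128 + 16, giving the bytes 0x83 and 0x10. The sketch below implements that reading. The function names are hypothetical, and allowing more than two bytes for IDs above 16,383 is an assumption that goes beyond what the text states.

```python
def encode_keyword_id(keyword_id):
    """Encode a non-negative ID as 7-bit groups, most significant first.

    Every byte except the last has its top bit set, meaning
    "the value continues in the next byte".
    """
    if keyword_id < 0:
        raise ValueError("keyword IDs are non-negative")
    groups = [keyword_id & 0x7F]
    keyword_id >>= 7
    while keyword_id:
        groups.append(keyword_id & 0x7F)
        keyword_id >>= 7
    groups.reverse()
    return bytes([group | 0x80 for group in groups[:-1]] + [groups[-1]])


def decode_keyword_ids(data):
    """Decode a byte string made by concatenating encoded keyword IDs."""
    ids = []
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # top bit clear: this byte ends the current ID
            ids.append(value)
            value = 0
    return ids
```

Because each ID is self-delimiting, the adjacency information of a document can be stored as one byte string with no separators between keyword IDs.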

[0016] Next, a specific example of the search operation of the device shown in FIG. 1 is described with reference to the flowchart of FIG. 5. First, in step 501 the initialization unit 201 initializes each buffer. In step 502 the index reading unit 221 starts and loads from the external storage device 4 into the index storage buffer 271 the index indicating the keywords contained in the documents to be searched. The index storage buffer 271 thus holds a keyword index that records each keyword together with the document numbers (document IDs) of the documents containing it. As shown in FIG. 14, for example, each keyword index entry has the form "keyword character string · document IDs", where the document IDs field lists the numbers of the documents that contain the keyword character string. In the example of FIG. 14, the keyword "文書" ("document") is contained in the documents with document IDs 1, 4, 7, 9, 17, and so on.
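
A keyword index of this "keyword character string · document IDs" form is what is usually called an inverted index. The sketch below shows one way to build it, assuming documents are given as keyword lists keyed by document ID; the function name `build_keyword_index` and the sample data in the usage note are hypothetical.

```python
def build_keyword_index(tokenized_documents):
    """Map each keyword to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, keywords in tokenized_documents.items():
        # set() ensures a document is listed once per keyword.
        for keyword in set(keywords):
            index.setdefault(keyword, []).append(doc_id)
    return {keyword: sorted(doc_ids) for keyword, doc_ids in index.items()}
```

For instance, with documents 1 and 4 containing "文書" and document 7 not containing it, the entry for "文書" would list the document IDs 1 and 4.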

[0017] Next, the keyword list reading unit 222 starts in step 503, reads the keyword list from the external storage device 4, and stores it in the keyword list storage buffer 272. The keyword list storage buffer 272 holds the keyword corresponding to each keyword ID. As shown in FIG. 15, for example, it holds data of the form "keyword ID · keyword". In the example of FIG. 15, keyword ID 0 is "の". Next, the adjacency information reading unit 223 starts in step 504, reads the adjacency information index from the external storage device 4, and stores it in the adjacency information storage buffer 273. Here, the "keyword sequence" held for a document number is the text of that document itself. As shown in FIG. 12, the text is split into keyword units and each keyword is replaced with the number (keyword ID) assigned to it in the keyword list, so that the keyword sequence is expressed as keyword IDs.

[0018] Then, in step 505, it is determined whether the user has entered an [End] selection from the input device 2; if so, the device terminates. If the user does not select [End] but instead enters a search key from the input device 2, such as a keyboard, the following operations are performed. An example of such a search key is "文書の作成" ("creation of a document"). When a search key is entered, the input unit 220 stores it in the search key character string buffer 274. FIG. 16 shows an example of the contents of the search key character string buffer 274. Once the search key has been entered, the keyword extraction unit 224 starts in step 507 and splits the search key character string into the keyword units registered in the keyword index. Here, the search key character string is split as "文書／の／作成" ("document / of / creation"). The resulting keywords are stored in the keyword storage buffer 275, as shown in FIG. 17. The keyword adjacency information conversion unit 225 then starts in step 509 and, using the keyword list held in the keyword list storage buffer 272, converts the keywords held in the keyword storage buffer 275, for example "文書／の／作成", into keyword IDs, which it stores in the adjacency information conversion buffer 276, as shown in FIG. 18. In this example the result is "23,001 · 0 · 23,000₍₁₀₎". Matching is performed next.

[0019] In the matching process, the keyword matching unit 226 first starts in step 510, matches the search keywords held in the keyword storage buffer 275 against the index held in the index storage buffer 271, and extracts the document IDs of the documents that contain the keywords. In this example, document IDs 1 and 4 are extracted as the documents in which the keywords "文書", "の" and "作成" all appear. The document IDs extracted by the keyword matching unit 226 are stored in the document ID storage buffer 277, as shown in FIG. 19. In step 511 it is determined whether any document ID was extracted; if none was, processing returns to step 505, and if one was, it proceeds to step 512.
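
Keyword matching in step 510 is an intersection of the document ID lists of all query keywords. The sketch below assumes the keyword index is a mapping from keyword to a list of document IDs; the function name `match_keywords` is hypothetical, and in the test data only the list for "文書" follows the values quoted from FIG. 14, while the other lists are invented for illustration.

```python
def match_keywords(keywords, keyword_index):
    """Return the sorted document IDs that contain every query keyword."""
    candidate_ids = None
    for keyword in keywords:
        doc_ids = set(keyword_index.get(keyword, ()))
        if candidate_ids is None:
            candidate_ids = doc_ids
        else:
            candidate_ids &= doc_ids
        if not candidate_ids:
            return []  # no document can contain all of the keywords
    return sorted(candidate_ids or [])
```

This stage ignores word order, which is why it can still return documents that do not contain the query as a phrase; the adjacency check that follows removes them.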

[0020] Next, the adjacency information matching unit 227 starts in step 512. It searches the adjacency information index entries for the document IDs held in the document ID storage buffer 277, looks for documents whose adjacency information contains the same sequence as the keyword ID sequence held in the adjacency information conversion buffer 276, extracts only the matching document IDs, and stores them in the search answer buffer 278, as shown in FIG. 20. Once the matching document IDs have been stored in the search answer buffer 278, the search answer unit 228 starts in step 514 and obtains the document IDs held in the search answer buffer 278 as the search answer. In the example of FIG. 14, the search answer is document ID 4. Then, in step 515, the output unit 229 starts and displays the search result on the display device 3. In step 516 it is determined whether the search result is to be narrowed down further; if so, processing returns to step 505. To start a new search instead of narrowing down, the user selects [Clear] with the input device 2, and processing returns to step 501.
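
The adjacency check of step 512 asks whether the query's keyword ID sequence occurs as a contiguous run inside a candidate document's keyword ID sequence. The sketch below uses a straightforward sliding comparison over decoded ID lists; the function names and the sample sequences in the test are hypothetical, and the text does not say which matching algorithm the device uses.

```python
def contains_sequence(sequence, query):
    """Return True if `query` occurs in `sequence` as a contiguous run."""
    n, m = len(sequence), len(query)
    if m == 0:
        return True
    return any(sequence[i:i + m] == query for i in range(n - m + 1))


def match_adjacency(candidate_ids, query_ids, adjacency_index):
    """Keep only the candidate documents that contain the query sequence."""
    return [doc_id for doc_id in candidate_ids
            if contains_sequence(adjacency_index[doc_id], query_ids)]
```

Only the candidates produced by keyword matching are examined, and the comparison works on integer IDs, not on the original text.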

[0021] According to this embodiment, when the user inputs a compound of words and characters, or a sentence, as a document search keyword, the keyword is decomposed into a plurality of keyword IDs, and the document IDs containing all of the resulting keyword IDs are extracted from the keyword index shown in FIG. 14. The keyword ID sequence of each of these documents, obtained from an adjacency information index such as that shown in FIG. 11, is then compared with the keyword ID sequence of the input keyword, and the document IDs having the same keyword ID sequence as the input keyword are retrieved. The documents indicated by the retrieved document IDs are returned as the search response. It is therefore no longer necessary to scan every document to be searched for the compound word or sentence that makes up the input keyword, so the search speed is markedly increased and the search efficiency is improved. Consequently, a document creation device equipped with such an efficient document search device can quickly search the saved documents stored in the external storage device 4 or the like, which improves document creation efficiency.
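The two-stage search summarized above can be illustrated end to end with a short sketch. It assumes that every document has already been converted into a list of keyword IDs in word order. The names `build_keyword_index`, `search`, and `full_scan`, as well as the sample data, are hypothetical. The full scan is included only to show that the indexed search returns the same answer while checking fewer documents.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_keyword_index(adjacency: Dict[int, List[int]]) -> Dict[int, Set[int]]:
    """Map each keyword ID to the set of document IDs that contain it."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for doc_id, keyword_ids in adjacency.items():
        for keyword_id in keyword_ids:
            index[keyword_id].add(doc_id)
    return index


def has_run(seq: List[int], query: List[int]) -> bool:
    """True if `query` appears as consecutive elements of `seq`."""
    m = len(query)
    return any(seq[i:i + m] == query for i in range(len(seq) - m + 1))


def search(query: List[int], index: Dict[int, Set[int]],
           adjacency: Dict[int, List[int]]) -> List[int]:
    if not query:
        return []
    # Stage 1: documents that contain every keyword ID of the query.
    candidates = set.intersection(*(set(index.get(k, set())) for k in query))
    # Stage 2: keep candidates in which the IDs are adjacent and in query order.
    return sorted(d for d in candidates if has_run(adjacency[d], query))


def full_scan(query: List[int], adjacency: Dict[int, List[int]]) -> List[int]:
    """Baseline that checks every document, for comparison only."""
    return sorted(d for d, seq in adjacency.items() if query and has_run(seq, query))


# Hypothetical collection: document ID -> keyword IDs in word order.
adjacency = {1: [5, 2, 7, 3], 2: [8, 8, 1], 4: [9, 2, 3, 7, 1], 6: [3, 2, 8]}
index = build_keyword_index(adjacency)
result = search([2, 3], index, adjacency)
print(result)  # [4]
```

In this sample, stage 1 narrows four documents to three candidates, and document 2 is never compared against the query sequence.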

【００２２】[0022]

[Effect of the Invention] As described above, according to the present invention, even when the search key is a compound of words and characters or a sentence, and even when the number of documents to be searched is large, the target document can be retrieved at high speed. This improves search efficiency and, in turn, document creation efficiency.

[Brief description of drawings]

FIG. 1 is a schematic block diagram showing an embodiment of a document search device of the present invention, to which the document search method of the present invention is applied, and of a document creation device of the present invention equipped with this document search device.

FIG. 2 is a block diagram showing a detailed example of the control device shown in FIG. 1.

FIG. 3 is a flowchart showing the flow of processing for creating a keyword list in the apparatus of FIG. 1.

FIG. 4 is a flowchart showing the flow of processing for creating adjacency information in the apparatus shown in FIG. 1.

FIG. 5 is a flowchart showing the flow of document search processing in the apparatus shown in FIG. 1.

FIG. 6 is a diagram showing an example of the division dictionary used in the apparatus shown in FIG. 2.

FIG. 7 is a diagram showing an example of keyword IDs assigned in order of appearance frequency by the apparatus shown in FIG. 1.

FIG. 8 is a diagram showing a storage example of the divided word storage buffer shown in FIG. 2.

FIG. 9 is a diagram showing a storage example of the frequency storage buffer shown in FIG. 2.

FIG. 10 is a diagram showing a storage example of the frequency sort storage buffer shown in FIG. 2.

FIG. 11 is a diagram showing a storage example of the frequency sort storage buffer shown in FIG. 2.

FIG. 12 is a diagram showing an example of the adjacency information index created by the apparatus shown in FIG. 1.

FIG. 13 is a diagram showing how the keyword IDs created by the apparatus shown in FIG. 1 are represented.

FIG. 14 is a diagram showing an example of the keyword index created by the apparatus shown in FIG. 1.

FIG. 15 is a diagram showing an example of the keyword list created by the apparatus shown in FIG. 1.

FIG. 16 is a diagram showing a storage example of the search key character string buffer shown in FIG. 2.

FIG. 17 is a diagram showing a storage example of the keyword storage buffer shown in FIG. 2.

FIG. 18 is a diagram showing a storage example of the adjacency information conversion buffer shown in FIG. 2.

FIG. 19 is a diagram showing a storage example of the document ID storage buffer shown in FIG. 2.

FIG. 20 is a diagram showing a storage example of the search response buffer shown in FIG. 2.

[Explanation of symbols]

1 ... Control device
2 ... Input device
3 ... Display device
4 ... External storage device
5 ... Division dictionary
201 ... Initialization unit
202 ... Document acquisition unit
203 ... Word cutout unit
204 ... Word frequency counting unit
205 ... Keyword sorting unit
206 ... Adjacency information conversion unit
207 ... Optimal data conversion unit
208 ... Adjacency information storage unit
209 ... Keyword list storage unit
220 ... Input unit
221 ... Index reading unit
222 ... Keyword list reading unit
223 ... Adjacency information reading unit
224 ... Keyword extraction unit
225 ... Keyword adjacency information conversion unit
226 ... Keyword matching unit
227 ... Adjacency information matching unit
228 ... Search response unit
229 ... Output unit
250 ... Document number storage buffer
251 ... Document storage buffer
252 ... Divided word storage buffer
253 ... Frequency storage unit
254 ... Frequency sort storage buffer
255 ... Optimal data storage buffer
271 ... Index storage buffer
256, 272 ... Keyword list storage buffer
273 ... Adjacency information storage buffer
274 ... Search key character string buffer
275 ... Keyword storage buffer
276 ... Adjacency information conversion buffer
277 ... Document ID storage buffer
278 ... Search response buffer

Continuation of the front page
(72) Inventor: Kenichi Nogami, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Toshihiro Ozaki, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Toshie Yuzawa, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo

Claims

[Claims]

1. A document search device that has index information in which each search key is associated with the identification numbers of the documents to be searched that contain that search key, and that has a function of, when a search keyword of two or more words is input, decomposing the search keyword into word units, replacing each word unit with an identification number, and then searching the index information for documents containing all of the identification numbers constituting the search keyword, the document search device comprising: adjacency information creating means for decomposing the entire contents of each document into word units and replacing the word units with identification numbers to create adjacency information arranged in word order; search means for searching the adjacency information corresponding to the documents retrieved from the index information for a sequence identical to the sequence of identification numbers obtained by decomposing the input search keyword; and output means for outputting, as search response documents, the documents in which the search means found the corresponding sequence.

2. The document search device according to claim 1, wherein, when an identification number is assigned to each word, one bit in one byte expressing the identification number is used as a flag.

3. The document search device according to claim 2, wherein, when an identification number is assigned to each word, the numbers are assigned within the range of natural numbers, such as 0, 1, 2, in descending order of the frequency of the words in the documents to be searched.

4. A document search method for retrieving a document containing an input search key from a plurality of documents to be searched, characterized in that: for each document to be searched, the entire contents of the document are decomposed into word units and the word units are replaced with identification numbers, so that adjacency information arranged in word order is created in advance; the input search key is decomposed into word units and each word unit is replaced with an identification number, so that the input search key is converted into a sequence of identification numbers; and a document having an identification number sequence identical to this sequence is then retrieved from the adjacency information.
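As an illustration of the identification number scheme recited in claims 2 and 3, the following sketch assigns IDs in descending order of word frequency and packs each ID into bytes with one flag bit per byte. The claims do not state what the flag signifies. The sketch assumes it marks whether another byte follows, which is one possible reading and not a statement of the patented encoding. All function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def assign_ids(words: Iterable[str]) -> Dict[str, int]:
    """Give ID 0 to the most frequent word, 1 to the next, and so on."""
    counts = Counter(words)
    # Ties are broken alphabetically so the result is deterministic (an assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


def encode_id(value: int) -> bytes:
    """Use 7 data bits per byte; the top bit is the assumed 'more bytes follow' flag."""
    if value < 0:
        raise ValueError("IDs must be non-negative")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # flag set: another byte follows
        else:
            out.append(low7)         # flag clear: this is the last byte
            return bytes(out)


def decode_ids(data: bytes) -> List[int]:
    """Decode a concatenation of encoded IDs back into a list of integers."""
    ids, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            ids.append(value)
            value, shift = 0, 0
    return ids


ids = assign_ids("a b a c a b".split())
packed = b"".join(encode_id(n) for n in [0, 127, 128, 300])
print(ids, decode_ids(packed))
```

Under this assumption, the most frequent words receive the smallest IDs and therefore occupy a single byte, which keeps the adjacency information compact.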