JPH03123973A

JPH03123973A - Document retrieval method

Info

Publication number: JPH03123973A
Application number: JP1262498A
Authority: JP
Inventors: Shinsuke Teramura; 信介寺村
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-10-06
Filing date: 1989-10-06
Publication date: 1991-05-27

Abstract

PURPOSE: To speed up document retrieval, even when a large number of documents are searched, by referring to an inverted file and calculating the file certainty only for documents that contain a keyword related to the search keyword. CONSTITUTION: A keyword connection table processing unit 4 generates a keyword connection table 8 that records, together with the necessary keywords, the links between related keywords and their degree of relevance. All keywords related to the given search keyword are extracted from the table 8, and a list of related keywords is generated. The search keyword entered by the user is itself included in this list. For each element of the related keyword list, the files containing that keyword are looked up in the inverted file 10, and the file certainty is calculated only for the documents related to these keywords. The document number of each document whose certainty has been calculated is written to a check list so that the calculation is not repeated for the same document. As a result, the amount of calculation is proportional only to the number of keyword connections.

Description

[Detailed description of the invention]

[Field of industrial application]

The present invention relates to the field of document management, and in particular to a document retrieval method for searching a large number of documents at high speed using keywords.

[Prior art]

Document retrieval devices have conventionally used various search methods, one of which relies on a keyword connection table that describes the relation information between keywords. In this method, the keyword connection table is first consulted to find the keywords related to the search keyword given by the user. From these keywords (including the search keyword itself), a file certainty, which is an evaluation value, is calculated for every document. Then either the documents whose file certainty is at or above a threshold set to the certainty requested by the user are selected, or documents are selected in descending order of file certainty up to the number of documents requested by the user, and the selection is returned as the search result. In this way, documents that are highly relevant to the search keyword are retrieved.
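
The conventional procedure can be sketched as follows. This is only an illustration: the source does not state how file certainty is computed, so the scoring rule here (sum of the relevance weights of the related keywords a document contains) is an assumption, as are the function names and the sample data.

```python
from typing import Dict, List, Optional, Set, Tuple

def file_certainty(doc_keywords: Set[str], weights: Dict[str, float]) -> float:
    # Assumed scoring rule (not given in the source): sum the relevance
    # weights of the related keywords that appear in the document.
    return sum(w for kw, w in weights.items() if kw in doc_keywords)

def conventional_search(
    documents: Dict[int, Set[str]],
    weights: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[Tuple[int, float]]:
    # Every registered document is scored, so the cost grows with the
    # number of registered documents.
    scored = [(doc_id, file_certainty(kws, weights)) for doc_id, kws in documents.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    if threshold is not None:
        # Selection by requested certainty: keep documents at or above it.
        return [pair for pair in scored if pair[1] >= threshold]
    # Selection by requested count: best top_k documents (all if top_k is None).
    return scored[:top_k]

# Hypothetical sample data: the search keyword has weight 1.0.
sample_documents = {1: {"retrieval", "index"}, 2: {"index"}, 3: {"printer"}}
sample_weights = {"retrieval": 1.0, "index": 0.5}
by_threshold = conventional_search(sample_documents, sample_weights, threshold=0.5)
by_count = conventional_search(sample_documents, sample_weights, top_k=1)
```

Document 3 contains no related keyword, yet it is still scored; this is the cost the invention sets out to avoid.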

[Problems to be solved by the invention]

With this processing method, however, the time needed to calculate file certainty is proportional to the number of registered documents, so processing takes a long time. The method is therefore unsuitable when a large number of documents are registered.

[Means for solving the problems]

The method is provided with a keyword connection table, which is created when document information is registered in a file and describes the relation information between keywords, and an inverted file, which indicates the relationship between registered documents and keywords. The inverted file is referenced so that file certainty is calculated only for documents containing keywords related to the search keyword, and documents are retrieved according to this file certainty.

[Operation]

Because file certainty is not calculated for documents that contain no keyword related to the search keyword, the time needed to calculate file certainty is proportional to the number of connections of the related keywords rather than to the number of registered documents. Retrieval can therefore be sped up even when a large number of documents are searched.

[Embodiment]

An embodiment of the present invention will be described with reference to the drawings.

FIG. 1 shows an example of the system configuration of a document retrieval device. Its configuration and operation are outlined below. First, when a registered document 2 is input, the keyword extraction unit 1 extracts its keywords and outputs the keywords and the information on the registered document 2 to the document information management unit 3, the keyword connection table processing unit 4, and the inverted file creation unit 5. The document information management unit 3 stores the extracted keywords and the bibliographic information 6 in a file 7, building a database in a form that can be used at search time.

As shown in FIG. 2, the keyword connection table processing unit 4 creates a keyword connection table 8, which records, together with the necessary keywords, the links between related keywords and their degree of relevance, and stores it in a file 9. The keyword connection table 8 has a list structure, as in the example of FIG. 2, and is sorted in descending order of relevance.
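
One possible in-memory shape for such a table is sketched below. The contents of FIG. 2 are not reproduced in this text, so the class name, the method names, the directed (one-way) storage of each connection, and the sample keywords are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class KeywordConnectionTable:
    """Per-keyword list of (related keyword, relevance), kept in descending relevance."""

    def __init__(self) -> None:
        self._links: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def set_relevance(self, keyword: str, related: str, relevance: float) -> None:
        # Drop any existing entry for this pair, append the new value,
        # then restore descending-relevance order (ties broken by name).
        links = [(k, r) for k, r in self._links[keyword] if k != related]
        links.append((related, relevance))
        links.sort(key=lambda pair: (-pair[1], pair[0]))
        self._links[keyword] = links

    def related(self, keyword: str) -> List[Tuple[str, float]]:
        # Return a copy so callers cannot disturb the sorted order.
        return list(self._links.get(keyword, []))

# Hypothetical connections for the keyword "retrieval".
table = KeywordConnectionTable()
table.set_relevance("retrieval", "index", 0.5)
table.set_relevance("retrieval", "search", 0.75)
table.set_relevance("retrieval", "index", 1.0)   # relevance updated later
```

Keeping each list sorted at update time means that reading the most relevant connections later needs no further sorting.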

As shown in FIG. 3, the inverted file creation unit 5 creates an inverted file 10 that associates each keyword with the registered documents 2, and stores it in a file 11. In other words, the inverted file 10 is a collection of pointers from each keyword to the documents that contain that keyword.
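
A minimal sketch of building such an inverted file is given below, assuming that each registered document is represented simply by a document number and a set of extracted keywords. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

def build_inverted_file(documents: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    # Map each keyword to the set of document numbers that contain it.
    inverted: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return dict(inverted)

# Hypothetical registered documents and their extracted keywords.
registered = {1: {"retrieval", "index"}, 2: {"index"}, 3: {"printer"}}
inverted_file = build_inverted_file(registered)
```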

Next, a document selection unit 12 is provided. The purpose of the document selection unit 12 is to use a group of keywords to extract the document files that come closest to the subject or concept the search user is asking for. For a user accessing the system from the keyboard 13, it outputs a list of keywords to the display 14. From this list the user selects the keywords needed for the requested subject, or chooses free keywords, and then enters a display and search request from the keyboard 13 again.

The document selection unit 12 consists of a request processing unit 15, a sorting unit 16, a display management unit 17, a keyword relevance calculation unit 18, and a file certainty calculation unit 19. The request processing unit 15 forwards the keywords received from the keyboard 13 to the keyword relevance calculation unit 18. The keyword relevance calculation unit 18 extracts from the keyword connection table 8 the keywords related to the forwarded keywords, together with their relation information. The extracted keyword group is sorted by the sorting unit 16 in descending order of relevance and output to the display management unit 17. The display management unit 17 outputs this group of related keywords to the display 14 and presents it to the user. Guided by this display, the user selects and enters any further keywords that are needed, and the final keyword group is sent to the request processing unit 15 together with a document selection request.

On receiving the document selection request, the request processing unit 15 transfers the keyword group to the file certainty calculation unit 19. At the same time, it instructs the keyword connection table processing unit 4 to update the weights of the relation information for the final keyword group. The file certainty calculation unit 19 uses the received keyword group, the keyword connection table 8, and the inverted file 10 to calculate the file certainty for the registered documents 2 in the file 7, and transfers the result to the sorting unit 16.

When the necessary file certainty calculations are complete, the results are sorted by the sorting unit 16 and shown on the display 14 as the search result.

The documents subject to the file certainty calculation in this embodiment are determined as follows. First, all keywords related to the given search keyword are extracted from the keyword connection table 8, and a list of related keywords is created. This list also includes the search keyword entered by the user. For each element of the related keyword list, the files containing that keyword are looked up in the inverted file 10, and file certainty is calculated only for the documents that are related. Each time a certainty is calculated, the document number is written to a check list so that the calculation is not performed more than once for the same document. As a result, the amount of calculation is proportional only to the number of connections of the related keywords.
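
The steps above can be sketched as follows. The source does not give the file certainty formula, so this sketch assumes a simple one (sum of the relevance weights of the related keywords in the document, with the search keyword itself weighted 1.0). The function name, argument shapes, and sample data are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

def restricted_search(
    search_keywords: Iterable[str],
    connections: Dict[str, List[Tuple[str, float]]],
    inverted_file: Dict[str, Set[int]],
    doc_keywords: Dict[int, Set[str]],
) -> List[Tuple[int, float]]:
    # Related keyword list, including the search keywords themselves.
    # Weight 1.0 for a search keyword is an assumption.
    weights: Dict[str, float] = {kw: 1.0 for kw in search_keywords}
    for kw in list(weights):
        for related, relevance in connections.get(kw, []):
            weights[related] = max(weights.get(related, 0.0), relevance)

    checked: Set[int] = set()            # check list of scored document numbers
    certainty: Dict[int, float] = {}
    for kw in weights:
        for doc_id in inverted_file.get(kw, ()):
            if doc_id in checked:
                continue                 # already scored via another keyword
            checked.add(doc_id)
            # Assumed scoring rule: sum of weights of related keywords present.
            certainty[doc_id] = sum(
                w for k, w in weights.items() if k in doc_keywords[doc_id]
            )
    return sorted(certainty.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical data: document 3 shares no related keyword and is never scored.
sample_connections = {"retrieval": [("search", 0.75), ("index", 0.5)]}
sample_docs = {1: {"retrieval", "index"}, 2: {"index"}, 3: {"printer"}, 4: {"search", "index"}}
sample_inverted = {"retrieval": {1}, "index": {1, 2, 4}, "search": {4}, "printer": {3}}
ranking = restricted_search(["retrieval"], sample_connections, sample_inverted, sample_docs)
```

Only documents reachable through the inverted file entries of the related keywords are visited, and the check list ensures that a document listed under several keywords is scored once.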

As the number of documents in the database grows, the number of connections between keywords can readily be expected to grow as well, so even with the method of this embodiment the number of documents to be scored may become large. In that case, only the top n related keywords are taken when related keywords are extracted from the keyword connection table 8. This n represents the search depth: the larger n is, the further the calculation extends to documents that contain only keywords of low relevance. Instead of specifying the search depth n, it is also possible to extract only the keywords whose relevance exceeds a certain threshold. Either operation is very easy because the keyword connection table 8 is sorted in descending order of relevance. The list structure also makes the table easy to change when the relevance between keywords changes or when a new connection is added.
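
Because the list is already in descending order of relevance, both limits reduce to reading a prefix of the list. The sketch below assumes the connections of one keyword are held as a list of (keyword, relevance) pairs; the function names and sample values are hypothetical.

```python
from itertools import takewhile
from typing import List, Tuple

Link = Tuple[str, float]

def top_n_related(sorted_links: List[Link], n: int) -> List[Link]:
    # Search depth n: take the first n entries of the sorted list.
    return sorted_links[:n]

def related_above(sorted_links: List[Link], threshold: float) -> List[Link]:
    # Threshold variant: stop at the first entry whose relevance does not
    # exceed the threshold; later entries cannot exceed it either.
    return list(takewhile(lambda link: link[1] > threshold, sorted_links))

# Hypothetical connections of one keyword, sorted by descending relevance.
links = [("search", 0.75), ("index", 0.5), ("catalog", 0.25)]
depth_two = top_n_related(links, 2)
above_point_four = related_above(links, 0.4)
```

Neither function inspects entries past the cut-off point, which is why the sorted order makes these operations cheap.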

[Effects of the invention]

As described above, the present invention refers to the inverted file and calculates file certainty only for documents containing keywords related to the search keyword. Because file certainty is not calculated for documents that contain no keyword related to the search keyword, the time needed to calculate file certainty is proportional to the number of connections of the related keywords rather than to the number of registered documents, and retrieval can be sped up even when a large number of documents are searched.

[Brief explanation of drawings]

The drawings show an embodiment of the present invention. FIG. 1 is a block diagram showing the system configuration of a document retrieval device, FIG. 2 is an explanatory diagram showing an example of the contents of the keyword connection table, and FIG. 3 is an explanatory diagram showing an example of the contents of the inverted file.

8: keyword connection table
10: inverted file

Claims

[Claims]

A document retrieval method characterized by providing a keyword connection table, which is created when document information is registered in a file and describes the relation information between keywords, and an inverted file, which indicates the relationship between registered documents and keywords, wherein the inverted file is referenced so that file certainty is calculated only for documents containing keywords related to a search keyword, and documents are retrieved according to this file certainty.