JPH10289246A

JPH10289246A - Similar document retrieving device and similar document retrieving method

Info

Publication number: JPH10289246A
Application number: JP9097630A
Authority: JP
Inventors: Yasuo Tanosaki; 康雄田野崎; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科; Naohide Kubota; 直秀久保田
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1997-04-15
Filing date: 1997-04-15
Publication date: 1998-10-27

Abstract

PROBLEM TO BE SOLVED: To provide a similar document retrieving device that efficiently retrieves documents similar to a retrieval key document even when the number of documents is large. SOLUTION: A retrieval key document is input from an input device 2. Document attention context information suggesting the contents is extracted from each document in an external storage device 4 and from the retrieval key document, and category attention context information is extracted by treating each content-specific document group in the external storage device 4 as a single unit. The category similarity between the document attention context information of the retrieval key document and the category attention context information of each content-specific document group is then calculated. According to the calculated category similarity, the document group whose document similarity to the retrieval key document is to be calculated is selected from the document database, and identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculating means is output to a display device 3 as the retrieval result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a similar document search device and a similar document search method.

[0002]

[Prior Art] Conventionally, similar document search devices have been proposed that take a document as a search key and extract, from a search target document database, documents whose contents are similar to that document. Such a device compares the words contained in the search key document with the words contained in each document stored in the search target document database, calculates the similarity between the search key document and each stored document, and extracts similar documents according to how high or low that similarity is.

[0003] The similarity is calculated by a vector space method that uses the types, appearance counts, appearance locations, and the like of the words contained in the search key document and in each document stored in the search target document database.
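As an illustration of the vector-based comparison described above, the following minimal Python sketch scores two documents by the cosine of their word-count vectors. The patent does not give a formula, so cosine similarity, whitespace tokenization, and the names `term_vector` and `cosine_similarity` are assumptions made for this example; appearance locations are ignored.

```python
import math
from collections import Counter


def term_vector(text):
    """Count word occurrences (whitespace tokenization is a simplification)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


key = term_vector("patent search engine")
candidate = term_vector("patent search method")
score = cosine_similarity(key, candidate)  # two of three words are shared
```

With this approach every stored document needs its own comparison against the search key, which is the cost the invention sets out to reduce.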

[0004]

[Problem to Be Solved by the Invention] In the prior art described above, however, the similarity between the search key document and a stored document is calculated once for every document stored in the search target document database. When the database holds a large number of documents, the search processing time therefore inevitably increases.

[0005] The present invention has been made to solve the above problem. Its object is to provide a similar document search device and a similar document search method that can shorten the search processing time and efficiently retrieve documents similar to the search key, even when the search target document database holds a large number of documents.

[0006]

[Means for Solving the Problem] The invention according to claim 1 is a similar document search device that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the device comprising: input means for inputting a search key document; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; similarity calculation document selection means for selecting, from the document database and according to the category similarity calculated by the category similarity calculation means, the document group whose document similarity to the search key document is to be calculated; document similarity calculation means for calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the document group selected by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

[0007] The invention according to claim 2 is a similar document search method that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the method comprising: creating a search key document; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; selecting, from the document database and according to the calculated category similarity, the document group whose document similarity to the search key document is to be calculated; calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the selected document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the calculated document similarities.

[0008] In the similar document search method according to claim 2, which uses the configuration of the similar document search device according to claim 1, the input means creates a search key document, the document attention context information extraction means extracts document attention context information suggesting the content from each document in the document database and from the search key document, and the category attention context information extraction means extracts category attention context information by treating each content-specific document group in the document database as a single unit.

[0009] Further, the category similarity calculation means calculates the category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; the similarity calculation document selection means selects, from the document database and according to the calculated category similarity, the document group whose document similarity to the search key document is to be calculated; and the output means outputs, as the search result, identification information of the documents retrieved on the basis of the calculated document similarities.

[0010] This solves the conventional problem that the search processing time increases as the number of documents stored in the document database grows. Regardless of how many documents there are, the amount of document similarity computation is greatly reduced, and the efficiency of the user's similar document search can be greatly improved.
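The two-stage procedure summarized above can be sketched in Python as follows: one comparison per category decides which document groups are examined, and per-document similarities are computed only inside the selected groups. This is a minimal sketch, not the patented implementation. The cosine measure, the fixed threshold `category_threshold`, and all function and variable names are assumptions, and category vectors are pooled at query time only for brevity (the device extracts them beforehand).

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse word-count vectors (assumed measure)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_search(key_text, categories, category_threshold=0.2):
    """categories maps a category name to {doc_id: text}."""
    key = Counter(key_text.lower().split())
    results = []
    compared = 0  # number of per-document similarity calculations
    for docs in categories.values():
        # Stage 1: a single comparison against the pooled category vector.
        pooled = Counter()
        for text in docs.values():
            pooled.update(text.lower().split())
        if cosine(key, pooled) < category_threshold:
            continue  # every document in this category is skipped
        # Stage 2: per-document comparison inside selected categories only.
        for doc_id, text in docs.items():
            compared += 1
            results.append((doc_id, cosine(key, Counter(text.lower().split()))))
    results.sort(key=lambda item: item[1], reverse=True)
    return results, compared


categories = {
    "search": {"d1": "document search index", "d2": "similar document search"},
    "cooking": {"d3": "soup recipe", "d4": "bread recipe"},
}
ranked, compared = two_stage_search("similar document search", categories)
```

In this toy database only the two documents of the "search" category are compared individually; the "cooking" category is rejected with a single category-level comparison.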

[0011] The invention according to claim 3 is a similar document search device that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the device comprising: input means for inputting a search key document; category similarity distribution setting means for setting a category similarity distribution of the search target documents; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; similarity calculation document selection means for selecting from the document database, on the basis of the category similarity calculated by the category similarity calculation means and the category similarity distribution set by the category similarity distribution setting means, the document group for document similarity calculation that belongs to the categories containing documents close to the search key document; document similarity calculation means for calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the document group selected by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

[0012] The invention according to claim 4 is a similar document search method that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the method comprising: creating a search key document; setting a category similarity distribution of the search target documents; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; selecting from the document database, on the basis of the calculated category similarity and the set category similarity distribution, the document group for document similarity calculation that belongs to the categories containing documents close to the search key document; calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the selected document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the calculated document similarities.

[0013] The similar document search method according to claim 4, which uses the configuration of the similar document search device according to claim 3, basically provides the same effects as the inventions according to claims 1 and 2. In addition, because the category similarity distribution of the search target documents is set in advance, the categories containing the documents closest to the search key document can be selected as desired, which prevents search omissions while greatly improving the user's document search efficiency.

[0014] The invention according to claim 5 is a similar document search device that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the device comprising: input means for inputting a search key document; category similarity distribution setting means for setting a category similarity distribution of the search target documents; document number distribution setting means for setting a document number distribution of the search target documents; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; similarity calculation document selection means for selecting from the document database, on the basis of the category similarity calculated by the category similarity calculation means and the category similarity distribution set by the category similarity distribution setting means, the document group for document similarity calculation that belongs to the categories containing documents close to the search key document, and for narrowing that group down to a number of documents that accords with the document number distribution of the search target documents set by the document number distribution setting means; document similarity calculation means for calculating each document similarity on the basis of the document attention context information of the document group narrowed down by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

[0015] The invention according to claim 6 is a similar document search method that uses one document as a search key and extracts similar documents from a document database in which document groups are stored by content, the method comprising: creating a search key document; setting a category similarity distribution of the search target documents; setting a document number distribution of the search target documents; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group; selecting from the document database, on the basis of the calculated category similarity and the set category similarity distribution, the document group for document similarity calculation that belongs to the categories containing documents close to the search key document, and narrowing that group down to a number of documents that accords with the set document number distribution of the search target documents; calculating each document similarity on the basis of the document attention context information of the narrowed-down document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities thus calculated.

[0016] The similar document search method according to claim 6, which uses the configuration of the similar document search device according to claim 5, basically provides the same effects as the inventions according to claims 3 and 4. In addition, the maximum number of documents for which similarity to the search key is calculated is determined from the number of documents stored in the document database or from a specified number of documents. Using this maximum as a reference, the categories to which the documents subject to similarity calculation belong are determined, so the search can be carried out with the number of documents subject to document similarity calculation narrowed down. This greatly improves the user's similar document search efficiency.

[0017]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0018] FIG. 1 is a block diagram showing the configuration of the similar document search device according to the present embodiment. The device has a control device 1 composed of a CPU and memory, an input device 2 such as a keyboard, a display device 3 that displays similar document search results and the like, and an external storage device 4 that stores the document data to be searched and the like.

[0019] FIG. 2 is a block diagram showing a detailed configuration example of the control device 1, together with the input device 2 and the display device 3.

[0020] The control device 1 is composed of a control unit and a memory unit.

[0021] The control unit executes the various control and processing operations. It includes a main processing unit 200, an initialization unit 201, an input unit 202, an output unit 203, a category-specific document number setting unit 204, a similarity distribution setting unit 205, a document number distribution setting unit 206, a document database attention context information extraction unit 207, a search key document attention context information extraction unit 208, a category attention context information extraction unit 209, a category similarity calculation unit 210, a similarity calculation document selection unit 211, a category selection unit 212, a category document selection unit 213, a document similarity calculation unit 214, a category document setting unit 215, a search result output unit 216, and the like.

[0022] The memory unit includes a category-specific document number buffer unit 220, a similarity distribution setting buffer unit 221, a document number distribution setting buffer unit 222, a document database attention context information buffer unit 223, a search key document attention context information buffer unit 224, a category attention context information buffer unit 225, a category similarity buffer unit 226, a similarity calculation document buffer unit 227, a selected category buffer unit 228, a selected document buffer unit 229, a document similarity buffer unit 230, a category document buffer unit 231, a work buffer unit 240, and the like.

[0023] Here, the initialization unit 201 initializes each buffer unit. The input unit 202 outputs the search key document information and the various other information entered by the user through the input device 2 to the work buffer unit 240.

[0024] The output unit 203 outputs the search key document and the contents of the various settings entered through the input unit 202 to the display device 3 for display.

[0025] The category-specific document number setting unit 204 sets, for a document database that is divided into categories according to document content, the number of documents in each category of that database. The per-category document numbers set here are stored in the category-specific document number buffer unit 220.

[0026] The similarity distribution setting unit 205 makes the settings used to decide, from the similarity values between the search key document and each category of the documents in the document database, the categories whose document groups will have their similarity to the search key document calculated.

[0027] In this setting, for example, similarity to the search key document may be calculated for the document groups belonging to categories whose similarity value is at or above a value specified by the user, or the categories whose document groups are subject to similarity calculation may be set according to statistical information (such as deviation values) on the distribution of similarity values across all categories. The settings made here are stored in the similarity distribution setting buffer unit 221.
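The two styles of setting mentioned above (a user-specified minimum similarity, or a cut based on statistics of the similarity distribution) can be sketched in Python as follows. The patent does not define how the deviation value is computed; this sketch assumes the conventional standard score 50 + 10 * (x - mean) / standard deviation, and the function names and default cut-off are hypothetical.

```python
import statistics


def select_by_threshold(category_sims, min_similarity):
    """Keep categories whose similarity is at or above a user-specified value."""
    return [name for name, sim in category_sims.items() if sim >= min_similarity]


def select_by_deviation(category_sims, min_deviation=50.0):
    """Keep categories whose deviation value (assumed: 50 + 10 * z) is high enough."""
    values = list(category_sims.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(category_sims)  # all similarities equal: nothing to exclude
    return [
        name
        for name, sim in category_sims.items()
        if 50.0 + 10.0 * (sim - mean) / stdev >= min_deviation
    ]


category_sims = {"a": 0.9, "b": 0.5, "c": 0.1}
by_threshold = select_by_threshold(category_sims, 0.5)
by_deviation = select_by_deviation(category_sims, 60.0)
```

A fixed threshold is simple to reason about, while a statistics-based cut adapts to search keys whose category similarities are uniformly high or uniformly low.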

[0028] The similarity distribution setting unit 205 decides, from the similarity values between the search key document and each category in the document database, the categories for which similarity calculation extends to the document groups belonging to them. Complementing this, the document number distribution setting unit 206 sets the maximum number of documents whose similarity to the search key document is calculated.

[0029] Even when the similarity distribution setting unit 205 has set the target categories for which similarity calculation extends to the document groups, if the number of documents belonging to those categories exceeds the value set by the document number distribution setting unit 206, categories with low similarity to the search key document are excluded from the target categories so that the number of documents is no more than the value set by the document number distribution setting unit 206. The settings made here are stored in the document number distribution setting buffer unit 222.

[0030] The document database attention context information extraction unit 207 extracts, from each document stored in the document database, the context information that is noteworthy for representing the content of that document.

[0031] The context information in this case includes word types, appearance counts, appearance positions, co-occurrence information, and the like.
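As a rough illustration of the kinds of context information listed above, the following Python sketch collects word counts, word positions, and sentence-level co-occurrence counts for one document. The patent does not specify the extraction procedure, so the tokenization (lowercasing, splitting on periods and whitespace), the sentence-level co-occurrence window, and the name `extract_context_info` are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def extract_context_info(text):
    """Return word counts, word positions, and sentence-level co-occurrences."""
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    counts = Counter()             # keys are the word types, values the counts
    positions = defaultdict(list)  # word -> running word indexes in the document
    cooccurrence = Counter()       # (word_a, word_b) -> sentences containing both
    index = 0
    for words in sentences:
        for word in words:
            counts[word] += 1
            positions[word].append(index)
            index += 1
        # Each unordered pair of distinct words counts once per sentence.
        for pair in combinations(sorted(set(words)), 2):
            cooccurrence[pair] += 1
    return {
        "counts": counts,
        "positions": dict(positions),
        "cooccurrence": cooccurrence,
    }


info = extract_context_info("search document. search key document.")
```

For Japanese text the whitespace split would have to be replaced by a morphological analyzer, which is outside the scope of this sketch.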

[0032] The context information data is created for each document and stored in the document database attention context information buffer unit 223.

[0033] The search key document attention context information extraction unit 208 extracts, from the search key document entered through the input unit 202, the context information that is noteworthy for representing the content of that document.

[0034] The context information in this case includes word types, appearance counts, appearance positions, co-occurrence information, and the like.

[0035] The context information data is stored in the search key document attention context information buffer unit 224.

[0036] The category attention context information extraction unit 209 reads the context information extracted by the document database attention context information extraction unit 207 from the document database attention context information buffer unit 223, reads the document ID-category information created by the category document setting unit 215 from the category document buffer unit 231, gathers the context information of the documents in the same category into a single unit, and stores the result in the category attention context information buffer unit 225.
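The grouping step described above can be sketched in Python as follows: per-document word counts are merged into one vector per category using a document ID to category mapping. Representing context information as plain word counts, and the names `build_category_context`, `doc_context`, and `doc_category`, are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_category_context(doc_context, doc_category):
    """Merge per-document word counts into a single Counter per category."""
    category_context = defaultdict(Counter)
    for doc_id, counts in doc_context.items():
        category_context[doc_category[doc_id]].update(counts)
    return dict(category_context)


# Hypothetical contents of the per-document buffer and the ID-category table.
doc_context = {
    "d1": Counter(search=2, key=1),
    "d2": Counter(search=1),
    "d3": Counter(recipe=3),
}
doc_category = {"d1": "retrieval", "d2": "retrieval", "d3": "cooking"}
category_context = build_category_context(doc_context, doc_category)
```

Because the merge depends only on the stored documents, it can be done once ahead of time and reused for every search key.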

[0037] The category similarity calculation unit 210 calculates the similarity between the search key document and each category from the search key document attention context information buffer unit 224, created by the search key document attention context information extraction unit 208, and the document database attention document information buffer unit 223, created by the category attention context information extraction unit 209. The calculated category similarity values are stored in the category similarity buffer unit 226.

[0038] The similarity calculation document selection unit 211 extracts the document IDs belonging to the categories stored in the selected category buffer unit 228 by referring to the category document buffer unit 231, selects the corresponding document IDs from the similarity calculation document buffer unit 227, and stores them in the selected document buffer unit 229.

[0039] The category selection unit 212 reads the per-category similarity values calculated by the category similarity calculation unit 210 from the category similarity buffer unit 226, and stores in the selected category buffer unit 228 the categories that satisfy the condition specified through the similarity distribution setting unit 205 and stored in the similarity distribution setting buffer unit 221.

[0040] When the number of documents for which similarity between the search key document and the documents stored in the document database is to be calculated is stored in the category document buffer unit 231, the category document selection unit 213 refers to the category-specific document number buffer unit 220 for the number of documents belonging to the categories stored in the selected category buffer unit 228, and calculates the number of documents for which similarity is to be calculated.

[0041] If that number of documents does not satisfy the document number condition stored in the selected document number buffer unit 229, some of the categories stored in the selected category buffer unit 228 are deleted so that the condition is satisfied. The categories to delete are chosen by referring to the category similarity buffer unit 226 and are deleted one at a time, starting with the category of lowest similarity.

[0042] The document similarity calculation unit 214 extracts the context information corresponding to the document IDs stored in the selected document buffer unit 229 from the document database attention context information buffer unit 223, obtains the similarity between the context information of each document ID and the context information of the search key document stored in the search key document attention context information buffer unit 224, and stores the similarity between the search key document and each document stored in the document database in the document similarity buffer unit 230.

[0043] The category document setting unit 215 assigns a uniquely determined document ID and a category to each document file in the document database and stores them in the category document buffer unit 231.

[0044] The search result output unit 216 refers to the document similarity values stored in the document similarity buffer unit 230 and outputs the documents to the display device 3 in descending order of similarity.

[0045] Next, the procedure by which the apparatus of this embodiment creates the document database is described with reference to FIG. 3, and the similar document search procedure is described with reference to FIG. 4.

[0046] First, the document database creation procedure is described with reference to FIG. 3.

[0047] First, the initialization unit 201 starts and performs operations such as clearing the memory unit (step S101). The category document setting unit 215 then starts and assigns a document ID and a category to each document registered in the document database (step S102). The document ID uniquely identifies a document in the document database, and no two documents share an ID. A category groups the documents in the document database by content, so the number of categories is determined by the number of content classifications in the document database. Exactly one category is assigned to each document. The document ID, category, and document name (file name) are linked together and stored in the category document buffer unit 231, as shown in FIG. 16.
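Step S102 can be pictured as building a table keyed by document ID. The following is a minimal Python sketch under stated assumptions: the names `CategoryDocumentEntry` and `register_documents`, the sequential numbering scheme, and the sample file names are all hypothetical and are not taken from the specification.

```python
from typing import Dict, List, NamedTuple


class CategoryDocumentEntry(NamedTuple):
    """One row of the category document buffer (hypothetical layout)."""
    category: int    # exactly one category per document
    file_name: str   # document name (file name)


def register_documents(files_by_category: Dict[int, List[str]]) -> Dict[int, CategoryDocumentEntry]:
    """Assign a unique document ID and a single category to every file.

    Sequential integer IDs are an assumption; the text only requires
    that IDs never repeat.
    """
    buffer: Dict[int, CategoryDocumentEntry] = {}
    next_id = 1
    for category, file_names in files_by_category.items():
        for file_name in file_names:
            buffer[next_id] = CategoryDocumentEntry(category, file_name)
            next_id += 1
    return buffer


category_document_buffer = register_documents({1: ["a.txt", "b.txt"], 2: ["c.txt"]})
```

Keying the table by document ID makes the uniqueness requirement structural, since a dictionary cannot hold the same key twice.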

[0048] Next, the category-specific document number setting unit 204 is started, and the number of documents in each category of the document database is set (step S103) and stored in the category-specific document number buffer unit 220.
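The per-category counts of step S103 could be derived directly from the document ID to category mapping. This sketch assumes a plain dictionary as input; the function name `count_documents_per_category` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable


def count_documents_per_category(category_of_document: Dict[int, Hashable]) -> Counter:
    """Return how many documents belong to each category."""
    return Counter(category_of_document.values())


document_counts = count_documents_per_category({1: "A", 2: "A", 3: "B"})
```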

[0049] The document database attention context information extraction unit 207 then starts, creates from each document of the document database stored in the external storage device 4 the attention context information that represents the content of that document, and stores it in the document database attention context information buffer unit 223 (step S104). The attention context information comprises co-occurrence information, for example "new product ... announcement" for a document about the announcement of a new product, and word information, namely the words that are important in representing the content of the document.

[0050] This is performed for each document in the document database, and the results are stored in the document database attention context information buffer unit 223, as shown in FIG. 8.
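The specification does not state how the co-occurrence information and important words are obtained. As a rough illustration only, the Python sketch below assumes whitespace tokenization of English text, a tiny illustrative stop-word list, and treats every pair of remaining words in the same sentence as a co-occurrence. The names `STOP_WORDS` and `extract_attention_context` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "is", "on", "for", "to", "and"}


def extract_attention_context(text: str) -> Dict[str, Set]:
    """Collect content words and same-sentence word pairs from a text."""
    words: Set[str] = set()
    cooccurrences: Set[tuple] = set()
    for sentence in text.lower().split("."):
        # Keep distinct non-stop-words, sorted so each pair has one canonical order.
        tokens = sorted({t for t in sentence.split() if t not in STOP_WORDS})
        words.update(tokens)
        cooccurrences.update(combinations(tokens, 2))
    return {"words": words, "cooccurrences": cooccurrences}


context = extract_attention_context("Announcement of the new product.")
```

Running this over every document and storing the result under its document ID would give a table with the same role as the one in FIG. 8.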

[0051] Next, the category attention context information extraction unit 209 starts, refers to the category document buffer unit 231 and the document database attention context information buffer unit 223, merges the attention context information of the documents belonging to each category, category by category, and stores the result in the category attention context information buffer unit 225 (step S105). In the category attention context information buffer unit 225, each category is stored linked to its attention context information, as shown in FIG. 10. This completes the document database creation procedure.
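Step S105 amounts to a group-by over the two buffers. The sketch below assumes each document's attention context information is represented as a Python set, which is a simplification not stated in the specification; `merge_context_by_category` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def merge_context_by_category(
    category_of_document: Dict[int, Hashable],
    context_of_document: Dict[int, Set[str]],
) -> Dict[Hashable, Set[str]]:
    """Union the context sets of all documents that share a category."""
    merged: Dict[Hashable, Set[str]] = defaultdict(set)
    for doc_id, category in category_of_document.items():
        merged[category] |= context_of_document[doc_id]
    return dict(merged)


category_contexts = merge_context_by_category(
    {1: "A", 2: "A", 3: "B"},
    {1: {"new", "product"}, 2: {"product", "price"}, 3: {"weather"}},
)
```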

[0052] Next, the similar document search procedure is described with reference to FIG. 4.

[0053] First, the initialization unit 201 starts and performs operations such as clearing the memory unit (step S201). A selection is then made among inputting a search key document, executing a search, and configuring the search settings (step S202).

[0054] When the search key document is selected in step S202, the input unit 202 starts, a search key document such as the one shown in FIG. 17 is input from the input device 2 (step S205), and the data of the search key document is stored in the work buffer unit 240.

[0055] The search key document attention context information extraction unit 208 then starts, reads the search key document data from the work buffer unit 240, creates the attention context information representing the content of that document, and stores it in the search key document attention context information buffer unit 224, as shown in FIG. 9 (step S206). The process then returns to step S102.

[0056] When the search settings are selected in step S202, the similarity distribution setting unit 205 starts and sets, on the basis of the calculated similarity between the search key document and the categories, which categories' documents are to have their similarity with the search key document calculated. For example, as shown in FIG. 6, when the search is to be narrowed to only those categories whose category similarity with the search key document is at or above a certain reference value, that reference value (for example, a similarity of 0.5) is set (step S203). Alternatively, statistics of the category similarities between the search key document and the categories can be taken and the setting derived from the standard deviation or the like. The similarity distribution settings are stored in the similarity distribution setting buffer unit 221, as shown in FIG. 6.
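The text leaves open how a setting would be derived from the standard deviation. One possible reading, offered only as an assumption, is a threshold of the mean plus `k` standard deviations of the category similarities. The function name `threshold_from_distribution` and the parameter `k` are hypothetical.

```python
import statistics
from typing import Dict, Hashable


def threshold_from_distribution(category_similarities: Dict[Hashable, float], k: float = 1.0) -> float:
    """Return mean + k * population standard deviation of the similarities.

    This formula is an illustrative assumption, not the patent's rule.
    """
    values = list(category_similarities.values())
    return statistics.mean(values) + k * statistics.pstdev(values)


threshold = threshold_from_distribution({1: 0.2, 2: 0.4, 3: 0.6})
```

A fixed reference value such as 0.5 is the simpler alternative described in the same paragraph. A statistical threshold adapts to each search key document instead.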

[0057] Next, the document number distribution setting unit 206 starts and, in case narrowing down by the category similarity between the search key document and each category still fails to bring the number of documents down to the expected value or below, sets the number of documents (for example, 4000 or fewer) for which document similarity with the search key document is to be calculated (step S204).

[0058] The data on the set number of documents is stored in the document number distribution setting buffer unit 222, as shown in FIG. 7. The process then returns to step S102.

[0059] When search execution is selected in step S202, the category similarity calculation unit 210 starts, refers to the category attention context information buffer unit 225 and the search key document attention context information buffer unit 224, calculates the category similarity between the search key document and each category (step S207), and stores the result in the category similarity buffer unit 226, as shown in FIG. 11.
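The specification does not name the similarity measure used in step S207. The sketch below substitutes the Jaccard coefficient between sets purely for illustration; the names `jaccard` and `category_similarities` are hypothetical.

```python
from typing import Dict, Hashable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: shared items divided by all distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def category_similarities(
    key_context: Set[str],
    category_contexts: Dict[Hashable, Set[str]],
) -> Dict[Hashable, float]:
    """Compare the search key document once against each category."""
    return {
        category: jaccard(key_context, context)
        for category, context in category_contexts.items()
    }


similarities = category_similarities(
    {"new", "product"},
    {1: {"new", "product", "launch", "price"}, 2: {"weather"}},
)
```

Whatever measure is used, the point of this step is that it runs once per category rather than once per document.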

[0060] The category selection unit 212 then starts and determines whether a similarity distribution has been set in the similarity distribution setting buffer unit 221 (step S208). If one has been set, the categories whose category similarity satisfies the condition stored in the similarity distribution setting buffer unit 221 (for example, categories 2 and 4) are stored in the selected category buffer unit 228, as shown in FIG. 13 (step S209).

[0061] Next, the category document selection unit 213 starts. If document number distribution data has been set in the document number distribution setting buffer unit 222 (step S210), it reads the selected category buffer unit 228 and the category-specific document number buffer unit 220. If the total number of documents belonging to the categories stored in the selected category buffer unit 228 does not satisfy the condition stored in the document number distribution setting buffer unit 222, it reads the category similarity buffer unit 226 and deletes categories from the selected category buffer unit 228 one at a time, starting with the category of lowest similarity, until the number of documents satisfies the condition stored in the document number distribution setting buffer unit 222 (step S211).
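Steps S210 and S211 describe a greedy trim. In this Python sketch the condition is assumed to be a simple upper limit on the total document count, as in the "4000 or fewer" example; `trim_to_document_limit` and its parameters are hypothetical names.

```python
from typing import Dict, Hashable, List


def trim_to_document_limit(
    selected: List[Hashable],
    category_similarities: Dict[Hashable, float],
    documents_per_category: Dict[Hashable, int],
    max_documents: int,
) -> List[Hashable]:
    """Drop the least similar categories until the total fits the limit."""
    # Most similar first, so the least similar category sits at the end.
    kept = sorted(selected, key=lambda c: category_similarities[c], reverse=True)
    while kept and sum(documents_per_category[c] for c in kept) > max_documents:
        kept.pop()  # remove the category with the lowest similarity
    return kept


remaining = trim_to_document_limit(
    [1, 2, 4],
    {1: 0.7, 2: 0.5, 4: 0.6},
    {1: 2000, 2: 1500, 4: 1800},
    max_documents=4000,
)
```

Note that this sketch returns an empty list when even the most similar category alone exceeds the limit. The specification does not say how that case is handled.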

[0062] If no similarity distribution has been set in the similarity distribution setting buffer unit 221 in step S208, the category selection unit 212 starts, selects the category with the highest category similarity in the category similarity buffer unit 226 (step S212), and stores it in the selected category buffer unit 228.
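The branch at step S208 (threshold set, or fall back to the single best category) can be sketched as follows. The condition is assumed to be "similarity at or above a reference value", matching the 0.5 example given for the settings; `select_categories` is a hypothetical name.

```python
from typing import Dict, Hashable, List, Optional


def select_categories(
    category_similarities: Dict[Hashable, float],
    threshold: Optional[float] = None,
) -> List[Hashable]:
    """Select categories by threshold, or the top category if none is set."""
    if threshold is not None:
        # Step S209: keep every category that meets the reference value.
        return [c for c, s in category_similarities.items() if s >= threshold]
    # Step S212: no setting, so keep only the most similar category.
    return [max(category_similarities, key=category_similarities.get)]


with_setting = select_categories({1: 0.3, 2: 0.8, 3: 0.1, 4: 0.6}, threshold=0.5)
without_setting = select_categories({1: 0.3, 2: 0.8, 3: 0.1, 4: 0.6})
```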

[0063] Next, the similarity calculation document selection unit 211 starts, looks up in the category document buffer unit 231 the document IDs corresponding to the categories stored in the selected category buffer unit 228, and stores them in the selected document buffer unit 229 shown in FIG. 13 (step S213).
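Step S213 is a filter over the document ID to category table. A minimal sketch, with `select_document_ids` as a hypothetical name:

```python
from typing import Dict, Hashable, Iterable, List


def select_document_ids(
    selected_categories: Iterable[Hashable],
    category_of_document: Dict[int, Hashable],
) -> List[int]:
    """Return the IDs of all documents in any selected category."""
    wanted = set(selected_categories)
    return [doc_id for doc_id, category in category_of_document.items() if category in wanted]


selected_ids = select_document_ids([2, 4], {10: 1, 11: 2, 12: 4, 13: 3, 14: 2})
```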

[0064] Next, the document similarity calculation unit 214 starts. For each document ID stored in the selected document buffer unit 229, it reads the attention context information of that document from the document database attention context information buffer unit 223, calculates the document similarity with the contents of the search key document attention context information buffer unit 224 (step S214), and stores the result in the document similarity buffer unit 230 as a "document ID - similarity" pair, as shown in FIG. 15.

[0065] Next, the document similarity between the search key document and the documents of all document IDs stored in the selected document buffer unit 229 is calculated, and the document similarity for each document ID is stored in the document similarity buffer unit 230 (step S214).
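Step S214 repeats the comparison at document level, but only for the selected IDs. As with the category step, the similarity measure is not specified, so this sketch again assumes a Jaccard coefficient over sets; `score_selected_documents` is a hypothetical name.

```python
from typing import Dict, Iterable, Set


def score_selected_documents(
    key_context: Set[str],
    selected_ids: Iterable[int],
    context_of_document: Dict[int, Set[str]],
) -> Dict[int, float]:
    """Build 'document ID - similarity' pairs for the selected documents only."""
    scores: Dict[int, float] = {}
    for doc_id in selected_ids:
        context = context_of_document[doc_id]
        union = key_context | context
        scores[doc_id] = len(key_context & context) / len(union) if union else 0.0
    return scores


document_similarities = score_selected_documents(
    {"a", "b"},
    [1, 3],
    {1: {"a", "b"}, 2: {"a"}, 3: {"c"}},
)
```

Document 2 is never scored here because its category was not selected, which is where the reduction in computation comes from.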

[0066] The search result output unit 216 then starts and outputs the high-similarity documents stored in the document similarity buffer unit 230 as the search processing result from the output unit 203 to the output device 3, as shown in FIG. 18 (step S215).
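Producing the output of step S215 is a sort in descending order of similarity. The optional `top_n` cutoff in this sketch is an assumption added for illustration, and `rank_documents` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def rank_documents(
    document_similarities: Dict[int, float],
    top_n: Optional[int] = None,
) -> List[Tuple[int, float]]:
    """Return (document ID, similarity) pairs, most similar first."""
    ranked = sorted(document_similarities.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


results = rank_documents({10: 0.2, 11: 0.9, 12: 0.5}, top_n=2)
```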

[0067] Finally, it is determined whether the similarity search processing is to continue (step S216). If it is to continue, the process moves to step S202. If it is to end, all search processing ends.

[0068] The present invention is not limited to the embodiment described above.

[0069]

[Effects of the Invention] According to the inventions of claims 1 and 2, it is possible to provide a similar document search apparatus, and a similar document search method using that apparatus, in which the amount of document similarity computation is greatly reduced regardless of how many documents there are, so that the efficiency of the user's similar document search is greatly improved.

[0070] According to the inventions of claims 3 and 4, in addition to the effects of the inventions of claims 1 and 2, it is possible to provide a similar document search apparatus, and a similar document search method using that apparatus, that greatly improve similar document search efficiency while preventing search omissions by freely selecting the categories that contain the documents closest to the search key document.

[0071] According to the inventions of claims 5 and 6, in addition to the effects of the inventions of claims 3 and 4, it is possible to provide a similar document search apparatus, and a similar document search method using that apparatus, that greatly improve similar document search efficiency by narrowing down the number of documents for which document similarity is calculated before searching.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of an apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram of the control unit of the apparatus according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the document database creation procedure of the embodiment.

FIG. 4 is a flowchart showing the similar document search procedure of the embodiment.

FIG. 5 is a diagram showing an example of the contents of the category-specific document number buffer unit of the embodiment.

FIG. 6 is a diagram showing an example of the contents of the similarity distribution setting buffer unit of the embodiment.

FIG. 7 is a diagram showing an example of the contents of the document number distribution setting buffer unit of the embodiment.

FIG. 8 is a diagram showing an example of the contents of the document database attention context information buffer unit of the embodiment.

FIG. 9 is a diagram showing an example of the contents of the search key document attention context information buffer unit of the embodiment.

FIG. 10 is a diagram showing an example of the contents of the category attention context information buffer unit of the embodiment.

FIG. 11 is a diagram showing an example of the contents of the category similarity buffer unit of the embodiment.

FIG. 12 is a diagram showing an example of the contents of the similarity calculation document buffer unit of the embodiment.

FIG. 13 is a diagram showing an example of the contents of the selected category buffer unit of the embodiment.

FIG. 14 is a diagram showing an example of the contents of the selected document buffer unit of the embodiment.

FIG. 15 is a diagram showing an example of the contents of the document similarity buffer unit of the embodiment.

FIG. 16 is a diagram showing an example of the contents of the category document buffer unit of the embodiment.

FIG. 17 is a diagram showing an input example of a search key document of the embodiment.

FIG. 18 is a diagram showing an output example of a search result of the embodiment.

[Explanation of symbols]

1 control device
2 input device
3 display device
4 external storage device
200 main processing unit
201 initialization unit
202 input unit
203 output unit
204 category-specific document number setting unit
205 similarity distribution setting unit
206 document number distribution setting unit
207 document database attention context information extraction unit
208 search key document attention context information extraction unit
209 category attention context information extraction unit
210 category similarity calculation unit
211 similarity calculation document selection unit
212 category selection unit
213 category document selection unit
214 document similarity calculation unit
215 category document setting unit
216 search result output unit
220 category-specific document number buffer unit
221 similarity distribution setting buffer unit
222 document number distribution setting buffer unit
223 document database attention context information buffer unit
224 search key document attention context information buffer unit
225 category attention context information buffer unit
226 category similarity buffer unit
227 similarity calculation document buffer unit
228 selected category buffer unit
229 selected document buffer unit
230 document similarity buffer unit
231 category document buffer unit
240 work buffer unit

Continuation of front page

(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Naohide Kubota, c/o Toshiba Computer Engineering Corp., 1381-1 Shinmachi, Ome-shi, Tokyo

Claims

1. A similar document search apparatus that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: input means for inputting a search key document; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; similarity calculation document selection means for selecting from the document database, in accordance with the category similarity calculated by the category similarity calculation means, a document group for which document similarity with the search key document is to be calculated; document similarity calculation means for calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the document group selected by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

2. A similar document search method that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: creating a search key document; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; selecting from the document database, in accordance with the calculated category similarity, a document group for which document similarity with the search key document is to be calculated; calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the selected document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the calculated document similarities.

3. A similar document search apparatus that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: input means for inputting a search key document; category similarity distribution setting means for setting a category similarity distribution; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; similarity calculation document selection means for selecting from the document database, on the basis of the category similarity calculated by the category similarity calculation means and the category similarity distribution set by the category similarity distribution setting means, a document group that belongs to the categories containing documents similar to the search key document and for which document similarity is to be calculated; document similarity calculation means for calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the document group selected by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

4. A similar document search method that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: creating a search key document; setting a category similarity distribution of the documents to be searched; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; selecting from the document database, on the basis of the calculated category similarity and the set category similarity distribution, a document group that belongs to the categories containing documents similar to the search key document and for which document similarity is to be calculated; calculating the document similarity of each document in the selected document group on the basis of the document attention context information of the selected document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the calculated document similarities.

5. A similar document search apparatus that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: input means for inputting a search key document; category similarity distribution setting means for setting a category similarity distribution; document number distribution setting means for setting a document number distribution of the documents to be searched; document attention context information extraction means for extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; category attention context information extraction means for extracting category attention context information by treating each content-specific document group in the document database as a single unit; category similarity calculation means for calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; similarity calculation document selection means for selecting from the document database, on the basis of the category similarity calculated by the category similarity calculation means and the category similarity distribution set by the category similarity distribution setting means, a document group that belongs to the categories containing documents similar to the search key document and for which document similarity is to be calculated, and for narrowing that group down to a number of documents that conforms to the document number distribution of the documents to be searched set by the document number distribution setting means; document similarity calculation means for calculating each document similarity on the basis of the document attention context information of the document group narrowed down by the similarity calculation document selection means and the document attention context information of the search key document; and output means for outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.

6. A similar document search method that uses one document as a search key to extract similar documents from a document database in which document groups are stored by content, comprising: creating a search key document; setting a category similarity distribution of the documents to be searched; setting a document number distribution of the documents to be searched; extracting, from each document in the document database and from the search key document, document attention context information suggesting its content; extracting category attention context information by treating each content-specific document group in the document database as a single unit; calculating a category similarity between the document attention context information extracted from the search key document and the category attention context information extracted from each content-specific document group treated as a single unit; selecting from the document database, on the basis of the calculated category similarity and the set category similarity distribution, a document group that belongs to the categories containing documents similar to the search key document and for which document similarity is to be calculated, and narrowing that group down to a number of documents that conforms to the set document number distribution of the documents to be searched; calculating each document similarity on the basis of the document attention context information of the narrowed-down document group and the document attention context information of the search key document; and outputting, as a search result, identification information of the documents retrieved on the basis of the document similarities calculated by the document similarity calculation means.