JPH10269235A

JPH10269235A - Device and method for similar document retrieval

Info

Publication number: JPH10269235A
Application number: JP9071930A
Authority: JP
Inventors: Yasuo Tanosaki; 康雄田野崎; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科; Naohide Kubota; 直秀久保田
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1997-03-25
Filing date: 1997-03-25
Publication date: 1998-10-09

Abstract

PROBLEM TO BE SOLVED: To retrieve documents through accurate similarity calculation by determining the similarity according to the distribution of other words that appear near a word common to both a comparison source document and a comparison destination document. SOLUTION: A similarity calculation unit 20b calculates the similarity between the comparison source document and the comparison destination document using a weight coefficient calculated by a weight coefficient determination unit 20c. For a common word that a word determination function has found in both documents, the weight coefficient determination unit 20c calculates a larger weight coefficient the more words there are that lie within a certain distance of the common word and are themselves present in both documents. A document list display unit 20f then displays a list of document titles on the screen of a display device, in descending order of similarity, together with the corresponding similarity information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a similar document search apparatus and a similar document search method that search for documents based on the similarity between a comparison source document and a comparison destination document.

[0002]

2. Description of the Related Art

In recent years, large amounts of electronic document data have come into circulation, and systems that search a document database for documents similar to a designated document, for purposes such as automatic classification, have been put to practical use.

[0003] In these systems, to retrieve documents similar to a designated document (called the comparison source document), a similarity is calculated between the comparison source document and each document in the database (called the comparison destination documents), and the comparison destination documents with large similarity values are output as the search result.

[0004] The similarity is generally calculated with a method based on the vector space model, in which the similarity increases as two documents share more words. In this method, words are first extracted from both the comparison source document and the comparison destination document, and vector data whose elements are the word weights is created for each document.

[0005] The weight for a word is a positive value when the word is present and 0 when it is absent. The frequency of occurrence of each word in the document is used as the weight value. After vector data is calculated for the words extracted from both the comparison source document and the comparison destination document, the inner product of the two vectors is calculated and used as the similarity.

[0006] FIG. 13 shows the concept of the conventional similarity calculation method. As shown in FIG. 13, the comparison source document contains the words A, B, C, D, E, and F, whereas the comparison destination document contains the words A, C, E, G, H, and I. In this case, words A, C, and E each occur once in both documents, so the similarity for each matching pair of words is 1 and the similarity between the comparison source document and the comparison destination document is 3.
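The conventional calculation in the FIG. 13 example can be sketched as follows. This is a minimal illustration, assuming that each document is already available as a list of extracted words; the function names are hypothetical and do not come from the patent.

```python
from collections import Counter


def frequency_vector(words):
    """Map each word to its number of occurrences in the document."""
    return Counter(words)


def inner_product_similarity(src_words, dst_words):
    """Inner product of the two frequency vectors.

    A word that is absent from a document has weight 0, so only
    words present in both documents contribute to the sum.
    """
    src = frequency_vector(src_words)
    dst = frequency_vector(dst_words)
    return sum(count * dst[word] for word, count in src.items())


# Words of the comparison source and destination documents in FIG. 13.
source = ["A", "B", "C", "D", "E", "F"]
destination = ["A", "C", "E", "G", "H", "I"]
print(inner_product_similarity(source, destination))  # 3
```

Words A, C, and E each contribute 1 x 1 to the sum, which reproduces the similarity of 3 stated in the text.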

[0007] Weights can also be assigned according to co-occurrence relationships between words. In this case, co-occurrence relationships between words are registered in a dictionary in advance, and words with a registered co-occurrence relationship are given a large weight, so that the calculated similarity reflects the presence or absence of related words.

[0008]

PROBLEMS TO BE SOLVED BY THE INVENTION

As described above, the conventional similarity calculation method requires each word to be weighted when the similarity between documents is calculated. However, when, for example, word A is present in both the comparison source document and the comparison destination document, the weight used is exclusively the product of the frequencies of word A in the two documents. How other words appear in the vicinity of word A, which is what actually characterizes the content of the documents, is not taken into account.

[0009] Weights can also be assigned according to co-occurrence relationships, but the co-occurrence relationships between words must be registered in a dictionary in advance, and the appearance of other words is not taken into account unless they are registered in the dictionary.

[0010] For this reason, it has been difficult for similar document search apparatuses that use the conventional similarity calculation to obtain search results of sufficient accuracy. The present invention has been made in view of these circumstances. Its object is to provide a similar document search apparatus and a similar document search method that can search for documents through accurate similarity calculation by taking into account the distribution of other words located near words present in both the comparison source document and the comparison destination document.

[0011]

SUMMARY OF THE INVENTION

The present invention is a similar document search apparatus that searches, based on a comparison source document, for comparison destination documents similar to the comparison source document, characterized in that documents are searched for by determining the similarity between the comparison source document and a comparison destination document from the distribution of other words located near a word that is present in both documents.

[0012] The present invention is also a similar document search apparatus that searches, based on a comparison source document, for comparison destination documents similar to the comparison source document, characterized by comprising: word extraction means for extracting words from each of the comparison source document and the comparison destination document; word information storage means for storing the words extracted by the word extraction means as word information together with their appearance position information in each document; word determination means for extracting common words present in both the comparison source document and the comparison destination document; word search means for searching, based on the word information stored by the word information storage means, for words that are present in both the comparison source document and the comparison destination document within a certain distance from a common word extracted by the word determination means; and weight calculation means for calculating a weight for the common word in both documents according to the total number of words found by the word search means within the certain distance from the common word.

[0013] Further, the apparatus searches for comparison destination documents similar to the comparison source document using a method based on the vector space model, and is characterized by comprising: term data calculation means that takes the weight calculated by the weight calculation means as a weight coefficient and uses that weight coefficient to calculate the data of one term of the inner product between the word vectors for the words extracted by the word extraction means; and inner product data normalization means for normalizing the inner product data using the weight coefficient.

[0014] With this configuration, in the process of calculating the similarity, the weight of a common word can be set larger the more words there are that are present in both documents within a certain distance from the common word. This enables highly accurate similarity calculation that considers not only the occurrence counts of individual words but also how other words appear around each word. Furthermore, whereas the conventional technique required co-occurrence relationships to be registered in a dictionary in advance in order to give a large weight to co-occurring words, the present invention removes this requirement.

[0015] Further, the weight coefficient is a real number of 1 or more, and is made larger the smaller the distance, in the comparison source document and in the comparison destination document, between the common word and each word that is present in both documents within the certain distance from the common word.

[0016] As a result, when the inner product is calculated, the weight for a common word can be set larger the more densely words appearing in both the comparison source document and the comparison destination document are concentrated around the common word, which enables accurate weighting.
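The properties required of the weight coefficient can be illustrated with a small sketch. The text here states only the constraints (a real number of 1 or more, growing with the number of shared nearby words, and larger when those words are closer to the common word); it does not give a formula. The formula below is therefore purely an assumption chosen to satisfy those constraints, and the function name and the distance representation are hypothetical.

```python
def weight_coefficient(shared_neighbors):
    """Illustrative weight for one common word (assumed formula).

    shared_neighbors is a list of (dist_src, dist_dst) pairs, one per
    word found near the common word in both documents, giving that
    word's distance from the common word in each document.
    """
    coef = 1.0  # no shared neighbors leaves the weight at its minimum, 1
    for dist_src, dist_dst in shared_neighbors:
        # Each shared neighbor adds a positive amount that shrinks
        # as the neighbor lies farther from the common word.
        coef += 1.0 / (1.0 + dist_src + dist_dst)
    return coef


print(weight_coefficient([]))                # 1.0
print(weight_coefficient([(1, 1)]))          # larger than 1.0
print(weight_coefficient([(1, 1), (2, 2)]))  # larger again
```

Any other function that is at least 1, increasing in the number of shared neighbors, and decreasing in their distances would fit the description equally well.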

[0017]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a similar document search apparatus according to the present embodiment. The similar document search apparatus is implemented by a computer that reads a program recorded on a recording medium such as a magnetic disk and whose operation is controlled by that program.

[0018] As shown in FIG. 1, the similar document search apparatus of the present embodiment comprises a control device 10, an input device 11, a display device 12, an external storage device 13, a memory 14, and a communication device 15, which are interconnected via a bus.

[0019] The control device 10 consists of a CPU, is connected to the various hardware devices via the bus, and performs processing such as controlling each device and transferring data between devices.

[0020] The input device 11 consists of a keyboard, a mouse, and the like; it accepts various data and commands for the apparatus and outputs them to the control device 10. The display device 12 consists of a color liquid crystal display and its controller, and displays documents, search results (title lists), and so on under the control of the control device 10.

[0021] The external storage device 13 consists of a recording medium such as a hard disk and a controller, and stores the document data to be searched, the word information data corresponding to each document, and so on. The storage format of the document data and word information data stored in the external storage device 13 will be described later (FIG. 3).

[0022] The memory 14 consists of dynamic RAM and contains a program section, which stores the programs the control device 10 uses to execute various kinds of control and processing, and a buffer section, which stores data needed during processing. The detailed configurations of the program section and the buffer section will be described later (FIG. 2).

[0023] The communication device 15 exchanges data with external systems over a communication line under the control of the control device 10, and consists of, for example, a LAN line and a LAN controller.

[0024] Next, the program section 20 and the buffer section 22 provided in the memory 14 will be described in detail with reference to FIG. 2. As shown in FIG. 2, the program section 20 contains a main processing unit 20a and, as subroutines called by the main processing unit 20a, a similarity calculation unit 20b, a weight coefficient determination unit 20c, a term data addition unit 20d, an inner product data normalization unit 20e, a document list display unit 20f, a document selection unit 20g, and a document content display unit 20h.

[0025] The main processing unit 20a controls the overall processing and calls each unit as a subroutine. The main processing unit 20a implements a word extraction function, which extracts words from the documents to be processed (the comparison source document and the comparison destination documents) by, for example, Japanese language analysis, and a word information storage function, which saves the words extracted by the word extraction function as word information data together with information indicating their number of occurrences and appearance positions in the document.

[0026] The similarity calculation unit 20b calculates the similarity between the comparison source document and a comparison destination document using the weight coefficients calculated by the weight coefficient determination unit 20c. The similarity calculation unit 20b also implements a word determination function, which determines whether a word common to the comparison source document and the comparison destination document (a common word) exists.

[0027] For a common word identified by the word determination function, the weight coefficient determination unit 20c calculates a larger weight coefficient for the common word in both the comparison source document and the comparison destination document according to the total number of words that are present in both documents within a certain distance from the common word. The weight coefficient determination unit 20c calculates the coefficient so that it is a real number of 1 or more, and so that it becomes larger the smaller the distance, in each of the comparison source document and the comparison destination document, between the common word and each word that is present in both documents within the certain distance from the common word.

[0028] The term data addition unit 20d calculates the data of one term of the inner product between the word vectors, using the weight coefficients for the word obtained by the weight coefficient determination unit 20c.

[0029] The inner product data normalization unit 20e normalizes the inner product data using the weight coefficients obtained by the weight coefficient determination unit 20c. The document list display unit 20f refers to the contents of the similarity storage buffer 22d (similarity information) and to the document title data in each document's data, and displays a list of document titles on the screen of the display device 12, in descending order of similarity, together with the corresponding similarity information.
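The ordering performed by the document list display unit 20f can be sketched as follows. This is only an illustration, assuming that titles and similarities are held in two lists indexed by document ID; the function name and the sample values are hypothetical.

```python
def rank_by_similarity(titles, similarities):
    """Pair each title with its similarity, most similar first."""
    indexed = list(zip(titles, similarities))
    return sorted(indexed, key=lambda item: item[1], reverse=True)


# Hypothetical titles and similarity values, indexed by document ID.
titles = ["Doc 0", "Doc 1", "Doc 2"]
similarities = [0.2, 0.9, 0.5]

for title, score in rank_by_similarity(titles, similarities):
    print(f"{score:.2f}  {title}")
```

Keeping the similarity next to each title mirrors the description, in which the list is shown together with the corresponding similarity information.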

[0030] The document selection unit 20g lets the user select, with the input device 11, one or more document titles from the title list displayed on the screen of the display device 12 by the document list display unit 20f.

[0031] The document content display unit 20h displays on the screen of the display device 12 the content of the document corresponding to the document title selected through the document selection unit 20g. The buffer section 22 contains a comparison source document data storage buffer 22a, a comparison source word information storage buffer 22b, a comparison destination word information storage buffer 22c, a similarity storage buffer 22d, and an area 22e for work variables.

[0032] The comparison source document data storage buffer 22a stores the document data of the comparison source document, and can hold document data in the structure shown in FIG. 3, described later.

[0033] The comparison source word information storage buffer 22b is used to store the words extracted from the document data of the comparison source document, together with information indicating their number of occurrences and appearance positions in the document.

[0034] The comparison destination word information storage buffer 22c is used to store the words extracted from the document data of the comparison destination document, together with information indicating their number of occurrences and appearance positions in the document.

[0035] The comparison source word information storage buffer 22b and the comparison destination word information storage buffer 22c are structured so that the extracted words, together with information indicating each word's number of occurrences and appearance positions in the document, can be stored as word information data in the format shown in FIG. 5, described later.

[0036] The similarity storage buffer 22d is used to store the similarity obtained for each comparison destination document. The buffer section 22 also reserves an area 22e for the work variables (work variable area 22e). The work variable area 22e holds the variables used in the processing described later, such as the document ID counting variable idDoc, the counting variables iSrc, iDst, kSrc, and kDst, the similarity storage variable S, and the weight coefficient variables coefSrc and coefDst.

[0037] Next, the storage format of the document data and word information data stored in the external storage device 13 will be described with reference to FIG. 3. Each item of document data consists of a header part and a text data part; the header part contains data representing the attributes of the document, such as title data, creation date and time data, and author data. The document data items are stored in ID number order (0, 1, 2, ..., N-1). Each item of document data is stored together with its associated word information data. The word information data holds the words (nouns in the present embodiment) extracted from the corresponding document data (text data) by morphological analysis and, for each word, information indicating its number of occurrences and its appearance positions in the document.
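The storage layout described above can be pictured with a small record type. This is a sketch under stated assumptions: the class names, field names, and field types are hypothetical, and the patent does not specify how the header fields or positions are encoded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordEntry:
    word: str             # extracted word (a noun in the embodiment)
    count: int            # number of occurrences in the text
    positions: List[int]  # appearance positions (unit is an assumption)


@dataclass
class DocumentRecord:
    doc_id: int           # ID number, 0 .. N-1
    title: str            # header part: title data
    created: str          # header part: creation date and time data
    author: str           # header part: author data
    text: str             # text data part
    word_info: List[WordEntry] = field(default_factory=list)


# Records are kept in ID number order, as in the description.
store = [
    DocumentRecord(0, "Sample title", "1997-03-25", "Sample author",
                   "sample text",
                   [WordEntry("sample", 1, [0]), WordEntry("text", 1, [1])]),
]
print(store[0].word_info[0])
```

Storing the word information next to the document it belongs to reflects the statement that each item of document data is stored with its associated word information data.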

[0038] FIG. 4 shows an example of a document (text), and FIG. 5 shows an example of the word information data corresponding to the document shown in FIG. 4. As shown in FIG. 5, the word information data stores the number of occurrences of each word in the text and each of its appearance positions.
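Word information data of this kind can be built in a single pass over the extracted words. The sketch below assumes that the text has already been reduced to a list of tokens (the embodiment uses morphological analysis to extract nouns, which is not reproduced here) and that a position is the token's index in that list; both are assumptions, as is the function name.

```python
def build_word_info(tokens):
    """Collect, for each word, its occurrence count and positions."""
    positions = {}
    for index, token in enumerate(tokens):
        positions.setdefault(token, []).append(index)
    return {
        word: {"count": len(where), "positions": where}
        for word, where in positions.items()
    }


info = build_word_info(["document", "search", "document", "similarity"])
print(info["document"])  # {'count': 2, 'positions': [0, 2]}
```

Recording every position, not just the count, is what later allows distances between a common word and its neighbors to be measured.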

[0039] The word information data may be supplied together with the corresponding document data, or may be obtained by applying the word extraction function and the word information storage function to the document data of the comparison source document and the comparison destination documents.

[0040] Next, the overall operation of the similar document search apparatus of the present embodiment will be described with reference to the flowchart shown in FIG. 7. When a word common to both the comparison source document and the comparison destination document (a common word) exists, the similar document search apparatus of the present embodiment determines the similarity by taking into account, as the weight for the common word, the distribution of the words located near it. FIG. 6 shows the concept of the similarity calculation method of the present embodiment. As shown in FIG. 6, when a word A common to the comparison source document and the comparison destination document exists and the same words, for example words C and E, are present within a certain distance of word A in both documents, the two documents can be judged to be similar. By reflecting this distribution of words near the common word in the weight (weight coefficient) for word A, the similar document search apparatus of the present embodiment enables highly accurate similarity search.
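The idea in FIG. 6, finding words that lie near the common word in both documents, can be sketched as follows. This is a minimal illustration, assuming that each document is a list of tokens and that distance is measured as a difference of token indexes; the function names and the sample documents are hypothetical.

```python
def nearby_words(tokens, center, max_distance):
    """Return {word: smallest distance to any occurrence of center}."""
    centers = [i for i, token in enumerate(tokens) if token == center]
    nearest = {}
    for i, token in enumerate(tokens):
        if token == center:
            continue
        for c in centers:
            distance = abs(i - c)
            # Keep only words inside the window, at their closest distance.
            if distance <= max_distance and distance < nearest.get(token, max_distance + 1):
                nearest[token] = distance
    return nearest


def shared_nearby_words(src_tokens, dst_tokens, common, max_distance):
    """Words near the common word in both documents, with both distances."""
    src_near = nearby_words(src_tokens, common, max_distance)
    dst_near = nearby_words(dst_tokens, common, max_distance)
    return {word: (src_near[word], dst_near[word])
            for word in src_near.keys() & dst_near.keys()}


src = ["C", "A", "E", "B"]
dst = ["G", "C", "A", "H", "E"]
print(shared_nearby_words(src, dst, "A", 2))
```

Here words C and E are found within two positions of word A in both documents, while B, G, and H each appear near A in only one document and are left out.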

[0041] Here, it is assumed that a total of N items of document data and word information data in the format shown in FIG. 3, with ID numbers 0 through N-1, are stored in the external storage device 13 in advance.

[0042] The main processing unit 20a, stored in the program section 20 of the memory 14, controls the overall processing. First, the document data that the user inputs through the input device 11 or the communication device 15 is stored in the comparison source document data storage buffer 22a (step A1). The comparison source document data contains a header part, which represents the attributes of the document such as title data, creation date and time data, and author data, and a text data part.

[0043] Next, using the word extraction function, the main processing unit 20a performs morphological analysis on the text data of the comparison source document data stored in the comparison source document data storage buffer 22a and extracts the nouns contained in the text (step A2).

[0044] Using the word information storage function, the main processing unit 20a creates the comparison source word information data and stores it in the comparison source word information storage buffer 22b (step A3). The comparison source word information data has the same structure as the word information data shown in FIG. 5, and holds the number of occurrences of each word and its appearance positions in the text.

[0045] Next, in order to process the comparison destination documents one by one, the main processing unit 20a sets the document ID counting variable idDoc, which indicates the current comparison destination document, to the initial value 0 and stores it in the work variable area 22e (step A4).

[0046] The main processing unit 20a then reads the word information data corresponding to the document ID counting variable idDoc from the external storage device 13 and stores it in the comparison destination word information storage buffer 22c (step A5).

[0047] Subsequently, the main processing unit 20a starts the similarity calculation unit 20b. The similarity calculation unit 20b refers to the word information data of the comparison source document stored in the comparison source word information storage buffer 22b and the word information data of the comparison destination document stored in the comparison destination word information storage buffer 22c, and calculates the similarity between the comparison source document and the comparison destination document indicated by the document ID counting variable idDoc (step A6).

[0048] The similarity calculation processing performed by the similarity calculation unit 20b in step A6 is described in detail below with reference to the flowchart shown in FIG. 8. In the following, the headword of the i-th word stored in the comparison source document's word information data is denoted wordSrc[i], its number of occurrences nApSrc[i], and its j-th appearance position posSrc[i][j]; the headword of the i-th word stored in the comparison destination document's word information data is denoted wordDst[i], its number of occurrences nApDst[i], and its j-th appearance position posDst[i][j]. The weight coefficient for each word vector element of the comparison source document is the real variable coefSrc[iSrc], and that of the comparison destination document is the real variable coefDst[iDst]. The total number of words stored in the comparison source document data is denoted nSrc, and the total number of words stored in the comparison destination word information data is denoted nDst.

[0049] First, the similarity calculation unit 20b stores the similarity storage variable S for the comparison destination document currently being processed in the work variable area 22e with the initial value 0 (step B1). Next, the similarity calculation unit 20b assigns 0 to the variable iSrc, which indicates a word of the comparison source document (step B2), and assigns the initial value 1.0 to the weight coefficient coefSrc[iSrc] for the word indicated by iSrc (step B3).

[0050] Next, the similarity calculation unit 20b assigns 0 to the variable iDst, which indicates a word of the comparison destination document (step B4), and assigns the initial value 1.0 to the weight coefficient coefDst[iDst] for the word indicated by iDst (step B5).

[0051] Next, using the word determination function, the similarity calculation unit 20b compares the word wordSrc[iSrc] of the comparison source document with the word wordDst[iDst] of the comparison destination document to determine whether they are equal, that is, whether they are a common word (step B6). If they are equal, control moves to step B7; if they are not equal, control moves to step B9.

[0052] If they are not equal, the weight coefficients coefSrc[iSrc] and coefDst[iDst] for the words indicated by iSrc and iDst are not updated and remain at 1.0.

[0053] In step B7, on the other hand, the weight coefficient determination unit 20c is activated. For the word-vector elements corresponding to the common word, namely wordSrc[iSrc] and wordDst[iDst], the weight coefficient determination unit 20c updates the weight coefficient variables coefSrc[iSrc] and coefDst[iDst] according to the number of words that are present in both the comparison source document and the comparison destination document within a fixed distance of the common word (the more such words there are, the larger the weight). The weight coefficient determination processing performed by the weight coefficient determination unit 20c is described in detail later (flowchart shown in FIG. 9).

[0054] Next, the term data addition unit 20d is activated. Referring to the values of the weight coefficient variables determined by the weight coefficient determination unit 20c, the term data addition unit 20d adds one term of the inner product of the word vectors to the similarity storage variable S (step B8). Specifically, the addition in the term data addition unit 20d is performed according to the following equation (1).

【００５５】[0055]

【数１】 (Equation 1)

[0056] In equation (1), nApSrc[iSrc] is the number of occurrences of the comparison source word and nApDst[iDst] is the number of occurrences of the comparison destination word; these are the quantities used as weights in an ordinary inner product calculation. Their product is multiplied by the previously obtained weight coefficient variables coefSrc[iSrc] and coefDst[iDst].
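The image of equation (1) did not survive extraction. Based only on the description in this paragraph (the product of the two occurrence counts, multiplied by the two weight coefficients, added to S), the update presumably has the following form. This is a reconstruction from the prose, not the original figure.

```latex
S \leftarrow S + \mathrm{coefSrc}[iSrc] \cdot \mathrm{coefDst}[iDst] \cdot \mathrm{nApSrc}[iSrc] \cdot \mathrm{nApDst}[iDst]
```

With both coefficients left at their initial value of 1.0, this reduces to one term of the ordinary inner product of occurrence-count vectors.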

[0057] When the addition is finished, processing moves to step B9. In step B9, the similarity calculation unit 20b adds 1 to the variable iDst so that the next word of the comparison destination document becomes the processing target. If the value of iDst is smaller than the total number nDst of words in the comparison destination document, that is, if not all words of the comparison destination document have been processed, processing returns to step B5 and is repeated.

[0058] If, on the other hand, the value of iDst is equal to or greater than nDst, every word of the comparison destination document has been processed against the word of the comparison source document indicated by iSrc. The similarity calculation unit 20b therefore adds 1 to the variable iSrc so that the next word of the comparison source document becomes the processing target (step B11). If the value of iSrc is smaller than the total number nSrc of words in the comparison source document, that is, if not all words of the comparison source document have been processed, processing returns to step B3 and is repeated.

[0059] If the value of iSrc is equal to or greater than nSrc, the inner product data normalization unit 20e is activated. The inner product data normalization unit 20e normalizes the inner product data by dividing the value of the similarity storage variable S by the product of the magnitudes of the two word vectors (step B13). The specific equation (2) used by the inner product data normalization unit 20e is shown below.

【００６０】[0060]

【数２】 (Equation 2)
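Steps B1 to B13 can be sketched as a double loop followed by a normalization. This is an illustrative sketch under stated assumptions, not the patent's implementation: the tuple layout Word, the callback weight, and the helper unit_weight are hypothetical, and because the image of equation (2) is missing, the vector magnitude is assumed here to be the Euclidean norm of the unweighted occurrence counts.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical layout: (headword, list of occurrence positions)
Word = Tuple[str, List[int]]


def similarity(src: List[Word], dst: List[Word],
               weight: Callable[[int, int], Tuple[float, float]]) -> float:
    s = 0.0                                            # step B1
    for i_src, (w_src, p_src) in enumerate(src):       # steps B2, B11, B12
        for i_dst, (w_dst, p_dst) in enumerate(dst):   # steps B4, B9, B10
            if w_src != w_dst:                         # step B6: not a common word
                continue
            coef_src, coef_dst = weight(i_src, i_dst)  # step B7
            # step B8: add one weighted term of the inner product
            s += coef_src * coef_dst * len(p_src) * len(p_dst)
    # step B13: normalize (assumed form: unweighted Euclidean norms)
    norm_src = math.sqrt(sum(len(p) ** 2 for _, p in src))
    norm_dst = math.sqrt(sum(len(p) ** 2 for _, p in dst))
    if norm_src == 0.0 or norm_dst == 0.0:
        return 0.0
    return s / (norm_src * norm_dst)


def unit_weight(i_src: int, i_dst: int) -> Tuple[float, float]:
    """Stand-in for the weight coefficient determination: no boost."""
    return 1.0, 1.0
```

With unit_weight the function is plain cosine similarity over occurrence counts. Supplying a weighting callback that returns values above 1.0 raises the score for common words that share nearby context.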

[0061] When the processing by the inner product data normalization unit 20e is finished, the similarity calculation processing of step A6 in FIG. 7 ends. Next, the weight coefficient determination processing performed by the weight coefficient determination unit 20c in step B7 of FIG. 8 is described in detail with reference to the flowchart shown in FIG. 9.

[0062] The weight coefficient determination processing looks at words that are present in both the comparison source document and the comparison destination document. It checks whether such words lie within a fixed distance of the word wordSrc[iSrc] in the comparison source document and, based on their number, resets the value of the weight coefficient variable coefSrc[iSrc] (initially set to 1.0). It likewise checks whether such words lie within a fixed distance of the word wordDst[iDst] in the comparison destination document and, based on their number, resets the value of the weight coefficient variable coefDst[iDst] (initially set to 1.0).

[0063] First, the weight coefficient determination unit 20c initializes to 0 both sumSrc, the total number of words satisfying the condition for the word wordSrc[iSrc] of the comparison source document, and sumDst, the total number of words satisfying the condition for the word wordDst[iDst] (steps C1 and C2).

[0064] The weight coefficient determination unit 20c also initializes to 0 the variable kSrc, which represents a word of the comparison source document (step C3). Next, the weight coefficient determination unit 20c compares the variable iSrc (which indicates the common word) with the variable kSrc; if the two values are the same, they represent the same word, so control moves to step C14 (step C4). In step C14, the value of kSrc is incremented to move processing to the next word of the comparison source document. If the incremented value of kSrc is smaller than the total number nSrc of words in the comparison source document, control moves to step C4 (step C15).

[0065] If, on the other hand, iSrc and kSrc differ, the weight coefficient determination unit 20c initializes to 0 the variable kDst, which represents a word of the comparison destination document (step C5).

[0066] Next, the weight coefficient determination unit 20c compares the variable iDst (which indicates the common word in the comparison destination document) with the variable kDst (step C6). If the two values are the same, they represent the same word, so control moves to step C12. In step C12, the value of kDst is incremented to move processing to the next word of the comparison destination document. If the incremented value of kDst is smaller than the total number nDst of words in the comparison destination document, control moves to step C6 (step C13).

[0067] If, on the other hand, iDst and kDst differ, the weight coefficient determination unit 20c compares the word wordSrc[kSrc] indicated by kSrc with the word wordDst[kDst] of the comparison destination document to determine whether they are equal (step C7).

[0068] If the two words wordSrc[kSrc] and wordDst[kDst] are equal, the weight coefficient determination unit 20c moves control to step C8; if they are not equal, it moves control to step C12.

[0069] If they are not equal, the value of kDst is incremented in order to look for an equal word, moving processing to the next word of the comparison destination document (step C12). If the incremented value of kDst is smaller than the total number nDst of words in the comparison destination document, control moves to step C6 (step C13). The same processing is repeated until wordSrc[kSrc] and wordDst[kDst] are equal or until all words of the comparison destination document have been examined.

[0070] In step C8, the weight coefficient determination unit 20c determines whether the distance between the common word wordSrc[iSrc] in the comparison source document and the word wordSrc[kSrc], which was also found in the comparison destination document, is within a fixed range (for example, 80). The distance is the absolute value of the difference between the occurrence positions of the two words. If wordSrc[kSrc] occurs two or more times, every occurrence position is used to determine whether the distance from the common word wordSrc[iSrc] is within the fixed range. The weight coefficient determination unit 20c then obtains countSrc, the total number of occurrence positions that satisfy the condition for the word represented by kSrc. The specific processing is described later with reference to the flowchart of FIG. 10.

[0071] The weight coefficient determination unit 20c adds the count countSrc obtained in step C8 for the word represented by kSrc to the total sumSrc (step C9).

[0072] In step C10, the weight coefficient determination unit 20c determines whether the distance between the common word wordDst[iDst] in the comparison destination document and the word wordDst[kDst], which was also found in the comparison source document, is within a fixed range (for example, 80). The distance is the absolute value of the difference between the occurrence positions of the two words. If wordDst[kDst] occurs two or more times, every occurrence position is used to determine whether the distance from the common word wordDst[iDst] is within the fixed range. The weight coefficient determination unit 20c then obtains countDst, the total number of occurrence positions that satisfy the condition. The specific processing is described later with reference to the flowchart of FIG. 11.

[0073] The weight coefficient determination unit 20c adds the count countDst obtained in step C10 for the word represented by kDst to the total sumDst (step C11).

[0074] Next, to move processing to the next word of the comparison source document, the weight coefficient determination unit 20c moves control to step C14 and increments the value of kSrc. If the incremented value of kSrc is smaller than the total number nSrc of words in the comparison source document, control moves to step C4 (step C15), and the same processing is carried out for the word of the comparison source document indicated by the new value of kSrc.

[0075] If, after kDst is incremented in step C12, its value is not smaller than the total number nDst of words in the comparison destination document, every word wordDst[kDst] has been processed against the word wordSrc[kSrc], so processing moves to step C14 in order to proceed to the next word of the comparison source document.

[0076] If, after kSrc is incremented in step C14, its value is not smaller than the total number nSrc of words in the comparison source document, every word wordSrc[kSrc] of the comparison source document has been processed, so processing moves to step C16.

[0077] Through the processing so far, sumSrc, the total number of words that are present in both the comparison source document and the comparison destination document and that lie within the fixed distance of the word wordSrc[iSrc] in the comparison source document, is obtained as the sum of the counts countSrc found for each qualifying word represented by kSrc (0 to nSrc-1). The processing that yields each countSrc is described later with reference to FIG. 10.

[0078] Similarly, sumDst, the total number of words that are present in both the comparison source document and the comparison destination document and that also lie within the fixed distance of the word wordDst[iDst] in the comparison destination document, is obtained as the sum of the counts countDst found for each qualifying word represented by kDst (0 to nDst-1). The processing that yields each countDst is described later with reference to FIG. 11.

[0079] Referring to the value of sumSrc obtained above, the weight coefficient determination unit 20c calculates the value of the weight coefficient variable coefSrc[iSrc] for the word wordSrc[iSrc] of the comparison source document (step C16). The value used for coefSrc[iSrc] is 1 plus the square root of sumSrc.

[0080] Similarly, referring to the value of sumDst obtained above, the weight coefficient determination unit 20c calculates the value of the weight coefficient variable coefDst[iDst] for the word wordDst[iDst] of the comparison destination document (step C17). In the similar document search apparatus of this embodiment, the value used for coefDst[iDst] is 1 plus the square root of sumDst.
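The flow of FIG. 9 (steps C1 to C17) can be sketched as follows. This is a minimal illustration, not the patent's code: the tuple layout Word, the helper count_near, and the constant MAX_DIST are hypothetical names, and the strict comparison against 80 follows the description of step D5 (the text elsewhere says "80 or less", so the boundary case is an open choice).

```python
import math
from typing import List, Tuple

# Hypothetical layout: (headword, list of occurrence positions)
Word = Tuple[str, List[int]]
MAX_DIST = 80  # example threshold given in the text


def count_near(pos_a: List[int], pos_b: List[int],
               max_dist: int = MAX_DIST) -> int:
    """Number of position pairs closer than max_dist (steps C8 / C10)."""
    return sum(1 for a in pos_a for b in pos_b if abs(a - b) < max_dist)


def weight_coefficients(src: List[Word], dst: List[Word],
                        i_src: int, i_dst: int) -> Tuple[float, float]:
    sum_src = 0                                        # step C1
    sum_dst = 0                                        # step C2
    for k_src, (w_k_src, p_k_src) in enumerate(src):   # steps C3, C14, C15
        if k_src == i_src:                             # step C4: skip the common word
            continue
        for k_dst, (w_k_dst, p_k_dst) in enumerate(dst):  # steps C5, C12, C13
            if k_dst == i_dst:                         # step C6
                continue
            if w_k_src != w_k_dst:                     # step C7
                continue
            sum_src += count_near(src[i_src][1], p_k_src)  # steps C8, C9
            sum_dst += count_near(dst[i_dst][1], p_k_dst)  # steps C10, C11
            break  # headwords are unique, so move to the next source word
    coef_src = 1.0 + math.sqrt(sum_src)                # step C16
    coef_dst = 1.0 + math.sqrt(sum_dst)                # step C17
    return coef_src, coef_dst
```

The square root keeps the coefficient from growing linearly with the number of nearby shared words, and the added 1 keeps it at 1.0 when no such words are found.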

[0081] This completes the weight coefficient determination processing by the weight coefficient determination unit 20c. Next, the processing of step C8 shown in FIG. 9 is described in detail with reference to the flowchart shown in FIG. 10.

[0082] The processing of step C8 refers to all occurrence positions of the word wordSrc[iSrc] and the word wordSrc[kSrc] in the text of the comparison source document and obtains countSrc, the total number of position pairs whose distance, that is, the absolute value of the difference between the occurrence positions, is 80 or less.

[0083] First, the weight coefficient determination unit 20c initializes countSrc to 0 (step D1). Then, in the loop from step D2 to step D10, the value j1, which corresponds to each occurrence position of the word wordSrc[iSrc], is varied from 0 up to the number of occurrences nApSrc[iSrc]. Inside this loop, in the loop from step D3 to step D8, the value j0, which corresponds to each occurrence position of the word wordSrc[kSrc], is varied from 0 up to the number of occurrences nApSrc[kSrc].

[0084] That is, for every combination of j0 and j1, the distance between the occurrence position posSrc[kSrc][j0] of the word represented by kSrc and the occurrence position posSrc[iSrc][j1] of the word represented by iSrc, namely the absolute value dSrc of their difference, is calculated by equation (3) below (step D4).

【００８５】[0085]

【数３】 (Equation 3) $\mathrm{dSrc} = \left|\,\mathrm{posSrc}[kSrc][j0] - \mathrm{posSrc}[iSrc][j1]\,\right|$

[0086] Further, if the comparison in step D5 finds that the value of dSrc is smaller than the predetermined distance of 80, the weight coefficient determination unit 20c increments the value of countSrc in step D6.
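The nested loop of FIG. 10 (steps D1 to D6) can be sketched as below. The function name count_within_distance and its parameters are hypothetical. The strict comparison follows the wording of step D5; paragraph [0082] says "80 or less", so which boundary the original intends is uncertain.

```python
from typing import List


def count_within_distance(pos_i: List[int], pos_k: List[int],
                          max_dist: int = 80) -> int:
    """Count occurrence pairs of two words that lie close together.

    pos_i: occurrence positions of the common word (posSrc[iSrc][j1])
    pos_k: occurrence positions of the other word (posSrc[kSrc][j0])
    """
    count = 0                       # step D1: initialize countSrc
    for p_i in pos_i:               # outer loop over j1 (steps D2 to D10)
        for p_k in pos_k:           # inner loop over j0 (steps D3 to D8)
            d = abs(p_k - p_i)      # step D4: dSrc
            if d < max_dist:        # step D5: compare with the threshold
                count += 1          # step D6: increment countSrc
    return count
```

The same function applies unchanged to the comparison destination document (FIG. 11) by passing posDst positions instead.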

[0087] Next, the processing of step C10 shown in FIG. 9 is described in detail with reference to the flowchart shown in FIG. 11. The processing of step C10 refers to all occurrence positions of the word wordDst[iDst] and the word wordDst[kDst] in the text of the comparison destination document and obtains countDst, the total number of position pairs whose distance, that is, the absolute value of the difference between the occurrence positions, is 80 or less.

[0088] First, the weight coefficient determination unit 20c initializes countDst to 0 (step E1). Then, in the loop from step E2 to step E10, the value j1, which corresponds to each occurrence position of the word wordDst[iDst], is varied from 0 up to the number of occurrences nApDst[iDst]. Inside this loop, in the loop from step E3 to step E8, the value j0, which corresponds to each occurrence position of the word wordDst[kDst], is varied from 0 up to the number of occurrences nApDst[kDst].

[0089] That is, for every combination of j0 and j1, the distance between the occurrence position posDst[kDst][j0] of the word represented by kDst and the occurrence position posDst[iDst][j1] of the word represented by iDst, namely the absolute value dDst of their difference, is calculated by equation (4) below (step E4).

【００９０】[0090]

【数４】 (Equation 4) $\mathrm{dDst} = \left|\,\mathrm{posDst}[kDst][j0] - \mathrm{posDst}[iDst][j1]\,\right|$

[0091] Further, if the comparison in step E5 finds that the value of dDst is smaller than the predetermined distance of 80, the weight coefficient determination unit 20c increments the value of countDst in step E6.

[0092] In the processing shown in the flowcharts of FIGS. 10 and 11 above, the values of countSrc and countDst are calculated considering only the total number of words present near the common word in the comparison source document and the comparison destination document. In the similar document search apparatus of this embodiment, however, the distance of each qualifying word from the common word in the comparison source document and the comparison destination document can, when so specified, also be reflected in the values of countSrc and countDst. In this case, step D6 of FIG. 10 calculates the value of the real-valued variable countSrc by the following equation (5).

【００９３】[0093]

【数５】 (Equation 5)

Correspondingly, step E6 of FIG. 11 calculates the value of the real-valued variable countDst by the following equation (6).

【００９４】[0094]

【数６】 (Equation 6)

[0095] After the similarity has been calculated by the processing of step A6 shown in FIG. 7 as described above, the similarity obtained is stored in the similarity storage buffer 22d at the position corresponding to the value of the document ID counting variable idDoc (step A7).

[0096] Next, the value of the document ID counting variable idDoc is incremented in order to move on to the next comparison destination document (step A8). The value of idDoc is then compared with the total number N of documents stored in the external storage device 13 (step A9). If idDoc is smaller than N, the processing from step A5 is repeated; otherwise control moves to step A10.

[0097] In step A10, the document list display unit 20f is activated. Referring to the contents of the similarity storage buffer 22d and the title data in each document's data, the document list display unit 20f displays a list of titles on the screen of the display device 12, in descending order of similarity, together with the corresponding similarity information. FIG. 12 shows the screen after the title list has been displayed.
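The ordering performed in step A10 amounts to sorting titles by their stored similarity. The sketch below assumes the similarity storage buffer can be read as a list indexed by document ID; the function name rank_titles and both parameter names are hypothetical.

```python
from typing import List, Tuple


def rank_titles(titles: List[str],
                similarities: List[float]) -> List[Tuple[str, float]]:
    """Pair each title with its similarity and sort, most similar first.

    titles[i] and similarities[i] are assumed to belong to document ID i.
    """
    pairs = list(zip(titles, similarities))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for title, score in rank_titles(["A", "B", "C"], [0.2, 0.9, 0.5]):
    print(f"{score:.2f}  {title}")
```

Python's sort is stable, so documents with equal similarity keep their document ID order.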

[0098] Next, the document selection unit 20g is activated. The document selection unit 20g has the user select, with the input device 11, one or more of the titles displayed on the screen (step A11).

[0099] Next, the document content display unit 20h is activated. The document content display unit 20h displays, on the screen of the display device 12, the content of the document corresponding to the title selected in step A11 (step A12).

[0100] In this way, when a word common to both the comparison source document and the comparison destination document (a common word) exists, the similarity is obtained with a weight for that common word that takes into account the distribution of the words present near it, and documents are retrieved on that basis. Documents can therefore be retrieved with high accuracy.

[0101] The present invention is not limited to the above embodiment. For example, in the addition processing of step B8 by the term data addition unit 20d, equation (1) described above need not be used as long as the value of the weight coefficient variable is reflected in the result.

[0102] Likewise, the equations used by the weight coefficient determination unit 20c in steps C16 and C17 need not be the ones described, as long as they have the property that the weight coefficient becomes larger as the value of countSrc or countDst becomes larger.

[0103] Furthermore, although the value 80 was used for the distance comparison in step C8 of FIG. 9 (FIG. 10) and step C10 (FIG. 11), it can be set to any other suitable value. Various other modifications are possible without departing from the spirit of the invention.

[0104] The method described in the above embodiment can also be written, as a program executable by a computer, to a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory and applied to various apparatuses, or transmitted over a communication medium and applied to various apparatuses. A computer that implements this apparatus reads the program recorded on the recording medium and executes the processing described above with its operation controlled by the program.

【０１０５】[0105]

[Effect of the Invention] As described in detail above, according to the present invention, documents can be retrieved by an accurate similarity calculation that takes into account the distribution of other words present near the words found in both the comparison source document and the comparison destination document.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a similar document search apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing the detailed configuration of the program unit 20 and the buffer unit 22 provided in the memory 14.

FIG. 3 is a diagram for explaining the storage format of the document data and word information data stored in the external storage device 13.

FIG. 4 is a diagram showing an example of a document (text).

FIG. 5 is a diagram showing an example of the word information data corresponding to the document shown in FIG. 4.

FIG. 6 is a diagram showing the concept of the similarity calculation method in the embodiment.

FIG. 7 is a flowchart for explaining the overall operation of the similar document search apparatus in the embodiment.

FIG. 8 is a flowchart for explaining the details of the similarity calculation processing by the similarity calculation unit 20b.

FIG. 9 is a flowchart for explaining the details of the weight coefficient determination processing by the weight coefficient determination unit 20c.

FIG. 10 is a flowchart for explaining the details of the processing that obtains the total number countSrc of occurrence positions.

FIG. 11 is a flowchart for explaining the details of the processing that obtains the total number countDst of occurrence positions.

FIG. 12 is a diagram for explaining the screen after the title list has been displayed.

FIG. 13 is a diagram showing the concept of a conventional similarity calculation method.

[Explanation of symbols]

10: control device
11: input device
12: display device
13: external storage device
14: memory
15: communication device
20: program unit
20a: main processing unit
20b: similarity calculation unit
20c: weight coefficient determination unit
20d: term data addition unit
20e: inner product data normalization unit
20f: document list display unit
20g: document selection unit
20h: document content display unit
22: buffer unit
22a: comparison source document data storage buffer
22b: comparison source word information storage buffer
22c: comparison destination word information storage buffer
22d: similarity storage buffer
22e: work variable area

Continued from the front page
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Co., Ltd., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Co., Ltd., 1381-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Naohide Kubota, c/o Toshiba Computer Engineering Co., Ltd., 1381-1 Shinmachi, Ome-shi, Tokyo

Claims

1. A similar document search apparatus for searching for a comparison destination document similar to a comparison source document based on the comparison source document, wherein a similarity between the comparison source document and the comparison destination document is obtained based on the distribution state of other words present in the vicinity of a word that is present in common in both the comparison source document and the comparison destination document, and a document is searched for based on the similarity.

2. A similar document search apparatus for searching for a comparison destination document similar to a comparison source document based on the comparison source document, comprising: word extraction means for extracting words from each of the comparison source document and the comparison destination document; word information storage means for storing the words extracted by the word extraction means, together with appearance position information in each document, as word information; word determination means for extracting common words that are common to the comparison source document and the comparison destination document; word search means for searching, based on the word information stored by the word information storage means, for words that are present in common within a certain distance from a common word extracted by the word determination means in both the comparison source document and the comparison destination document; and weight calculation means for calculating a weight for the common word in both the comparison source document and the comparison destination document according to the total number of words, found by the word search means, that are present within the certain distance from the common word.
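For illustration only, the means recited in claim 2 can be sketched in Python as follows. The regular-expression tokenizer, the window size `max_dist`, the function names, and the mapping `1 + count` from the number of shared nearby words to a weight are all assumptions made for this sketch; the claim only requires that the weight depend on the total number of words found near the common word in both documents.

```python
import re
from collections import defaultdict


def extract_word_info(text):
    """Word extraction and storage: map each word to its appearance positions."""
    info = defaultdict(list)
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        info[word].append(pos)
    return dict(info)


def nearby_words(info, word, max_dist):
    """Other words occurring within max_dist positions of any occurrence of `word`."""
    found = set()
    for center in info[word]:
        for other, positions in info.items():
            if other != word and any(abs(p - center) <= max_dist for p in positions):
                found.add(other)
    return found


def common_word_weights(src_info, dst_info, max_dist=2):
    """Weight per common word, growing with the number of shared neighbours.

    The mapping 1 + count is a hypothetical choice, not taken from the patent.
    """
    weights = {}
    for word in src_info.keys() & dst_info.keys():        # word determination
        near_src = nearby_words(src_info, word, max_dist)  # word search (source)
        near_dst = nearby_words(dst_info, word, max_dist)  # word search (destination)
        weights[word] = 1.0 + len(near_src & near_dst)     # weight calculation
    return weights


src = extract_word_info("similar document search by word weight")
dst = extract_word_info("fast document search and document ranking")
weights = common_word_weights(src, dst)
```

In this toy input, "document" and "search" are the common words, and each has the other as its only shared neighbour within two positions, so both receive the same weight.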

3. The similar document search apparatus according to claim 2, wherein a comparison destination document similar to the comparison source document is searched for using a method based on a vector space model, the weight calculated by the weight calculation means is used as a weight coefficient, and the apparatus further comprises: term data calculation means for calculating, using the weight coefficient obtained by the weight calculation means, one term of the inner product between word vectors for a word extracted by the word extraction means; and inner product data normalization means for normalizing the inner product data using the weight coefficient.

4. The similar document search apparatus according to claim 3, wherein the weight coefficient is a real number of 1 or more and, for each word that is present within the certain distance from the common word in both the comparison source document and the comparison destination document, takes a larger value as the distance from the common word in the comparison source document and the comparison destination document becomes smaller.
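As a rough sketch of how claims 3 and 4 could fit together, the following Python code scales each inner-product term and each norm by a per-word weight coefficient. The weighted-cosine formula, the `1 / (dist_src + dist_dst)` contribution, the function names, and the sample term frequencies are assumptions chosen for illustration; the claims only state that the weight is a real number of 1 or more, that it grows as nearby words sit closer to the common word, and that it is used both in the inner-product terms and in the normalization.

```python
import math


def proximity_weight(neighbor_distances):
    """Weight >= 1 that grows as shared neighbours sit closer to the common word.

    neighbor_distances holds one (dist_in_src, dist_in_dst) pair per shared
    neighbour, each distance being a positive integer. The reciprocal-sum
    formula is an assumed example, not the patent's formula.
    """
    return 1.0 + sum(1.0 / (ds + dd) for ds, dd in neighbor_distances)


def weighted_similarity(src_tf, dst_tf, weights):
    """Weighted cosine similarity between two term-frequency dictionaries."""
    def w(word):
        return weights.get(word, 1.0)  # words without a weight count as 1

    # One term of the inner product per common word, scaled by its weight.
    inner = sum(w(t) * src_tf[t] * dst_tf[t] for t in src_tf.keys() & dst_tf.keys())
    # Normalization uses the same weights so the result stays within [0, 1].
    norm_src = math.sqrt(sum(w(t) * f * f for t, f in src_tf.items()))
    norm_dst = math.sqrt(sum(w(t) * f * f for t, f in dst_tf.items()))
    if norm_src == 0 or norm_dst == 0:
        return 0.0
    return inner / (norm_src * norm_dst)


src_tf = {"document": 1, "search": 1, "weight": 1}
dst_tf = {"document": 2, "search": 1, "ranking": 1}
plain = weighted_similarity(src_tf, dst_tf, {})
boosted = weighted_similarity(src_tf, dst_tf, {"document": 2.0, "search": 2.0})
```

With these sample values, raising the weights of the two common words increases the similarity relative to the unweighted cosine, which is the effect the weight coefficient is meant to have.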

5. A similar document search method for searching for a comparison destination document similar to a comparison source document based on the comparison source document, wherein a similarity between the comparison source document and the comparison destination document is obtained based on the distribution state of other words present in the vicinity of a word that is present in common in both the comparison source document and the comparison destination document, and a document is searched for based on the similarity.

6. A similar document search method for searching for a comparison destination document similar to a comparison source document based on the comparison source document, comprising: extracting words from each of the comparison source document and the comparison destination document; storing the extracted words, together with appearance position information in each document, as word information; extracting common words that are common to the comparison source document and the comparison destination document; searching for words that are present, in both the comparison source document and the comparison destination document, within a certain distance from an extracted common word; calculating a weight for the common word in both the comparison source document and the comparison destination document according to the total number of words found within the certain distance from the common word; and obtaining the similarity between the comparison source document and the comparison destination document using the calculated weight, thereby searching for similar documents.