JP3690216B2

JP3690216B2 - Document similarity calculation method, system and apparatus, and recording medium recording similarity calculation program

Info

Publication number: JP3690216B2
Application number: JP33638099A
Authority: JP
Inventors: 直毅藤田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-11-26
Filing date: 1999-11-26
Publication date: 2005-08-31
Anticipated expiration: 2019-11-26
Also published as: JP2001155027A

Description

【０００１】
【発明の属する技術分野】
本発明は類似度計算技術に関し、特に、検索や分類に利用して好適とされる類似文書検索システム、類似文書検索方法および類似文書検索用プログラムを記録した記録媒体に関する。
【０００２】
【従来の技術】
情報検索の分野における従来の類似文書検索方法として、例えば、文献１(G.Salton,M.McGill, Introduction to Modern Information Retrieval, NewYork, McGraw-Hill, 1983)に記載されているように、文書における単語の出現頻度を基に、文書の距離あるいは類似度を計算する方法が知られている。この従来の類似文書検索方法においては、各文書における単語の頻度ベクトルを求め、各々のベクトルに、ＴＦ・ＩＤＦと呼ばれる重み付けを行ない、二つのベクトルのなす角度のコサイン値（cosθ＝（x・y）/｜x｜｜y｜、但し、（ｘ・ｙ）は二つのベクトルｘ、ｙの内積、｜x｜、｜y｜は絶対値）を文書間の類似度とする。ＴＦ・ＩＤＦ法において、単語の重要度は、出現頻度ｔｆと、出現件数ｄｆの逆数ｉｄｆを用いて定義され、カテゴリCiにおける単語tkの重要度（重み）W(tk，Ci)は、W(tk，Ci)＝tf(tk，Ci)log（Li／df(tk，Ci)+1）と表され、tf(tk，Ci)はカテゴリCiにおける単語ｔｋの出現頻度、df(tk，Ci)は、カテゴリCiにおける単語tkの出現件数、LiはカテゴリCiにおける総テキスト件数を表している。
【０００３】
また例えば特開平１１−１３４３５９号公報等には、２つの文書の類似度を計算するにあたり、入力された２組の文書及び概要の形態素解析を行って単語を抽出し、概要に含まれる単語に重み付けして、それぞれの文書に含まれる単語に基づいて２つの文書の類似度を計算するようにした方法が提案されている。
【０００４】
【発明が解決しようとする課題】
しかしながら、上記した従来の方法は下記記載の問題点を有している。
【０００５】
第１の問題点は、文書間の類似度の精度が低い、ということである。
【０００６】
その理由は、単語ベクトルを基準とした浅い意味処理しか行なっていないためである。
【０００７】
第２の問題点は、複合語の扱いが難かしい、ということである。
【０００８】
その理由は、文書を一つのベクトルで表現しようとしたためである。
【０００９】
したがって、本発明は、上記問題点に鑑みてなされたものであって、その目的は、より深い意味処理ができる類似文書検索システム及び方法並びに記録媒体を提供することにある。
【００１０】
本発明の他の目的は、複合語や連語の扱いが容易な類似文書検索システム及び方法並びに記録媒体を提供することにある。
【００１１】
【課題を解決するための手段】
前記目的を達成する本発明の類似検索システムは、単語辞書に基づく第１の類似度計算手段と、パタン辞書に基づく第２の類似度計算手段と、２つの類似度から１つの類似度を計算する手段とを備えている。
【００１２】
また本発明は、単語辞書に基づく第１の類似度計算手段と、第１のパタン辞書に基づく第２の類似度計算手段と、第２のパタン辞書に基づく第３の類似度計算手段と、３つの類似度から１つの類似度を計算するよう手段と、を備えている。
【００１３】
【発明の実施の形態】
本発明の実施の形態について説明する。本発明の装置は、その好ましい一実施の形態において、図１を参照すると、記憶された文書の単語辞書による第１のベクトルを記憶する第１のベクトル記憶手段（３１）と、単語辞書を記憶する単語辞書記憶手段（３２）と、記憶された文書のパタン辞書による第２のベクトルを記憶する第２のベクトル記憶手段（３３）と、パタン辞書を記憶するパタン辞書記憶手段（３４）と、を記憶装置（３）が備え、文書を入力する文書入力手段（１）を備え、データ処理装置が、前記単語辞書記憶手段（３４）に記憶された単語により、入力された文書を単語列に分解する単語検索手段（２１）と、分解された単語列の単語頻度を計数し、入力された文書の単語辞書による第１のベクトルを生成する第１のベクトル生成手段（２２）と、前記第１のベクトル生成手段（２２）で生成された第１のベクトルと前記第１のベクトル記憶手段（３１）に記憶された第１のベクトルとを比較し、その類似度を計算する第１の類似度計算手段（２３）と、前記単語検索手段（２１）により分解された単語の配列を文書パタンとする文書パタン生成手段（２４）と、前記パタン辞書記憶手段（３４）に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解するパタン検索手段（２５）と、パタンおよび単語の配列のパタン頻度を計数し、前記入力された文書の前記パタン辞書による第２のベクトルを生成する第２のベクトル生成手段（２６）と、第２のベクトル生成手段（２６）で生成された第２のベクトルと前記第２のベクトル記憶手段（３３）に記憶された第２のベクトルとを比較し、その類似度を計算する第２の類似度計算手段（２７）と、第１の類似度計算手段（２３）の出力と第２の類似度計算手段（２７）の出力を統合して１つの類似度として出力する類似度統合手段（２０）と、を備え、さらに、類似度統合手段（２０）から出力された類似度を出力する手段（４）を備えている。
【００１４】
本発明は、別の実施の形態において、図３を参照すると、記憶装置（３）が、記憶された文書の単語辞書による第１のベクトルを記憶する第１のベクトル記憶手段（３１）と、単語辞書を記憶する単語辞書記憶手段（３２）と、記憶された文書のパタン辞書による第２のベクトルを記憶する第２のベクトル記憶手段（３３）と、第１のパタン辞書を記憶するパタン辞書記憶手段（３４）と、記憶された文書の第２のパタン辞書による第３のベクトルを記憶する第３のベクトル記憶手段（３５）と、第２のパタン辞書を記憶する第２のパタン辞書記憶手段（３６）と、をさらに備え、データ処理装置が、前記単語辞書記憶手段に記憶された単語により、入力された文書を単語列に分解する単語検索手段（２１）と、分解された単語列の単語頻度を計数し、入力された文書の単語辞書によるベクトルを生成する第１のベクトル生成手段（２２）と、前記第１のベクトル生成手段（２２）で生成された第１のベクトルと前記第１のベクトル記憶手段（３１）に記憶された第１のベクトルとを比較し、その類似度を計算する第１の類似度計算手段（２３）と、前記単語検索手段（２１）により分解された単語の配列を文書パタンとする文書パタン生成手段（２４）と、第１のパタン辞書記憶手段（３４）に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解する第１のパタン検索手段（２５）と、パタンおよび単語の配列のパタン頻度を計数し、前記入力された文書の前記第１のパタン辞書による第２のベクトルを生成する第２のベクトル生成手段（２６）と、第２のベクトル生成手段（２６）で生成された第２のベクトルと前記第２のベクトル記憶手段（３３）に記憶された第２のベクトルとを比較し、その類似度を計算する第２の類似度計算手段（２７）と、第２のパタン辞書記憶手段（３６）に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解する第２のパタン検索手段（２８）と、パタンおよび単語の配列のパタン頻度を計数し、前記入力された文書の前記第２のパタン辞書による第３のベクトルを生成する第３のベクトル生成手段（２９）と、第３のベクトル生成手段（２９）で生成された第３のベクトルと第３のベクトル記憶手段（３５）に記憶された第３のベクトルとを比較し、その類似度を計算する第３の類似度計算手段（２Ａ）と、第１乃至第３の類似度計算手段の出力を統合して１つの類似度として出力する類似度統合手段（２０）と、を備えている。
【００１５】
本発明の実施の形態において、データ処理装置が具備する各手段は、データ処理装置で実行されるプログラムによりその処理・機能が実現される。この場合、該プログラムを記録した記録媒体から所定の読み出し装置を介してデータ処理装置の主記憶に実行形式のプログラムをロードして実行することで、本発明を実施することができる。
【００１６】
【実施例】
上記した本発明の実施の形態についてさらに詳細に説明すべく、本発明の実施例について図面を参照して詳細に説明する。図１は、本発明の第１の実施例の構成を示す図である。図１を参照すると、本発明の第１の実施例は、文書入力手段１と、プログラム制御により動作するデータ処理装置２と、記憶装置３と、類似度出力手段４と、を備えている。
【００１７】
データ処理装置２は、図１を参照すると、類似度統合手段２０と、単語検索手段２１と、第１のベクトル生成手段２２と、第１の類似度計算手段２３と、文書パタン生成手段２４と、パタン検索手段２５と、第２のベクトル生成手段２６と、第２の類似度計算手段２７とを含む。
【００１８】
記憶装置３は、第１のベクトル記憶手段３１と、単語辞書記憶手段３２と、第２のベクトル記憶手段３３と、パタン辞書記憶手段３４とを含む。
【００１９】
これらの手段はそれぞれ概略つぎのように動作する。
【００２０】
文書入力手段１は、文書をデータ処理装置２に入力する。
【００２１】
類似度出力手段４は、記憶された文書と入力された文書の類似度を出力する。
【００２２】
類似度統合手段２０は、第１の類似度計算手段２３の出力と、第２の類似度計算手段２７の出力を統合して１つの類似度として出力する。
【００２３】
単語検索手段２１は、単語辞書記憶手段３２に記憶された単語により入力された文書を単語列に分解する。
【００２４】
第１のベクトル生成手段２２は、分解された単語列の単語頻度を計数し、入力された文書の単語辞書によるベクトルを生成する。
【００２５】
第１の類似度計算手段２３は、第１のベクトル生成手段２２で生成されたベクトルと第１のベクトル記憶手段３１に記憶されたベクトルを比較し、その類似度を計算する。
【００２６】
文書パタン生成手段２４は、単語検索手段２１により分解された単語の配列を文書パタンとする。
【００２７】
パタン検索手段２５は、パタン辞書記憶手段３４に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解する。
【００２８】
第２のベクトル生成手段２６は、パタンおよび単語の配列のパタン頻度を計数し、入力された文書のパタン辞書によるベクトルを生成する。
【００２９】
第２の類似度計算手段２７は、第２のベクトル生成手段２６で生成されたベクトルと第２のベクトル記憶手段３３に記憶されたベクトルを比較し、その類似度を計算する。
【００３０】
第１のベクトル記憶手段３１は、記憶された文書の単語辞書によるベクトルを記憶する。
【００３１】
単語辞書記憶手段３２は、単語検索手段２１で利用する単語辞書を記憶する。
【００３２】
第２のベクトル記憶手段３３は、記憶された文書のパタン辞書によるベクトルを記憶する。
【００３３】
パタン辞書記憶手段３４は、パタン検索手段２５で利用するパタン辞書を記憶する。
【００３４】
次に図２は、本発明の第１の実施例の処理手順を示す流れ図である。図１及び図２を参照して、本発明の第１の実施例の全体の動作について詳細に説明する。
【００３５】
まず、文書をデータ処理装置２に入力する（図２のステップＳ１）。
【００３６】
次に、単語検索を行ない文書を単語列に置き換える（ステップＳ２）。
【００３７】
さらに、単語の頻度を計数し第１のベクトルを生成する（ステップＳ３）。
【００３８】
さらに、生成されたベクトルと第１のベクトル記憶手段３１に記憶されたベクトルとの間で、第１の類似度の計算を行なう（ステップＳ４）。
【００３９】
次に、単語列を文書パタンとみなし（ステップＳ５）、文書パタンに対してパタン検索を行ない文書をパタン列に置き換える（ステップＳ６）。
【００４０】
パタンの頻度を計数し第２のベクトルを生成する（ステップＳ７）。
【００４１】
生成されたベクトルと第２のベクトル記憶手段３３に記憶されたベクトルとの間で、第２の類似度の計算を行なう（ステップＳ８）。
【００４２】
最後に、得られた２つの類似度の統合類似度を計算し、出力する（ステップＳ９）。
【００４３】
次に、本発明の第１の実施例の作用効果について説明する。
【００４４】
本発明の第１の実施例では、意味情報を抽出するためのパタンをパタン辞書として持ち、単語辞書を利用した浅い意味処理による類似度とパタン辞書を利用したより深い意味処理による類似度を同時に有効に利用できるように構成されている。このため、従来の方法よりも深い意味処理ができる。
【００４５】
次に、本発明の第１の実施例について具体例に則して説明する。
【００４６】
図６乃至図８に示すように、「資料を送付してくださいね。」という内容の文書が入力されたとすると、図６の単語辞書により、
「資料」、「を」、「送付」、「して」、「ください」、「ね」、「。」
と分解され、さらに、
「資料」、「を」、「送付」、「する」、「くださる」、「ね」、「。」
と正規形に変換される。
【００４７】
これから単語頻度ベクトル（第１のベクトル）として、
（「資料」1，「を」1，「送付」1，「する」1，「くださる」1，「ね」1，「。」1）
が得られる。これを基に公知のＴＦ・ＩＤＦ法等を用いて類似度が計算される。
【００４８】
次に、図８の文書パタンの例に示されているように、
「資料を送付してくださいね。」という内容の文書（元文書）は、例えば、
「資料を送付してくださいね。＄」
という形（文書パタン例１）に変換される（ただし、＄は文末を表す記号）。
【００４９】
また、「する」、「ね」、「。」などを不要語とする不要語辞書を利用すると、
「資料を送付＊ください＊＄」
という形（文書パタン例２）にも変換される。
【００５０】
このような文書パタンに対して、図７に示すようなパタン辞書を利用すると、「資料送付希望」という形に変換され、これから、ベクトル（第２ベクトル）として、
（「資料」1，「送付希望」1）
が得られる。
【００５１】
このように、複数の抽象度に応じたベクトルを生成することにより、目的に応じた抽象度の表現を選択することができる。
【００５２】
これも、前回同様に、例えば公知のＴＦ・ＩＤＦ法などにより、類似度が計算される。
【００５３】
得られた２つの類似度は、単純積や重み付き和により、一つの類似度に変換され出力される。
【００５４】
なお、本発明の第１の実施例において、得られた類似度を昇順又は降順にソートするようにしてもよい。
【００５５】
次に、本発明の第２の実施例について図面を参照して詳細に説明する。
【００５６】
図３は、本発明の第２の実施例の構成を示す図である。図３を参照すると、本発明の第２の実施例は、文書入力手段１と、プログラム制御により動作するデータ処理装置２と、記憶装置３と、類似度出力手段４とを備えている。
【００５７】
データ処理装置２は、類似度統合手段２０と、単語検索手段２１と、第１のベクトル生成手段２２と、第１の類似度計算手段２３と、文書パタン生成手段２４と、第１のパタン検索手段２５と、第２のベクトル生成手段２６と、第２の類似度計算手段２７と、第２のパタン検索手段２８と、第３のベクトル生成手段２９と、第３の類似度計算手段２Ａとを含む。
【００５８】
記憶装置３は、第１のベクトル記憶手段３１と、単語辞書記憶手段３２と、第２のベクトル記憶手段３３と、第１のパタン辞書記憶手段３４と、第３のベクトル記憶手段３５と、第２のパタン辞書記憶手段３６とを含む。
【００５９】
これらの手段はそれぞれ概略つぎのような機能を有する。
【００６０】
文書入力手段１は、文書をデータ処理装置２に入力する。
【００６１】
類似度出力手段４は、記憶された文書と入力された文書の類似度を出力する。
【００６２】
類似度統合手段２０は、第１の類似度計算手段２３の出力と、第２の類似度計算手段２７の出力と、第３の類似度計算手段２Ａの出力を統合して１つの類似度として出力する。
【００６３】
単語検索手段２１は、単語辞書記憶手段３２に記憶された単語により入力された文書を単語列に分解する。
【００６４】
第１のベクトル生成手段２２は、分解された単語列の単語頻度を計数し、入力された文書の単語辞書によるベクトルを生成する。
【００６５】
第１の類似度計算手段２３は、第１のベクトル生成手段２２で生成されたベクトルと第１のベクトル記憶手段３１に記憶されたベクトルとを比較し、その類似度を計算する。
【００６６】
文書パタン生成手段２４は、単語検索手段２１により分解された単語の配列を文書パタンとする。
【００６７】
第１のパタン検索手段２５は、第１のパタン辞書記憶手段３４に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解する。
【００６８】
第２のベクトル生成手段２６は、パタンおよび単語の配列のパタン頻度を計数し、入力された文書のパタン辞書によるベクトルを生成する。
【００６９】
第２の類似度計算手段２７は、第２のベクトル生成手段２６で生成されたベクトルと第２のベクトル記憶手段３３に記憶されたベクトルとを比較し、その類似度を計算する。
【００７０】
第２のパタン検索手段２８は、第２のパタン辞書記憶手段３６に記憶されたパタンにより、文書パタンを走査し、パタンおよび単語の配列に分解する。
【００７１】
第３のベクトル生成手段２９は、パタンおよび単語の配列のパタン頻度を計数し、入力された文書のパタン辞書によるベクトルを生成する。
【００７２】
第３の類似度計算手段２Ａは、第３のベクトル生成手段２９で生成されたベクトルと第３のベクトル記憶手段３５に記憶されたベクトルを比較し、その類似度を計算する。
【００７３】
第１のベクトル記憶手段３１は、記憶された文書の単語辞書によるベクトルを記憶する。
【００７４】
単語辞書記憶手段３２は、単語検索手段２１で利用する単語辞書を記憶する。
【００７５】
第２のベクトル記憶手段３３は、記憶された文書のパタン辞書によるベクトルを記憶する。
【００７６】
パタン辞書記憶手段３４は、パタン検索手段２５で利用するパタン辞書を記憶する。
【００７７】
次に、図４は、本発明の第２の実施例の処理手順を示す流れ図である。図３及び図４を参照して、本発明の第２の実施例の全体の動作について詳細に説明する。なお、図４のステップＳ１〜Ｓ８は、図２に示した処理と実質的に同一である。
【００７８】
まず、文書をデータ処理装置に入力する（図４のステップＳ１）。
【００７９】
次に、単語検索を行ない文書を単語列に置き換える（ステップＳ２）。
【００８０】
さらに、単語の頻度を計数し第１のベクトルを生成する（ステップＳ３）。
【００８１】
さらに、生成されたベクトルと第１のベクトル記憶手段３１に記憶されたベクトルとの間で、第１の類似度計算を行なう（ステップＳ４）。
【００８２】
次に、単語列を文書パタンとみなし（ステップＳ５）、文書パタンに対して第１のパタン辞書を利用してパタン検索を行ない、文書をパタン列に置き換える（ステップＳ６）。
【００８３】
さらに、パタンの頻度を計数し第２のベクトルを生成する（ステップＳ７）。
【００８４】
さらに、生成されたベクトルと第２のベクトル記憶手段３３に記憶されたベクトルとの間で、第２の類似度計算を行なう（ステップＳ８）。
【００８５】
文書パタンに対して第２のパタン辞書を利用してパタン検索を行ない文書をパタン列に置き換える（ステップＳ１０）。
【００８６】
さらに、パタンの頻度を計数し第３のベクトルを生成する（ステップＳ１１）。
【００８７】
さらに、生成された第３のベクトルと第３のベクトル記憶手段３５に記憶されたベクトルとの間で、第３の類似度計算を行なう（ステップＳ１２）。
【００８８】
最後に、得られた３つの類似度の統合類似度を計算し出力する（ステップＳ１３）。
【００８９】
次に、本発明の第２の実施例の作用効果について説明する。
【００９０】
本実施例では、意味情報を抽出するためのパタンをパタン辞書として持ち、単語辞書を利用した浅い意味処理による類似度とパタン辞書を利用したより深い意味処理による類似度を同時に有効に利用できるように構成されているため、従来より深い意味処理ができる。
【００９１】
次に、本発明の第２の実施例について具体例に則して説明する。
【００９２】
図６乃至図８に示すように、「資料を送付してくださいね。」という内容の文書が入力されたとすると、図６の単語辞書により、
「資料」、「を」、「送付」、「して」、「ください」、「ね」、「。」
と分解され、さらに、
「資料」、「を」、「送付」、「する」、「くださる」、「ね」、「。」
と正規形に変換される。
【００９３】
これから単語頻度ベクトルとして、
（「資料」1，「を」1，「送付」1，「する」1，「くださる」1，「ね」1，「。」1）
が得られ、これを基に、例えば公知のＴＦ・ＩＤＦ法などにより、類似度が計算される。
【００９４】
次に、図８の文書パタンの例にあるように、「資料を送付してくださいね。」という内容の文書は、例えば、
「資料を送付してくださいね。＄」
という形に変換される（ただし、＄は文末を表す記号）。
【００９５】
また、「する」「ね」「。」などを不要語とする不要語辞書を利用すると、
「資料を送付＊ください＊＄」
という形にも書ける。
【００９６】
このような文書パタンに対し、図７に示すようなパタン辞書を利用すると、
「資料送付希望」という形に変換され、これからベクトルとして、
（「資料」1，「送付希望」1）
が得られる。
【００９７】
このように、複数の抽象度に応じたベクトルを生成することにより、目的に応じた抽象度の表現を選択できる。そして、前回と同様、例えば公知のＴＦ・ＩＤＦ法などにより類似度が計算される。
【００９８】
同様に、別のパタン辞書を利用すると、別の類似度が計算される。
【００９９】
得られた３つの類似度は、単純積や重み付き和により一つの類似度に変換され出力される。
【０１００】
なお、本発明の第２の実施例において、得られた類似度を昇順又は降順にソートするようにしてもよい。
【０１０１】
次に、本発明の第３の実施例について図面を参照して詳細に説明する。図５は、本発明の第３の実施例の構成を示す図である。図５を参照すると、本発明の第３の実施例は、類似度計算プログラムを記録した記録媒体５を備える。この媒体としては、ＦＤ（フレキシブルディスク）等の磁気ディスク、半導体メモリ、ＣＤ−ＲＯＭ、ＤＶＤ（digital versatile disk）、ＭＴその他の記録媒体であってよい。また、データ処理装置２が、通信手段を介して、サーバ装置等他のデータ処理装置の記憶媒体から、類似度計算プログラムをダウンロードすることで本発明を実施するようにしてもよい。
【０１０２】
類似度計算プログラムは、記録媒体５からデータ処理装置２に読み込まれ、コンピュータの動作を制御する。コンピュータは類似度計算プログラムの制御により以下の処理、すなわち、前記した第１の実施例、又は第２の実施例におけるデータ処理装置２による処理、すなわち図２、図４の流れ図で規定される処理と同一の処理を実行する。
【０１０３】
すなわち、図１を参照すると、文書入力手段１と、プログラム制御により動作するデータ処理装置２と、記憶装置３と、類似度出力手段４とを備え、データ処理装置２は、類似度統合手段２０と、単語検索手段２１と、第１のベクトル生成手段２２と、第１の類似度計算手段２３と、文書パタン生成手段２４と、パタン検索手段２５と、第２のベクトル生成手段２６と、第２の類似度計算手段２７とを含む。
【０１０４】
記憶装置３は、第１のベクトル記憶手段３１と、単語辞書記憶手段３２と、第２のベクトル記憶手段３３と、パタン辞書記憶手段３４とを含む。
【０１０５】
あるいは、図３を参照すると、文書入力手段１と、プログラム制御により動作するデータ処理装置２と、記憶装置３と、類似度出力手段４とを備え、データ処理装置２は、類似度統合手段２０と、単語検索手段２１と、第１のベクトル生成手段２２と、第１の類似度計算手段２３と、文書パタン生成手段２４と、第１のパタン検索手段２５と、第２のベクトル生成手段２６と、第２の類似度計算手段２７と、第２のパタン検索手段２８と、第３のベクトル生成手段２９と、第３の類似度計算手段２Ａとを含む。
【０１０６】
記憶装置３は、第１のベクトル記憶手段３１と、単語辞書記憶手段３２と、第２のベクトル記憶手段３３と、第１のパタン辞書記憶手段３４と、第３のベクトル記憶手段３５と、第２のパタン辞書記憶手段３６とを含む。
【０１０７】
【発明の効果】
以上説明したように、本発明によれば下記記載の効果を奏する。
【０１０８】
本発明の第１の効果は、従来の方法よりも深い意味処理を行うことができる、ということである。
【０１０９】
その理由は、本発明においては、意味情報を抽出するためのパタンをパタン辞書として備え、単語辞書を利用した浅い意味処理による類似度と、パタン辞書を利用したより深い意味処理による類似度とを同時に有効に利用できるようにしたためである。
【０１１０】
本発明の第２の効果は、複合語や連語の扱いを容易化する、ということである。
【０１１１】
その理由は、本発明においては、パタン辞書にパタンとしてこれらを登録できるためである。
【図面の簡単な説明】
【図１】本発明の第１の実施例の構成を示す図である。
【図２】本発明の第１の実施例の動作を示す流れ図である。
【図３】本発明の第２の実施例の構成を示す図である。
【図４】本発明の第２の実施例の動作を示す流れ図である。
【図５】本発明の第３の実施例の構成を示す図である。
【図６】本発明の実施例を説明するための図であり、単語辞書の具体例を示す図である。
【図７】本発明の実施例を説明するための図であり、パタン辞書の具体例を示す図である。
【図８】本発明の実施例を説明するための図であり、文書パタンの具体例を示す図である。
【符号の説明】
１文書入力手段
２データ処理装置
２０類似度統合手段
２１単語検索手段
２２第１のベクトル生成手段
２３第１の類似度計算手段
２４文書パタン生成手段
２５第１のパタン検索手段
２６第２のベクトル生成手段
２７第２の類似度計算手段
２８第２のパタン検索手段
２９第３のベクトル生成手段
２Ａ第３の類似度計算手段
３記憶装置
３１第１のベクトル記憶手段
３２単語辞書記憶手段
３３第２のベクトル記憶手段
３４パタン辞書記憶手段
４類似度出力手段
５記録媒体

[0001]
[Technical field of the invention]
The present invention relates to a similarity calculation technique, and more particularly to a similar document search system, a similar document search method, and a recording medium storing a similar document search program, all of which are well suited to search and classification.
[0002]
[Prior art]
As a conventional similar document retrieval method in the field of information retrieval, a method is known that calculates the distance or similarity between documents from the frequency with which words occur in them, as described, for example, in Reference 1 (G. Salton, M. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983). In this conventional method, a word frequency vector is obtained for each document, each vector is weighted by a scheme called TF-IDF, and the cosine of the angle between two vectors, cos θ = (x · y) / (|x| |y|), where (x · y) is the inner product of the vectors x and y and |x| and |y| are their norms, is used as the similarity between the documents. In the TF-IDF method, the importance of a word is defined using its occurrence frequency tf and idf, the inverse of the number df of texts in which it occurs. The importance (weight) W(tk, Ci) of the word tk in the category Ci is expressed as W(tk, Ci) = tf(tk, Ci) log(Li / df(tk, Ci) + 1), where tf(tk, Ci) is the occurrence frequency of the word tk in category Ci, df(tk, Ci) is the number of texts in category Ci in which the word tk occurs, and Li is the total number of texts in category Ci.
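The prior-art calculation described above can be sketched in Python as follows. This is a minimal illustration, not code from the cited reference: the function names, the sparse dict representation, and the zero weight for a word with no recorded df are assumptions made here.

```python
import math
from collections import Counter


def tfidf_weight(tf, df, total_texts):
    """Weight W = tf * log(L / df + 1), following the formula quoted above."""
    if df <= 0:
        return 0.0  # assumption: a word with no recorded df gets zero weight
    return tf * math.log(total_texts / df + 1)


def weighted_vector(words, df_table, total_texts):
    """Build a sparse TF-IDF vector (dict: word -> weight) from a word list."""
    counts = Counter(words)
    return {w: tfidf_weight(c, df_table.get(w, 0), total_texts)
            for w, c in counts.items()}


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| |y|) for sparse vectors stored as dicts."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_x * norm_y)
```

Storing only the non-zero components keeps the vectors small when the dictionary is large and each document uses few of its words.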
[0003]
Also, for example, Japanese Patent Application Laid-Open No. 11-134359 proposes a method in which, when calculating the similarity between two documents, morphological analysis is performed on two input sets, each consisting of a document and its summary, to extract words, the words contained in the summaries are weighted, and the similarity between the two documents is calculated based on the words contained in each document.
[0004]
[Problems to be solved by the invention]
However, the conventional method described above has the following problems.
[0005]
The first problem is that the accuracy of similarity between documents is low.
[0006]
This is because only shallow semantic processing based on word vectors is performed.
[0007]
The second problem is that it is difficult to handle compound words.
[0008]
The reason is that the document is expressed by one vector.
[0009]
Accordingly, the present invention has been made in view of the above problems, and an object thereof is to provide a similar document search system and method, and a recording medium, which can perform deeper semantic processing.
[0010]
Another object of the present invention is to provide a similar document search system and method, and a recording medium that can easily handle compound words and collocations.
[0011]
[Means for Solving the Problems]
The similar document search system of the present invention that achieves the above object includes first similarity calculation means based on a word dictionary, second similarity calculation means based on a pattern dictionary, and means for calculating a single similarity from the two similarities.
[0012]
The present invention also includes first similarity calculation means based on a word dictionary, second similarity calculation means based on a first pattern dictionary, third similarity calculation means based on a second pattern dictionary, and means for calculating a single similarity from the three similarities.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will now be described. Referring to FIG. 1, in a preferred embodiment of the apparatus of the present invention, a storage device (3) comprises first vector storage means (31) for storing first vectors, based on a word dictionary, of the stored documents; word dictionary storage means (32) for storing the word dictionary; second vector storage means (33) for storing second vectors, based on a pattern dictionary, of the stored documents; and pattern dictionary storage means (34) for storing the pattern dictionary. The apparatus further comprises document input means (1) for inputting a document. A data processing device comprises word search means (21) for decomposing the input document into a word string using the words stored in the word dictionary storage means (32); first vector generation means (22) for counting the word frequencies in the decomposed word string and generating a first vector, based on the word dictionary, of the input document; first similarity calculation means (23) for comparing the first vector generated by the first vector generation means (22) with a first vector stored in the first vector storage means (31) and calculating their similarity; document pattern generation means (24) for taking the sequence of words decomposed by the word search means (21) as a document pattern; pattern search means (25) for scanning the document pattern with the patterns stored in the pattern dictionary storage means (34) and decomposing it into a sequence of patterns and words; second vector generation means (26) for counting the pattern frequencies in the sequence of patterns and words and generating a second vector, based on the pattern dictionary, of the input document; second similarity calculation means (27) for comparing the second vector generated by the second vector generation means (26) with a second vector stored in the second vector storage means (33) and calculating their similarity; and similarity integration means (20) for integrating the output of the first similarity calculation means (23) and the output of the second similarity calculation means (27) and outputting the result as a single similarity. The apparatus further comprises means (4) for outputting the similarity output from the similarity integration means (20).
[0014]
In another embodiment of the present invention, referring to FIG. 3, the storage device (3) comprises first vector storage means (31) for storing first vectors, based on a word dictionary, of the stored documents; word dictionary storage means (32) for storing the word dictionary; second vector storage means (33) for storing second vectors, based on a first pattern dictionary, of the stored documents; first pattern dictionary storage means (34) for storing the first pattern dictionary; third vector storage means (35) for storing third vectors, based on a second pattern dictionary, of the stored documents; and second pattern dictionary storage means (36) for storing the second pattern dictionary. The data processing device comprises word search means (21) for decomposing the input document into a word string using the words stored in the word dictionary storage means; first vector generation means (22) for counting the word frequencies in the decomposed word string and generating a first vector, based on the word dictionary, of the input document; first similarity calculation means (23) for comparing the first vector generated by the first vector generation means (22) with a first vector stored in the first vector storage means (31) and calculating their similarity; document pattern generation means (24) for taking the sequence of words decomposed by the word search means (21) as a document pattern; first pattern search means (25) for scanning the document pattern with the patterns stored in the first pattern dictionary storage means (34) and decomposing it into a sequence of patterns and words; second vector generation means (26) for counting the pattern frequencies in that sequence and generating a second vector, based on the first pattern dictionary, of the input document; second similarity calculation means (27) for comparing the second vector generated by the second vector generation means (26) with a second vector stored in the second vector storage means (33) and calculating their similarity; second pattern search means (28) for scanning the document pattern with the patterns stored in the second pattern dictionary storage means (36) and decomposing it into a sequence of patterns and words; third vector generation means (29) for counting the pattern frequencies in that sequence and generating a third vector, based on the second pattern dictionary, of the input document; third similarity calculation means (2A) for comparing the third vector generated by the third vector generation means (29) with a third vector stored in the third vector storage means (35) and calculating their similarity; and similarity integration means (20) for integrating the outputs of the first to third similarity calculation means and outputting the result as a single similarity.
[0015]
In the embodiments of the present invention, the processing and functions of each means of the data processing device are realized by a program executed on the data processing device. In this case, the present invention can be implemented by loading the executable program from the recording medium on which it is recorded into the main memory of the data processing device via a suitable reading device and executing it.
[0016]
[Examples]
In order to describe the above-described embodiment of the present invention in more detail, examples of the present invention will be described in detail with reference to the drawings. FIG. 1 is a diagram showing the configuration of the first exemplary embodiment of the present invention. Referring to FIG. 1, the first embodiment of the present invention includes a document input unit 1, a data processing device 2 that operates under program control, a storage device 3, and a similarity output unit 4.
[0017]
Referring to FIG. 1, the data processing device 2 includes similarity integration means 20, word search means 21, first vector generation means 22, first similarity calculation means 23, document pattern generation means 24, pattern search means 25, second vector generation means 26, and second similarity calculation means 27.
[0018]
The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, and pattern dictionary storage means 34.
[0019]
Each of these means generally operates as follows.
[0020]
The document input unit 1 inputs a document to the data processing device 2.
[0021]
The similarity output unit 4 outputs the similarity between the stored document and the input document.
[0022]
The similarity integration unit 20 integrates the output of the first similarity calculation unit 23 and the output of the second similarity calculation unit 27 and outputs it as one similarity.
[0023]
The word search means 21 decomposes the input document into a word string using the words stored in the word dictionary storage means 32.
[0024]
The first vector generation means 22 counts the word frequency of the decomposed word string, and generates a vector based on the word dictionary of the input document.
[0025]
The first similarity calculation unit 23 compares the vector generated by the first vector generation unit 22 with the vector stored in the first vector storage unit 31 and calculates the similarity.
[0026]
The document pattern generation unit 24 uses the word sequence decomposed by the word search unit 21 as a document pattern.
[0027]
The pattern search means 25 scans the document pattern with the patterns stored in the pattern dictionary storage means 34 and decomposes it into a sequence of patterns and words.
[0028]
The second vector generation means 26 counts the pattern frequencies in the sequence of patterns and words and generates a vector, based on the pattern dictionary, of the input document.
[0029]
The second similarity calculation means 27 compares the vector generated by the second vector generation means 26 with the vector stored in the second vector storage means 33, and calculates the similarity.
[0030]
The first vector storage means 31 stores a vector of a stored document according to a word dictionary.
[0031]
The word dictionary storage unit 32 stores a word dictionary used by the word search unit 21.
[0032]
The second vector storage means 33 stores a vector of a stored document according to a pattern dictionary.
[0033]
The pattern dictionary storage unit 34 stores a pattern dictionary used by the pattern search unit 25.
[0034]
Next, FIG. 2 is a flowchart showing the processing procedure of the first embodiment of the present invention. The overall operation of the first embodiment of the present invention will be described in detail with reference to FIGS. 1 and 2.
[0035]
First, a document is input to the data processing device 2 (step S1 in FIG. 2).
[0036]
Next, a word search is performed and the document is replaced with a word string (step S2).
[0037]
Further, the frequency of words is counted to generate a first vector (step S3).
[0038]
Further, a first similarity is calculated between the generated vector and the vector stored in the first vector storage means 31 (step S4).
[0039]
Next, the word string is regarded as a document pattern (step S5), a pattern search is performed on the document pattern, and the document is replaced with the pattern string (step S6).
[0040]
The frequency of the pattern is counted to generate a second vector (step S7).
[0041]
A second similarity is calculated between the generated vector and the vector stored in the second vector storage means 33 (step S8).
[0042]
Finally, the integrated similarity of the two obtained similarities is calculated and output (step S9).
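Steps S1 to S9 can be read as the short pipeline sketched below. It is a simplified illustration under stated assumptions: the word search and pattern search are passed in as caller-supplied functions (`tokenize` and `to_patterns`, both hypothetical names), raw counts are compared by cosine without TF-IDF weighting, and the integration in S9 uses the simple product.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(v * y.get(k, 0) for k, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def similarity_to_stored(text, stored, tokenize, to_patterns):
    """Run steps S1-S9 for one input document against one stored document.

    stored      -- (first_vector, second_vector) of the stored document
    tokenize    -- stand-in for the word search (S2)
    to_patterns -- stand-in for the pattern search (S5-S6)
    """
    words = tokenize(text)               # S2: document -> word string
    first = Counter(words)               # S3: first vector
    sim1 = cosine(first, stored[0])      # S4: first similarity
    patterns = to_patterns(words)        # S5-S6: word string -> pattern string
    second = Counter(patterns)           # S7: second vector
    sim2 = cosine(second, stored[1])     # S8: second similarity
    return sim1 * sim2                   # S9: integration (simple product here)
```

Passing the two searches as functions keeps the flow independent of how the dictionaries are stored, which the flowchart does not specify.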
[0043]
Next, the function and effect of the first embodiment of the present invention will be described.
[0044]
In the first embodiment of the present invention, patterns for extracting semantic information are held as a pattern dictionary, and the system is configured so that the similarity obtained by shallow semantic processing using the word dictionary and the similarity obtained by deeper semantic processing using the pattern dictionary can both be used effectively at the same time. Semantic processing deeper than that of the conventional method is therefore possible.
[0045]
Next, the first embodiment of the present invention will be described based on a specific example.
[0046]
As shown in FIGS. 6 to 8, suppose that a document reading 「資料を送付してくださいね。」 ("Please send the materials, okay?") is input. Using the word dictionary of FIG. 6, it is decomposed into
「資料」 (materials), 「を」 (object particle), 「送付」 (sending), 「して」 (doing), 「ください」 (please), 「ね」 (sentence-final particle), 「。」
and then converted to the normal forms
「資料」, 「を」, 「送付」, 「する」 (do), 「くださる」 (give), 「ね」, 「。」.
[0047]
From this, the following word frequency vector (first vector) is obtained:
(「資料」 1, 「を」 1, 「送付」 1, 「する」 1, 「くださる」 1, 「ね」 1, 「。」 1)
Based on this vector, the similarity is calculated using, for example, the known TF-IDF method.
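The decomposition and normalization of the example sentence 「資料を送付してくださいね。」 can be reproduced with a small dictionary lookup. The sketch below is an assumption-laden illustration: the contents of FIG. 6 are not given in the text, so `WORD_DICT` is a hypothetical stand-in, and greedy longest-match segmentation is one common choice rather than a method the document names.

```python
from collections import Counter

# Hypothetical stand-in for the word dictionary of FIG. 6:
# surface form -> normal (dictionary) form.
WORD_DICT = {
    "資料": "資料",
    "を": "を",
    "送付": "送付",
    "して": "する",
    "ください": "くださる",
    "ね": "ね",
    "。": "。",
}


def segment(text, word_dict):
    """Greedy longest-match segmentation against the dictionary."""
    max_len = max(len(w) for w in word_dict)
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in word_dict:
                words.append(text[i:i + n])
                i += n
                break
        else:
            # Assumption: an unknown character is kept as a one-character word.
            words.append(text[i])
            i += 1
    return words


def first_vector(text, word_dict):
    """Segment, map each word to its normal form, and count frequencies."""
    surface = segment(text, word_dict)
    normal = [word_dict.get(w, w) for w in surface]
    return Counter(normal)
```

With this dictionary, `first_vector("資料を送付してくださいね。", WORD_DICT)` counts each of the seven normal forms once, matching the first vector of the example.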
[0048]
Next, as shown in the document pattern examples of FIG. 8, the document (original document) reading 「資料を送付してくださいね。」 is converted, for example, into the form
「資料を送付してくださいね。＄」
(document pattern example 1), where ＄ is a symbol representing the end of a sentence.
[0049]
Furthermore, when an unnecessary-word dictionary that treats 「する」, 「ね」, 「。」 and the like as unnecessary words is used, the document is also converted into the form
「資料を送付＊ください＊＄」
(document pattern example 2).
[0050]
When a pattern dictionary such as that shown in FIG. 7 is applied to such a document pattern, it is converted into the form 「資料送付希望」 ("request to send materials"), from which the following vector (second vector) is obtained:
(「資料」 1, 「送付希望」 1)
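The conversion from the word string to document pattern example 2 and then to the second vector can be sketched as below. Several details are assumptions made for illustration: FIG. 7 is not reproduced in the text, so the single entry in `PATTERN_DICT` is hypothetical; consecutive unnecessary words are collapsed into one ＊; and the ＊ and ＄ markers are left out of the vector.

```python
from collections import Counter

STOP_WORDS = {"する", "ね", "。"}  # unnecessary words named in the text
WILDCARD, END = "＊", "＄"

# Hypothetical stand-in for the pattern dictionary of FIG. 7:
# a token sequence -> the pattern label that replaces it.
PATTERN_DICT = {
    ("を", "送付", WILDCARD, "ください"): "送付希望",
}


def document_pattern(surface_words, normal_words, stop_words):
    """Mask runs of unnecessary words with one wildcard and append the end mark."""
    tokens = []
    for surface, normal in zip(surface_words, normal_words):
        if normal in stop_words:
            if not tokens or tokens[-1] != WILDCARD:
                tokens.append(WILDCARD)
        else:
            tokens.append(surface)
    return tokens + [END]


def apply_patterns(tokens, pattern_dict):
    """Scan left to right, replacing the longest matching token sequence."""
    lengths = sorted({len(p) for p in pattern_dict}, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for n in lengths:
            key = tuple(tokens[i:i + n])
            if key in pattern_dict:
                out.append(pattern_dict[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no pattern starts here: keep the token
            i += 1
    return out


def second_vector(tokens):
    """Count patterns and words, ignoring the wildcard and end markers."""
    return Counter(t for t in tokens if t not in (WILDCARD, END))
```

Because the pattern 「送付希望」 absorbs the particle and the polite request form, differently worded requests can map to the same vector component, which is the deeper processing the pattern dictionary is meant to provide.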
[0051]
In this way, by generating vectors at multiple levels of abstraction, a representation at the level of abstraction suited to the purpose can be selected.
[0052]
As with the first vector, the similarity for this vector is calculated by, for example, the known TF-IDF method.
[0053]
The obtained two similarities are converted into one similarity by a simple product or a weighted sum and output.
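The two integration options named above, the simple product and the weighted sum, can be written as follows. The function names are illustrative, and the weights are a tuning choice that the text does not specify.

```python
import math


def integrate_product(similarities):
    """Simple product of the individual similarities."""
    return math.prod(similarities)


def integrate_weighted_sum(similarities, weights):
    """Weighted sum; the weights are an assumption, not given in the text."""
    if len(similarities) != len(weights):
        raise ValueError("one weight per similarity is required")
    return sum(w * s for w, s in zip(weights, similarities))
```

The product drops to zero as soon as one similarity is zero, whereas the weighted sum lets a strong match at one level offset a weak match at another, so the choice depends on whether both levels are required to agree.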
[0054]
In the first embodiment of the present invention, the obtained similarities may be sorted in ascending or descending order.
[0055]
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.
[0056]
FIG. 3 is a diagram showing the configuration of the second exemplary embodiment of the present invention. Referring to FIG. 3, the second embodiment of the present invention comprises a document input means 1, a data processing device 2 operated by program control, a storage device 3, and a similarity output means 4.
[0057]
The data processing device 2 includes similarity integration means 20, word search means 21, first vector generation means 22, first similarity calculation means 23, document pattern generation means 24, first pattern search means 25, second vector generation means 26, second similarity calculation means 27, second pattern search means 28, third vector generation means 29, and third similarity calculation means 2A.
[0058]
The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, first pattern dictionary storage means 34, third vector storage means 35, and second pattern dictionary storage means 36.
[0059]
Each of these means generally has the following functions.
[0060]
The document input unit 1 inputs a document to the data processing device 2.
[0061]
The similarity output unit 4 outputs the similarity between the stored document and the input document.
[0062]
The similarity integration unit 20 integrates the outputs of the first similarity calculation unit 23, the second similarity calculation means 27, and the third similarity calculation means 2A and outputs the result as one similarity.
[0063]
The word search means 21 decomposes the input document into a word string using the words stored in the word dictionary storage means 32.
[0064]
The first vector generation means 22 counts the word frequency of the decomposed word string, and generates a vector based on the word dictionary of the input document.
[0065]
The first similarity calculation unit 23 compares the vector generated by the first vector generation unit 22 with the vector stored in the first vector storage unit 31 and calculates the similarity.
[0066]
The document pattern generation unit 24 uses the word sequence decomposed by the word search unit 21 as a document pattern.
[0067]
The first pattern search means 25 scans the document pattern with the patterns stored in the first pattern dictionary storage means 34 and decomposes it into a sequence of patterns and words.
[0068]
The second vector generation means 26 counts the pattern frequencies in the sequence of patterns and words and generates a vector, based on the first pattern dictionary, of the input document.
[0069]
The second similarity calculation means 27 compares the vector generated by the second vector generation means 26 with the vector stored in the second vector storage means 33, and calculates the similarity.
[0070]
The second pattern search means 28 scans the document pattern with the patterns stored in the second pattern dictionary storage means 36 and decomposes it into a sequence of patterns and words.
[0071]
The third vector generation means 29 counts the pattern frequencies in the sequence of patterns and words and generates a vector, based on the second pattern dictionary, of the input document.
[0072]
The third similarity calculation means 2A compares the vector generated by the third vector generation means 29 with the vector stored in the third vector storage means 35, and calculates the similarity.
[0073]
The first vector storage means 31 stores a vector of a stored document according to a word dictionary.
[0074]
The word dictionary storage unit 32 stores a word dictionary used by the word search unit 21.
[0075]
The second vector storage means 33 stores a vector of a stored document according to a pattern dictionary.
[0076]
The pattern dictionary storage unit 34 stores a pattern dictionary used by the pattern search unit 25.
[0077]
Next, FIG. 4 is a flowchart showing the processing procedure of the second embodiment of the present invention. The overall operation of the second embodiment of the present invention will be described in detail with reference to FIGS. 3 and 4. Note that steps S1 to S8 in FIG. 4 are substantially the same as the processing shown in FIG. 2.
[0078]
First, a document is input to the data processing apparatus (step S1 in FIG. 4).
[0079]
Next, a word search is performed and the document is replaced with a word string (step S2).
[0080]
Further, the frequency of words is counted to generate a first vector (step S3).
[0081]
Further, a first similarity calculation is performed between the generated vector and the vector stored in the first vector storage unit 31 (step S4).
[0082]
Next, the word string is regarded as a document pattern (step S5), a pattern search is performed on the document pattern using the first pattern dictionary, and the document is replaced with the pattern string (step S6).
[0083]
Further, the frequency of the pattern is counted to generate a second vector (step S7).
[0084]
Further, a second similarity calculation is performed between the generated vector and the vector stored in the second vector storage means 33 (step S8).
[0085]
A pattern search is performed on the document pattern using the second pattern dictionary, and the document is replaced with a pattern string (step S10).
[0086]
Further, the pattern frequency is counted to generate a third vector (step S11).
[0087]
Further, a third similarity calculation is performed between the generated third vector and the vector stored in the third vector storage means 35 (step S12).
[0088]
Finally, the integrated similarity of the three obtained similarities is calculated and output (step S13).
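The flow of steps S3 to S13 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the document has already been segmented into words (steps S1 and S2), uses cosine similarity for every comparison, and integrates by an equally weighted sum. The names `cosine`, `scan`, `integrated_similarity`, and the keys of `stored` are hypothetical.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def scan(document_pattern, pattern_dictionary):
    """Replace dictionary patterns found in the word string, longest first."""
    patterns = sorted((p for p in pattern_dictionary if p), key=len, reverse=True)
    out, i = [], 0
    while i < len(document_pattern):
        for p in patterns:
            if tuple(document_pattern[i:i + len(p)]) == p:
                out.append(pattern_dictionary[p])
                i += len(p)
                break
        else:
            # No pattern starts here, so the word is kept as it is.
            out.append(document_pattern[i])
            i += 1
    return out


def integrated_similarity(words, stored, dict1, dict2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Steps S3 to S13 for an already segmented document."""
    s1 = cosine(Counter(words), stored['first'])                           # S3, S4
    document_pattern = list(words)                                         # S5
    s2 = cosine(Counter(scan(document_pattern, dict1)), stored['second'])  # S6 to S8
    s3 = cosine(Counter(scan(document_pattern, dict2)), stored['third'])   # S10 to S12
    return sum(w * s for w, s in zip(weights, (s1, s2, s3)))               # S13
```

Each pattern dictionary is assumed here to map a tuple of words to a single pattern label; the patent does not fix a data format for the dictionaries.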
[0089]
Next, the function and effect of the second embodiment of the present invention will be described.
[0090]
In this embodiment, a pattern for extracting semantic information is provided as a pattern dictionary, and the similarity by shallow semantic processing using a word dictionary and the similarity by deeper semantic processing using a pattern dictionary can be used effectively at the same time. Therefore, deeper semantic processing can be performed.
[0091]
Next, a second embodiment of the present invention will be described based on a specific example.
[0092]
As shown in FIG. 6 to FIG. 8, if a document with the content “Please send materials” is input, the word dictionary of FIG. is used to decompose it into the word string
"Document", "O", "Send", "Do", "Please", "Ne", "."
which is then converted to normal form as
"Document", "O", "Send", "Yes", "Give me", "Ne", "."
[0093]
From this, the word frequency vector
("Document" 1, "O" 1, "Send" 1, "Yes" 1, "Give me" 1, "Ne" 1, "." 1)
is obtained. Based on this vector, the similarity is calculated by, for example, the known TF-IDF method.
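A word frequency vector of this kind and its comparison with a stored vector can be sketched as below. The sketch assumes the prior-art weight is read as tf · log(L/df + 1) and that the similarity is the cosine of the angle between the two vectors. The function names and the stored document are hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def frequency_vector(words):
    """Count how often each word occurs in the normalized word string."""
    return Counter(words)


def tfidf_weight(tf, df, total_docs):
    """Weight W = tf * log(L / df + 1), one reading of the prior-art formula."""
    return tf * math.log(total_docs / df + 1)


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Normalized word string from the example above.
words = ['Document', 'O', 'Send', 'Yes', 'Give me', 'Ne', '.']
input_vector = frequency_vector(words)

# Hypothetical stored document, used only to have something to compare with.
stored_vector = frequency_vector(['Document', 'O', 'Send', '.'])

similarity = cosine_similarity(input_vector, stored_vector)
```

In a fuller version each count would be multiplied by `tfidf_weight` before the cosine is taken; the raw counts are used here to keep the example short.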
[0094]
Next, as shown in the example of the document pattern in FIG. 8, a document with the content “Please send materials” is expressed, for example, as
"Please send me the materials. $"
(where $ is a symbol representing the end of a sentence).
[0095]
Alternatively, if a stop-word (unnecessary-word) dictionary containing unnecessary words such as “Yes”, “Ne”, and “.” is used, the document pattern can also be written in the form
“Send materials ** $”
[0096]
If a pattern dictionary as shown in FIG. 7 is applied to such a document pattern, the document is converted into the form
“Request to send materials”
and from this the vector
("Document" 1, "Request to send" 1)
is obtained.
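The pattern search and pattern frequency count described here can be sketched as follows. The contents of the pattern dictionary in FIG. 7 are not reproduced in the text, so the dictionary entry, the token list, and the longest-match scanning strategy below are assumptions made purely for illustration.

```python
from collections import Counter


def pattern_search(document_pattern, pattern_dictionary):
    """Scan left to right and replace each matched word run with its pattern label."""
    # Try longer patterns first; skip empty keys so the scan always advances.
    patterns = sorted((p for p in pattern_dictionary if p), key=len, reverse=True)
    result = []
    i = 0
    while i < len(document_pattern):
        for p in patterns:
            if tuple(document_pattern[i:i + len(p)]) == p:
                result.append(pattern_dictionary[p])
                i += len(p)
                break
        else:
            # No pattern starts at this position, so the word itself is kept.
            result.append(document_pattern[i])
            i += 1
    return result


# Hypothetical document pattern after stop-word removal ($ marks the sentence end).
document_pattern = ['Document', 'O', 'Send', 'Give me', '$']

# Hypothetical pattern dictionary entry mapping a word run to a pattern label.
pattern_dictionary = {('O', 'Send', 'Give me', '$'): 'Request to send'}

pattern_string = pattern_search(document_pattern, pattern_dictionary)
second_vector = Counter(pattern_string)
```

With these assumed inputs the resulting vector has one count for "Document" and one for "Request to send", matching the form of the vector given above.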
[0097]
Thus, by generating vectors corresponding to a plurality of abstraction levels, a representation at the abstraction level suited to the purpose can be selected. As before, the similarity is calculated by, for example, the known TF-IDF method.
[0098]
Similarly, when another pattern dictionary is used, another similarity is calculated.
[0099]
The obtained three similarities are converted into one similarity by a simple product or a weighted sum and output.
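The two integration options can be sketched as below. The similarity values and the weights are hypothetical; the text does not specify how the weights are chosen.

```python
import math


def integrate_by_product(similarities):
    """Simple product of the individual similarities."""
    return math.prod(similarities)


def integrate_by_weighted_sum(similarities, weights):
    """Weighted sum; the weights are a tuning choice, not given in the text."""
    if len(similarities) != len(weights):
        raise ValueError('one weight is needed per similarity')
    return sum(w * s for w, s in zip(weights, similarities))


# Hypothetical similarities from the word dictionary and the two pattern dictionaries.
sims = [0.8, 0.6, 0.4]
product = integrate_by_product(sims)
weighted = integrate_by_weighted_sum(sims, [0.5, 0.3, 0.2])
```

A product drives the result toward zero whenever any single similarity is low, while a weighted sum lets one dictionary be emphasized over the others.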
[0100]
In the first embodiment of the present invention, the obtained similarities may be sorted in ascending order or descending order.
[0101]
Next, a third embodiment of the present invention will be described in detail with reference to the drawings. FIG. 5 is a diagram showing the configuration of the third exemplary embodiment of the present invention. Referring to FIG. 5, the third embodiment of the present invention includes a recording medium 5 on which a similarity calculation program is recorded. The medium may be a magnetic disk such as an FD (flexible disk), a semiconductor memory, a CD-ROM, a DVD (digital versatile disk), an MT (magnetic tape), or another recording medium. Further, the present invention may be implemented by the data processing device 2 downloading the similarity calculation program from a storage medium of another data processing device, such as a server device, via a communication means.
[0102]
The similarity calculation program is read from the recording medium 5 into the data processing device 2 and controls the operation of the computer. Under the control of the similarity calculation program, the computer executes the same processing as that performed by the data processing device 2 in the first or second embodiment, that is, the processing defined by the flowcharts of FIGS.
[0103]
That is, referring to FIG. 1, a document input unit 1, a data processing device 2 that operates under program control, a storage device 3, and a similarity output unit 4 are provided. The data processing device 2 includes similarity integration means 20, word search means 21, first vector generation means 22, first similarity calculation means 23, document pattern generation means 24, pattern search means 25, second vector generation means 26, and second similarity calculation means 27.
[0104]
The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, and pattern dictionary storage means 34.
[0105]
Alternatively, referring to FIG. 3, a document input unit 1, a data processing device 2 that operates under program control, a storage device 3, and a similarity output unit 4 are provided. The data processing device 2 includes similarity integration means 20, word search means 21, first vector generation means 22, first similarity calculation means 23, document pattern generation means 24, first pattern search means 25, second vector generation means 26, second similarity calculation means 27, second pattern search means 28, third vector generation means 29, and third similarity calculation means 2A.
[0106]
The storage device 3 includes a first vector storage unit 31, a word dictionary storage unit 32, a second vector storage unit 33, a first pattern dictionary storage unit 34, a third vector storage unit 35, and a second pattern dictionary storage unit 36.
[0107]
[Effects of the invention]
As described above, the present invention has the following effects.
[0108]
The first effect of the present invention is that semantic processing deeper than the conventional method can be performed.
[0109]
The reason for this is that in the present invention, a pattern for extracting semantic information is provided as a pattern dictionary, so that the similarity based on shallow semantic processing using a word dictionary and the similarity based on deeper semantic processing using a pattern dictionary can be used effectively at the same time.
[0110]
The second effect of the present invention is to facilitate the handling of compound words and collocations.
[0111]
This is because in the present invention, these can be registered as patterns in the pattern dictionary.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a first exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first exemplary embodiment of the present invention.
FIG. 3 is a diagram showing a configuration of a second exemplary embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the second exemplary embodiment of the present invention.
FIG. 5 is a diagram showing a configuration of a third exemplary embodiment of the present invention.
FIG. 6 is a diagram for explaining an embodiment of the present invention, and is a diagram showing a specific example of a word dictionary.
FIG. 7 is a diagram for explaining an example of the present invention, and is a diagram showing a specific example of a pattern dictionary.
FIG. 8 is a diagram for explaining an embodiment of the present invention, and is a diagram showing a specific example of a document pattern.
[Explanation of symbols]
1 Document input means
2 Data processing apparatus
20 Similarity integration means
21 Word search means
22 First vector generation means
23 First similarity calculation means
24 Document pattern generation means
25 First pattern search means
26 Second vector generation means
27 Second similarity calculation means
28 Second pattern search means
29 Third vector generation means
2A Third similarity calculation means
3 Storage device
31 First vector storage means
32 Word dictionary storage means
33 Second vector storage means
34 Pattern dictionary storage means
4 Similarity output means
5 Recording medium

Claims

Word search means for decomposing the input document into word strings;
Means for decomposing the input document into word strings based on a word dictionary;
Means for counting a word frequency of a word string and generating a vector of word frequencies (referred to as a “first vector”) based on the word dictionary of the input document;
Means for comparing the generated first vector with a first vector of the word dictionary of a document stored in advance to calculate a first similarity;
Means for using the decomposed word array as a document pattern;
Means for scanning the document pattern based on the patterns of a pattern dictionary for extracting the semantic information of the document, and decomposing it into a sequence of patterns and words;
Means for counting a pattern frequency of a pattern and a word sequence, and generating a vector of pattern and word sequence pattern frequencies (referred to as a “second vector”) according to the pattern dictionary of the input document;
Means for comparing the generated second vector with a second vector stored in advance in the pattern dictionary of the document and calculating a second similarity;
Means for integrating the first similarity and the second similarity by a simple product or a weighted sum and outputting as one similarity;
An inter-document similarity calculation device characterized by comprising:

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
Pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the pattern dictionary of a stored document;
A document input means for inputting a document;
Word search means for decomposing a document input from the document input means into a word string based on word information stored in the word dictionary storage means;
First vector generation means for counting word frequencies of the decomposed word string and generating a first vector of the input document;
First similarity calculation means for comparing the first vector generated by the first vector generation means with the first vector stored in the first vector storage means and calculating the similarity;
A document pattern generation unit that uses an array of words decomposed by the word search unit as a document pattern;
A pattern search unit that scans a document pattern according to a pattern stored in the pattern dictionary storage unit and decomposes it into a pattern and a word arrangement;
Second vector generation means for counting pattern frequencies of patterns and word sequences and generating a second vector of the input document;
Second similarity calculation means for comparing the second vector generated by the second vector generation means with the second vector stored in the second vector storage means and calculating the similarity;
Similarity integration means for integrating the similarity output from the first similarity calculation means and the similarity output from the second similarity calculation means by a simple product or a weighted sum and outputting it as one similarity;
Means for outputting the similarity output from the similarity integration means;
An inter-document similarity calculation device characterized by comprising:

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
First pattern dictionary storage means for storing a first pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the first pattern dictionary of the stored document;
Second pattern dictionary storage means for storing a second pattern dictionary for extracting semantic information of the document;
Third vector storage means for storing a frequency vector (referred to as “third vector”) of a pattern of the stored document according to the second pattern dictionary;
A document input means for inputting a document;
Word search means for decomposing a document input from the document input means into a word string based on word information stored in the word dictionary storage means;
First vector generation means for counting the word frequency of the decomposed word string and generating a first vector by the word dictionary of the input document;
First similarity calculation means for comparing the first vector generated by the first vector generation means with the first vector stored in the first vector storage means and calculating the similarity;
A document pattern generation unit that uses an array of words decomposed by the word search unit as a document pattern;
First pattern search means for scanning a document pattern according to a pattern stored in the first pattern dictionary storage means and for decomposing the pattern into a pattern and a word arrangement;
Second vector generation means for counting pattern frequencies of patterns and word arrangements and generating a second vector of the input document by the first pattern dictionary;
Second similarity calculation means for comparing the second vector generated by the second vector generation means with the second vector stored in the second vector storage means and calculating the similarity;
A second pattern search unit that scans a document pattern according to a pattern stored in the second pattern dictionary storage unit and decomposes the document pattern into a pattern and a word arrangement;
A third vector generating means for counting a pattern frequency of patterns and word arrangements and generating a third vector by the second pattern dictionary of the input document;
Third similarity calculation means for comparing the third vector generated by the third vector generation means with the third vector stored in the third vector storage means and calculating the similarity;
Similarity integration means for integrating the similarities respectively output from the first to third similarity calculation means by a simple product or a weighted sum and outputting them as one similarity;
Means for outputting the similarity output from the similarity integration means;
An inter-document similarity calculation device characterized by comprising:

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
Pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the pattern dictionary of the stored document;
A method for calculating similarity between documents by an information processing apparatus including:
A first step of inputting a document from the document input means;
A second step of searching word information stored in the word dictionary storage means and replacing the document input from the document input means with a word string;
A third step of counting word frequencies and generating a first vector by a word dictionary for the input document;
A fourth step of calculating a first similarity between the generated first vector and the first vector stored in advance in the first vector storage means and read out from the first vector storage means;
A fifth step that regards a word string as a document pattern, performs a pattern search on the document pattern using the pattern dictionary stored in the pattern dictionary storage unit, and replaces the document with a pattern string;
A sixth step of counting a pattern frequency and generating a second vector by a pattern dictionary for the input document;
A seventh step of calculating a second similarity between the generated second vector and the second vector, based on the pattern dictionary, stored in advance in the second vector storage means and read out from the second vector storage means;
An eighth step of calculating and outputting an integrated similarity of the first similarity and the second similarity by a simple product or a weighted sum;
A method for calculating similarity between documents, including:

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
Pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the pattern dictionary of the stored document;
Second pattern dictionary storage means for storing a second pattern dictionary for extracting semantic information of the document;
Third vector storage means for storing a frequency vector (referred to as “third vector”) of a pattern of the stored document according to the second pattern dictionary;
A method for calculating similarity between documents by an information processing apparatus including:
A first step of inputting a document from the document input means;
A second step of searching word information stored in the word dictionary storage means and replacing the document input from the document input means with a word string;
A third step of counting word frequencies and generating a first vector by a word dictionary for the input document;
A fourth step of performing a first similarity calculation between the generated first vector and the first vector stored in advance in the first vector storage means and read out from the first vector storage means;
A fifth step of regarding the word string as a document pattern, performing a pattern search on the document pattern using the first pattern dictionary stored in advance in the first pattern dictionary storage means, and replacing the document with a pattern string;
A sixth step of counting a pattern frequency and generating a second vector;
A seventh step of performing a second similarity calculation between the generated second vector and the second vector stored in advance in the second vector storage means and read out from the second vector storage means;
An eighth step of performing a pattern search on the document pattern using the second pattern dictionary and replacing the document with a pattern sequence;
A ninth step of counting the frequency of patterns and generating a third vector by the second pattern dictionary;
A tenth step of performing a third similarity calculation between the generated third vector and the third vector stored in advance in the third vector storage means and read out from the third vector storage means;
An eleventh step of calculating and outputting an integrated similarity of the first to third similarities by a simple product or a weighted sum;
A method for calculating similarity between documents, including:

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
Pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the pattern dictionary of the stored document;
A storage device comprising:
An input device for inputting a document;
An output device for outputting the similarity,
In an information processing device comprising a program-controlled data processing device,
(A) a word search process for decomposing an input document into a word string based on word information stored in the word dictionary storage means;
(B) a first vector generation process for counting a word frequency of the decomposed word string and generating a first vector based on a word dictionary of the input document;
(C) a first similarity calculation process for comparing the first vector generated in the first vector generation process with the first vector stored in the first vector storage means and calculating the similarity;
(D) a document pattern generation process in which an array of words decomposed by the word search process is used as a document pattern;
(E) a pattern search process in which a document pattern is scanned by a pattern stored in the pattern dictionary storage means and decomposed into a pattern and a word arrangement;
(F) a second vector generation process for counting pattern frequencies of patterns and word sequences and generating a second vector by the pattern dictionary of the input document;
(G) a second similarity calculation process for comparing the second vector generated in the second vector generation process with the second vector stored in the second vector storage means and calculating the similarity;
(H) a similarity integration process in which the similarity output from the first similarity calculation process and the similarity output from the second similarity calculation process are integrated by a simple product or a weighted sum and output as one similarity;
A recording medium on which a program for causing the data processing apparatus to execute each of the processes (A) to (H) is recorded.

Word dictionary storage means for storing a word dictionary;
First vector storage means for storing a vector of word frequencies (referred to as “first vector”) according to the word dictionary of a stored document;
Pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document;
Second vector storage means for storing a vector of pattern frequencies of patterns and words (referred to as “second vector”) according to the pattern dictionary of the stored document;
Second pattern dictionary storage means for storing a second pattern dictionary for extracting semantic information of the document;
Third vector storage means for storing a frequency vector (referred to as “third vector”) of a pattern of the stored document according to the second pattern dictionary;
A storage device comprising:
An input device for inputting a document;
An output device for outputting the similarity,
In an information processing device comprising a program-controlled data processing device,
(A) a word search process for decomposing an input document into word strings based on word information stored in the word dictionary storage means;
(B) a first vector generation process for counting a word frequency of the decomposed word string and generating a first vector based on a word dictionary of the input document;
(C) a first similarity calculation process for comparing the first vector generated in the first vector generation process with the first vector stored in the first vector storage means and calculating the similarity;
(D) a document pattern generation process in which an array of words decomposed by the word search process is used as a document pattern;
(E) a first pattern search process that scans a document pattern with a pattern stored in the first pattern dictionary storage means and decomposes the document pattern into a pattern and a word arrangement;
(F) a second vector generation process for counting a pattern frequency of patterns and word sequences and generating a second vector by the first pattern dictionary of the input document;
(G) a second similarity calculation process in which the second vector generated in the second vector generation process is compared with the second vector stored in the second vector storage means, and the similarity is calculated;
(H) a second pattern search process in which a document pattern is scanned according to a pattern stored in the second pattern dictionary storage means and decomposed into a pattern and a word arrangement;
(I) a third vector generation process for counting pattern frequencies of patterns and word arrangements and generating a third vector by the second pattern dictionary of the input document;
(J) a third similarity calculation process for comparing the third vector generated by the third vector generation process with the third vector stored in the third vector storage means and calculating the similarity;
(K) a similarity integration process for integrating the similarities respectively output from the first to third similarity calculation processes by a simple product or a weighted sum and outputting them as one similarity;
A recording medium on which a program for causing the data processing apparatus to execute each of the processes (A) to (K) is recorded.