JP2001155027A

JP2001155027A - Method, system and device for calculating similarity between documents, and recording medium recorded with program for similarity calculation

Info

Publication number: JP2001155027A
Application number: JP33638099A
Authority: JP
Inventors: Naoki Fujita; 直毅藤田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-11-26
Filing date: 1999-11-26
Publication date: 2001-06-08
Anticipated expiration: 2019-11-26
Also published as: JP3690216B2

Abstract

PROBLEM TO BE SOLVED: To provide a similar-document retrieval system and method that enable deeper semantic processing and make compound words and phrases easier to handle. SOLUTION: The device comprises: first vector generating means for counting the word frequencies of the word string produced by word search means and generating a first vector of the input document based on a word dictionary; first similarity calculating means for comparing the first vector with a vector stored in first vector storage means and calculating a similarity; document pattern generating means for treating the sequence of words as a document pattern; pattern search means for scanning the document pattern with the patterns stored in pattern dictionary storage means and decomposing it into a sequence of patterns and words; second vector generating means for counting the pattern frequencies in that sequence and generating a second vector of the input document based on the pattern dictionary; second similarity calculating means for comparing the second vector with a vector stored in second vector storage means and calculating a similarity; and similarity merging means for merging the outputs of the first and second similarity calculating means and outputting the result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to similarity calculation techniques, and in particular to a similar-document retrieval system, a similar-document retrieval method, and a recording medium storing a similar-document retrieval program, all well suited to search and classification.

[0002]

[Prior Art] A known similar-document retrieval method in the field of information retrieval, described for example in Reference 1 (G. Salton, M. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983), computes the distance or similarity between documents from the frequencies of the words that occur in them. This conventional method obtains a word-frequency vector for each document, applies a weighting called TF-IDF to each vector, and takes the cosine of the angle between two vectors as the similarity between the documents:

cos θ = (x · y) / (|x| |y|)

where (x · y) is the inner product of the two vectors x and y, and |x| and |y| are their magnitudes. In the TF-IDF method, the importance of a word is defined using its occurrence frequency tf and the inverse idf of its document count df. The importance (weight) W(tk, Ci) of word tk in category Ci is expressed as

W(tk, Ci) = tf(tk, Ci) log(Li / df(tk, Ci) + 1)

where tf(tk, Ci) is the occurrence frequency of word tk in category Ci, df(tk, Ci) is the number of texts in category Ci in which word tk occurs, and Li is the total number of texts in category Ci.
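The weighting and cosine measure described in [0002] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited reference or of this patent: the function names, the dict-based sparse vectors, and the literal reading of the printed formula as log(L / df + 1) are all assumptions.

```python
import math
from collections import Counter


def tfidf_weight(tf, df, total_docs):
    """Weight W = tf * log(L / df + 1), reading the printed formula literally."""
    return tf * math.log(total_docs / df + 1)


def weighted_vector(tokens, doc_freq, total_docs):
    """Build a sparse TF-IDF vector; terms with no document count are skipped."""
    counts = Counter(tokens)
    return {
        term: tfidf_weight(tf, doc_freq[term], total_docs)
        for term, tf in counts.items()
        if doc_freq.get(term, 0) > 0
    }


def cosine(x, y):
    """cos(theta) = (x . y) / (|x| |y|) for sparse vectors stored as dicts."""
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty vectors (an assumption)
    return dot / (norm_x * norm_y)
```

Storing vectors as dicts keeps only the terms that actually occur, which suits the sparse word-frequency vectors the text describes.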

[0003] In addition, Japanese Unexamined Patent Application Publication No. H11-134359, for example, proposes a method that, in calculating the similarity between two documents, performs morphological analysis on two input sets of a document and its summary to extract words, weights the words contained in the summaries, and calculates the similarity of the two documents based on the words contained in each document.

[0004]

[Problems to Be Solved by the Invention] However, the conventional methods described above have the following problems.

[0005] The first problem is that the accuracy of the similarity between documents is low.

[0006] The reason is that only shallow semantic processing based on word vectors is performed.

[0007] The second problem is that compound words are difficult to handle.

[0008] The reason is that the document is represented by a single vector.

[0009] The present invention has been made in view of these problems, and its object is to provide a similar-document retrieval system, method, and recording medium capable of deeper semantic processing.

[0010] Another object of the present invention is to provide a similar-document retrieval system, method, and recording medium in which compound words and collocations are easy to handle.

[0011]

[Means for Solving the Problems] To achieve the above objects, the similarity retrieval system of the present invention comprises first similarity calculating means based on a word dictionary, second similarity calculating means based on a pattern dictionary, and means for calculating a single similarity from the two similarities.

[0012] The present invention also comprises first similarity calculating means based on a word dictionary, second similarity calculating means based on a first pattern dictionary, third similarity calculating means based on a second pattern dictionary, and means for calculating a single similarity from the three similarities.

[0013]

[Mode for Carrying Out the Invention] Modes for carrying out the present invention will now be described. Referring to FIG. 1, in one preferred mode the apparatus of the present invention includes a storage device (3) comprising first vector storage means (31) for storing a first vector of each stored document based on a word dictionary, word dictionary storage means (32) for storing the word dictionary, second vector storage means (33) for storing a second vector of each stored document based on a pattern dictionary, and pattern dictionary storage means (34) for storing the pattern dictionary. The apparatus also includes document input means (1) for inputting a document. A data processing device comprises: word search means (21) for decomposing the input document into a word string using the words stored in the word dictionary storage means (32); first vector generating means (22) for counting the word frequencies in the decomposed word string and generating a first vector of the input document based on the word dictionary; first similarity calculating means (23) for comparing the first vector generated by the first vector generating means (22) with the first vector stored in the first vector storage means (31) and calculating their similarity; document pattern generating means (24) for treating the sequence of words decomposed by the word search means (21) as a document pattern; pattern search means (25) for scanning the document pattern using the patterns stored in the pattern dictionary storage means (34) and decomposing it into a sequence of patterns and words; second vector generating means (26) for counting the pattern frequencies in the sequence of patterns and words and generating a second vector of the input document based on the pattern dictionary; second similarity calculating means (27) for comparing the second vector generated by the second vector generating means (26) with the second vector stored in the second vector storage means (33) and calculating their similarity; and similarity integrating means (20) for integrating the output of the first similarity calculating means (23) and the output of the second similarity calculating means (27) and outputting the result as a single similarity. The apparatus further includes means (4) for outputting the similarity output from the similarity integrating means (20).

[0014] Referring to FIG. 3, in another mode of the present invention the storage device (3) comprises first vector storage means (31) for storing a first vector of each stored document based on a word dictionary, word dictionary storage means (32) for storing the word dictionary, second vector storage means (33) for storing a second vector of each stored document based on a first pattern dictionary, and pattern dictionary storage means (34) for storing the first pattern dictionary, and further comprises third vector storage means (35) for storing a third vector of each stored document based on a second pattern dictionary and second pattern dictionary storage means (36) for storing the second pattern dictionary. The data processing device comprises: word search means (21) for decomposing the input document into a word string using the words stored in the word dictionary storage means; first vector generating means (22) for counting the word frequencies in the decomposed word string and generating a vector of the input document based on the word dictionary; first similarity calculating means (23) for comparing the first vector generated by the first vector generating means (22) with the first vector stored in the first vector storage means (31) and calculating their similarity; document pattern generating means (24) for treating the sequence of words decomposed by the word search means (21) as a document pattern; first pattern search means (25) for scanning the document pattern using the patterns stored in the first pattern dictionary storage means (34) and decomposing it into a sequence of patterns and words; second vector generating means (26) for counting the pattern frequencies in the sequence of patterns and words and generating a second vector of the input document based on the first pattern dictionary; second similarity calculating means (27) for comparing the second vector generated by the second vector generating means (26) with the second vector stored in the second vector storage means (33) and calculating their similarity; second pattern search means (28) for scanning the document pattern using the patterns stored in the second pattern dictionary storage means (36) and decomposing it into a sequence of patterns and words; third vector generating means (29) for counting the pattern frequencies in the sequence of patterns and words and generating a third vector of the input document based on the second pattern dictionary; third similarity calculating means (2A) for comparing the third vector generated by the third vector generating means (29) with the third vector stored in the third vector storage means (35) and calculating their similarity; and similarity integrating means (20) for integrating the outputs of the first through third similarity calculating means and outputting the result as a single similarity.

[0015] In these modes of the present invention, the processing and functions of each means of the data processing device are realized by a program executed on the data processing device. In this case, the present invention can be carried out by loading the program in executable form from a recording medium on which it is recorded, via a suitable reading device, into the main memory of the data processing device and executing it.

[0016]

[Embodiments] To describe the above modes in more detail, embodiments of the present invention will now be described with reference to the drawings. FIG. 1 shows the configuration of a first embodiment of the present invention. Referring to FIG. 1, the first embodiment comprises document input means 1, a data processing device 2 that operates under program control, a storage device 3, and similarity output means 4.

[0017] Referring to FIG. 1, the data processing device 2 includes similarity integrating means 20, word search means 21, first vector generating means 22, first similarity calculating means 23, document pattern generating means 24, pattern search means 25, second vector generating means 26, and second similarity calculating means 27.

[0018] The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, and pattern dictionary storage means 34.

[0019] Each of these means operates roughly as follows.

[0020] The document input means 1 inputs a document to the data processing device 2.

[0021] The similarity output means 4 outputs the similarity between a stored document and the input document.

[0022] The similarity integrating means 20 integrates the output of the first similarity calculating means 23 and the output of the second similarity calculating means 27 and outputs the result as a single similarity.

[0023] The word search means 21 decomposes the input document into a word string using the words stored in the word dictionary storage means 32.

[0024] The first vector generating means 22 counts the word frequencies in the decomposed word string and generates a vector of the input document based on the word dictionary.

[0025] The first similarity calculating means 23 compares the vector generated by the first vector generating means 22 with the vector stored in the first vector storage means 31 and calculates their similarity.

[0026] The document pattern generating means 24 treats the sequence of words decomposed by the word search means 21 as a document pattern.

[0027] The pattern search means 25 scans the document pattern using the patterns stored in the pattern dictionary storage means 34 and decomposes it into a sequence of patterns and words.

[0028] The second vector generating means 26 counts the pattern frequencies in the sequence of patterns and words and generates a vector of the input document based on the pattern dictionary.

[0029] The second similarity calculating means 27 compares the vector generated by the second vector generating means 26 with the vector stored in the second vector storage means 33 and calculates their similarity.

[0030] The first vector storage means 31 stores the vectors of the stored documents based on the word dictionary.

[0031] The word dictionary storage means 32 stores the word dictionary used by the word search means 21.

[0032] The second vector storage means 33 stores the vectors of the stored documents based on the pattern dictionary.

[0033] The pattern dictionary storage means 34 stores the pattern dictionary used by the pattern search means 25.

[0034] FIG. 2 is a flowchart showing the processing procedure of the first embodiment of the present invention. The overall operation of the first embodiment will be described in detail with reference to FIGS. 1 and 2.

[0035] First, a document is input to the data processing device 2 (step S1 in FIG. 2).

[0036] Next, a word search is performed and the document is replaced with a word string (step S2).

[0037] The word frequencies are then counted and a first vector is generated (step S3).

[0038] A first similarity is then calculated between the generated vector and the vector stored in the first vector storage means 31 (step S4).

[0039] Next, the word string is regarded as a document pattern (step S5), and a pattern search is performed on the document pattern to replace the document with a pattern string (step S6).

[0040] The pattern frequencies are counted and a second vector is generated (step S7).

[0041] A second similarity is calculated between the generated vector and the vector stored in the second vector storage means 33 (step S8).

[0042] Finally, an integrated similarity is calculated from the two similarities obtained and is output (step S9).

[0043] Next, the operation and effects of the first embodiment of the present invention will be described.

[0044] The first embodiment holds the patterns used to extract semantic information as a pattern dictionary, and is configured so that the similarity obtained by shallow semantic processing with the word dictionary and the similarity obtained by deeper semantic processing with the pattern dictionary can both be used effectively at the same time. Deeper semantic processing than in the conventional methods is therefore possible.

[0045] Next, the first embodiment of the present invention will be described using a concrete example.

[0046] As shown in FIGS. 6 to 8, suppose that a document reading 「資料を送付してくださいね。」 ("Please send the materials, okay?") is input. Using the word dictionary of FIG. 6, it is decomposed into 「資料」 (materials), 「を」 (object particle), 「送付」 (sending), 「して」 (doing), 「ください」 (please), 「ね」 (sentence-final particle), and 「。」 (full stop), and these are then converted to their normal forms: 「資料」, 「を」, 「送付」, 「する」, 「くださる」, 「ね」, 「。」.

[0047] From these, the word-frequency vector (first vector) (「資料」 1, 「を」 1, 「送付」 1, 「する」 1, 「くださる」 1, 「ね」 1, 「。」 1) is obtained. Based on this vector, the similarity is calculated using, for example, the known TF-IDF method.
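The decomposition, normalization, and counting in this example can be sketched as follows. The text does not say how the word search means matches dictionary entries, so greedy longest-match segmentation is an assumption here, and `WORD_DICT` and `NORMAL_FORM` are hypothetical stand-ins for the word dictionary of FIG. 6.

```python
from collections import Counter

# Hypothetical stand-ins for the word dictionary of FIG. 6.
WORD_DICT = {"資料", "を", "送付", "して", "ください", "ね", "。"}
NORMAL_FORM = {"して": "する", "ください": "くださる"}


def segment(text, dictionary):
    """Greedy longest-match segmentation; unknown characters pass through singly."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


def first_vector(text):
    """Segment, map each word to its normal form, and count frequencies."""
    normalized = [NORMAL_FORM.get(tok, tok) for tok in segment(text, WORD_DICT)]
    return Counter(normalized)


vector = first_vector("資料を送付してくださいね。")
```

With these assumed dictionaries, `vector` holds a count of 1 for each of the seven normal forms listed in [0047].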

[0048] Next, as shown in the document pattern examples of FIG. 8, the document 「資料を送付してくださいね。」 (the original document) is converted, for example, into the form 「資料を送付してくださいね。＄」 (document pattern example 1), where ＄ is a symbol marking the end of the sentence.

[0049] If a stop-word dictionary that treats 「する」, 「ね」, 「。」, and the like as stop words is used, the document is also converted into the form 「資料を送付＊ください＊＄」 (document pattern example 2).
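One way to arrive at document pattern example 2 is sketched below. The rule that a run of consecutive stop words collapses into a single ＊ is inferred from the example (「ね」 and 「。」 become one ＊) and is an assumption, as are the names `TOKENS`, `STOP_WORDS`, and `document_pattern`.

```python
# (surface form, normal form) pairs as produced by the word search step.
TOKENS = [
    ("資料", "資料"), ("を", "を"), ("送付", "送付"), ("して", "する"),
    ("ください", "くださる"), ("ね", "ね"), ("。", "。"),
]
STOP_WORDS = {"する", "ね", "。"}  # hypothetical stop-word dictionary
WILDCARD, END = "＊", "＄"


def document_pattern(tokens, stop_words):
    """Replace each run of stop words with one wildcard and append the end marker."""
    pattern = []
    for surface, normal in tokens:
        if normal in stop_words:
            # Collapse consecutive stop words into a single wildcard.
            if not pattern or pattern[-1] != WILDCARD:
                pattern.append(WILDCARD)
        else:
            pattern.append(surface)
    pattern.append(END)
    return pattern


pattern = document_pattern(TOKENS, STOP_WORDS)
```

Joining `pattern` gives 「資料を送付＊ください＊＄」; stop words are matched on the normal form while kept words retain their surface form, as in the example.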

[0050] Applying a pattern dictionary such as that shown in FIG. 7 to such a document pattern converts it into the form 「資料送付希望」 ("materials, sending requested"), from which the vector (second vector) (「資料」 1, 「送付希望」 1) is obtained.
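The pattern search and second-vector step can be sketched as below. FIG. 7 is not reproduced in this text, so the single entry in `PATTERN_DICT` is a hypothetical rule chosen only so that the result matches the example; the left-to-right longest-match scan and the dropping of the ＊ and ＄ markers from the vector are likewise assumptions.

```python
from collections import Counter

# Hypothetical pattern dictionary: token sequence -> more abstract label.
PATTERN_DICT = {("を", "送付", "＊", "ください"): "送付希望"}
MARKERS = {"＊", "＄"}  # wildcard and end-of-sentence symbols carry no content


def apply_patterns(doc_pattern, pattern_dict):
    """Scan left to right, replacing the longest matching pattern at each position."""
    sizes = sorted({len(key) for key in pattern_dict}, reverse=True)
    out, i = [], 0
    while i < len(doc_pattern):
        for size in sizes:
            label = pattern_dict.get(tuple(doc_pattern[i:i + size]))
            if label is not None:
                out.append(label)
                i += size
                break
        else:
            # No pattern starts here: keep the word as it is.
            out.append(doc_pattern[i])
            i += 1
    return out


def second_vector(doc_pattern, pattern_dict):
    """Count the patterns and remaining words, ignoring marker symbols."""
    decomposed = apply_patterns(doc_pattern, pattern_dict)
    return Counter(tok for tok in decomposed if tok not in MARKERS)


doc_pattern = ["資料", "を", "送付", "＊", "ください", "＊", "＄"]
vector = second_vector(doc_pattern, PATTERN_DICT)
```

Under these assumptions `vector` contains 「資料」 and 「送付希望」 with a count of 1 each.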

[0051] By generating vectors at several levels of abstraction in this way, a representation at the level of abstraction suited to the purpose can be selected.

[0052] For this vector too, as for the first vector, the similarity is calculated using, for example, the known TF-IDF method.

[0053] The two similarities obtained are converted into a single similarity by a simple product or a weighted sum, and that similarity is output.
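The two merging options named here, a simple product and a weighted sum, can be sketched as follows. The function names are illustrative, and the text does not specify the weights, so any values passed in are a tuning choice.

```python
def merge_product(similarities):
    """Simple product of the per-vector similarities."""
    result = 1.0
    for s in similarities:
        result *= s
    return result


def merge_weighted_sum(similarities, weights):
    """Weighted sum; the weights are a tuning choice not fixed by the text."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity")
    return sum(w * s for w, s in zip(weights, similarities))
```

A product drives the merged score toward zero whenever either similarity is low, whereas a weighted sum lets a strong match at one level of abstraction compensate for a weak match at the other.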

[0054] In the first embodiment of the present invention, the similarities obtained may be sorted in ascending or descending order.
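Sorting the results might look like the sketch below, assuming the merged similarities are held in a mapping from a stored-document identifier to its score. The identifiers and scores are made up for illustration.

```python
def rank_documents(scores, descending=True):
    """Return (document id, similarity) pairs sorted by similarity."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=descending)


# Made-up scores for three stored documents.
ranked = rank_documents({"doc1": 0.47, "doc2": 0.91, "doc3": 0.12})
```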

[0055] Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

[0056] FIG. 3 shows the configuration of the second embodiment of the present invention. Referring to FIG. 3, the second embodiment comprises document input means 1, a data processing device 2 that operates under program control, a storage device 3, and similarity output means 4.

[0057] The data processing device 2 includes similarity integrating means 20, word search means 21, first vector generating means 22, first similarity calculating means 23, document pattern generating means 24, first pattern search means 25, second vector generating means 26, second similarity calculating means 27, second pattern search means 28, third vector generating means 29, and third similarity calculating means 2A.

[0058] The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, first pattern dictionary storage means 34, third vector storage means 35, and second pattern dictionary storage means 36.

[0059] Each of these means has roughly the following function.

[0060] The document input means 1 inputs a document to the data processing device 2.

[0061] The similarity output means 4 outputs the similarity between a stored document and the input document.

[0062] The similarity integrating means 20 integrates the outputs of the first similarity calculating means 23, the second similarity calculating means 27, and the third similarity calculating means 2A and outputs the result as a single similarity.

[0063] The word search means 21 decomposes the input document into a word string using the words stored in the word dictionary storage means 32.

[0064] The first vector generating means 22 counts the word frequencies in the decomposed word string and generates a vector of the input document based on the word dictionary.

[0065] The first similarity calculating means 23 compares the vector generated by the first vector generating means 22 with the vector stored in the first vector storage means 31 and calculates their similarity.

[0066] The document pattern generating means 24 treats the sequence of words decomposed by the word search means 21 as a document pattern.

[0067] The first pattern search means 25 scans the document pattern using the patterns stored in the first pattern dictionary storage means 34 and decomposes it into a sequence of patterns and words.

[0068] The second vector generating means 26 counts the pattern frequencies in the sequence of patterns and words and generates a vector of the input document based on the first pattern dictionary.

[0069] The second similarity calculating means 27 compares the vector generated by the second vector generating means 26 with the vector stored in the second vector storage means 33 and calculates their similarity.

[0070] The second pattern search means 28 scans the document pattern using the patterns stored in the second pattern dictionary storage means 36 and decomposes it into a sequence of patterns and words.

[0071] The third vector generating means 29 counts the pattern frequencies in the sequence of patterns and words and generates a vector of the input document based on the second pattern dictionary.

[0072] The third similarity calculating means 2A compares the vector generated by the third vector generating means 29 with the vector stored in the third vector storage means 35 and calculates their similarity.

[0073] The first vector storage means 31 stores the vectors of the stored documents based on the word dictionary.

[0074] The word dictionary storage means 32 stores the word dictionary used by the word search means 21.

[0075] The second vector storage means 33 stores the vectors of the stored documents based on the first pattern dictionary.

[0076] The first pattern dictionary storage means 34 stores the pattern dictionary used by the first pattern search means 25.
【００７７】次に、図４は、本発明の第２の実施例の処
理手順を示す流れ図である。図３及び図４を参照して、
本発明の第２の実施例の全体の動作について詳細に説明
する。なお、図４のステップＳ１〜Ｓ８は、図２に示し
た処理と実質的に同一である。Next, FIG. 4 is a flowchart showing a processing procedure of the second embodiment of the present invention. Referring to FIG. 3 and FIG.
The overall operation of the second embodiment of the present invention will be described in detail. Steps S1 to S8 in FIG. 4 are substantially the same as the processing shown in FIG.

【００７８】まず、文書をデータ処理装置に入力する
（図４のステップＳ１）。First, a document is input to the data processing device (step S1 in FIG. 4).

【００７９】次に、単語検索を行ない文書を単語列に置
き換える（ステップＳ２）。Next, a word search is performed to replace the document with a word string (step S2).

【００８０】さらに、単語の頻度を計数し第１のベクト
ルを生成する（ステップＳ３）。Further, the frequency of words is counted to generate a first vector (step S3).

【００８１】さらに、生成されたベクトルと第１のベク
トル記憶手段３１に記憶されたベクトルとの間で、第１
の類似度計算を行なう（ステップＳ４）。Further, between the generated vector and the vector stored in the first vector storage means 31,
Is calculated (step S4).

【００８２】次に、単語列を文書パタンとみなし（ステ
ップＳ５）、文書パタンに対して第１のパタン辞書を利
用してパタン検索を行ない、文書をパタン列に置き換え
る（ステップＳ６）。Next, the word string is regarded as a document pattern (step S5), a pattern search is performed on the document pattern using the first pattern dictionary, and the document is replaced with the pattern string (step S6).

【００８３】さらに、パタンの頻度を計数し第２のベク
トルを生成する（ステップＳ７）。Further, the frequency of the pattern is counted to generate a second vector (step S7).

【００８４】さらに、生成されたベクトルと第２のベク
トル記憶手段３３に記憶されたベクトルとの間で、第２
の類似度計算を行なう（ステップＳ８）。Further, the second vector between the generated vector and the vector stored in the second vector
Is calculated (step S8).

【００８５】文書パタンに対して第２のパタン辞書を利
用してパタン検索を行ない文書をパタン列に置き換える
（ステップＳ１０）。A pattern search is performed on the document pattern using the second pattern dictionary, and the document is replaced with a pattern sequence (step S10).

【００８６】さらに、パタンの頻度を計数し第３のベク
トルを生成する（ステップＳ１１）。Further, the frequency of the pattern is counted to generate a third vector (step S11).

【００８７】さらに、生成された第３のベクトルと第３
のベクトル記憶手段３５に記憶されたベクトルとの間
で、第３の類似度計算を行なう（ステップＳ１２）。Further, the generated third vector and the third vector
A third similarity calculation is performed between the vector and the vector stored in the vector storage means 35 (step S12).

【００８８】最後に、得られた２つの類似度の統合類似
度を計算し出力する（ステップＳ１３）。Finally, the integrated similarity of the two obtained similarities is calculated and output (step S13).
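The flow of steps S3 to S13 can be sketched generically as one similarity per representation followed by a merge. Everything named below is hypothetical: each "view" is a function that maps the word string to the features counted for one vector (words, first-dictionary patterns, second-dictionary patterns), plain frequency counts are used instead of TF-IDF weights for brevity, and a weighted sum is used as the merge.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def integrated_similarity(tokens, views, stored_vectors, weights):
    """Compute one similarity per view and merge them with a weighted sum."""
    similarities = [
        cosine(Counter(view(tokens)), stored)
        for view, stored in zip(views, stored_vectors)
    ]
    merged = sum(w * s for w, s in zip(weights, similarities))
    return merged, similarities
```

Adding a further pattern dictionary then amounts to appending one more view, one more stored vector, and one more weight.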

[0089] Next, the operation and effects of the second embodiment of the present invention will be described.

[0090] This embodiment holds the patterns used to extract semantic information as pattern dictionaries, and is configured so that the similarity obtained by shallow semantic processing with the word dictionary and the similarity obtained by deeper semantic processing with the pattern dictionaries can both be used effectively at the same time. Deeper semantic processing than before is therefore possible.

[0091] Next, the second embodiment of the present invention will be described with reference to a specific example.

[0092] As shown in FIGS. 6 to 8, suppose that a document reading "資料を送付してくださいね。" ("Please send the materials.") is input. Using the word dictionary of FIG. 6, the document is decomposed into "資料" (materials), "を" (object particle), "送付" (sending), "して", "ください", "ね", and "。", and these are then converted to their normal forms "資料", "を", "送付", "する" (do), "くださる" (give), "ね", and "。".

[0093] From this, the word frequency vector ("資料" 1, "を" 1, "送付" 1, "する" 1, "くださる" 1, "ね" 1, "。" 1) is obtained, and based on this vector the similarity is calculated by, for example, the well-known TF-IDF method.
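The word-frequency step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the function names, the smoothed IDF formula, and the choice of cosine similarity are all assumptions, since the text says only that the similarity is calculated by a known method such as TF-IDF.

```python
import math
from collections import Counter


def frequency_vector(normal_forms):
    """Count how often each normal-form word occurs (the first vector)."""
    return Counter(normal_forms)


def idf_weights(stored_vectors):
    """Smoothed inverse document frequency over the stored vectors.

    The smoothing formula is an assumption; the patent does not give one.
    """
    n = len(stored_vectors)
    # Iterating over a Counter yields each distinct term once per document.
    df = Counter(term for vec in stored_vectors for term in vec)
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_cosine(vec_a, vec_b, idf, default_idf=1.0):
    """Cosine similarity between two TF-IDF weighted frequency vectors."""
    def weight(vec):
        return {t: f * idf.get(t, default_idf) for t, f in vec.items()}

    wa, wb = weight(vec_a), weight(vec_b)
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Normal forms of the example sentence from paragraph [0092].
query = frequency_vector(["資料", "を", "送付", "する", "くださる", "ね", "。"])
# Two made-up stored documents, for illustration only.
stored = [
    frequency_vector(["資料", "を", "送付", "する"]),
    frequency_vector(["会議", "を", "開催", "する"]),
]
idf = idf_weights(stored)
first_similarities = [tfidf_cosine(query, doc, idf) for doc in stored]
```

In this sketch the stored vectors stand in for the contents of the first vector storage means 31; a real system would build them from the stored documents with the same word dictionary.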

[0094] Next, as in the document pattern example of FIG. 8, the document "資料を送付してくださいね。" is converted, for example, into the form "資料を送付してくださいね。＄" (where ＄ is a symbol denoting the end of the sentence).

[0095] Further, when a stop-word (unnecessary-word) dictionary that treats "する", "ね", "。" and the like as stop words is used, the document can also be written in the form "資料を送付＊ください＊＄".
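The two document-pattern forms above can be reproduced with a short Python sketch. The token list mirrors the decomposition in paragraph [0092]; the function name and the rule that consecutive stop words collapse into a single ＊ are assumptions inferred from the example string, not something the text states explicitly.

```python
# (surface form, normal form) pairs for the example sentence.
TOKENS = [
    ("資料", "資料"),
    ("を", "を"),
    ("送付", "送付"),
    ("して", "する"),
    ("ください", "くださる"),
    ("ね", "ね"),
    ("。", "。"),
]

# Stop words named in the text, matched against normal forms.
STOP_WORDS = frozenset({"する", "ね", "。"})

WILDCARD = "＊"      # placeholder left where stop words were removed
END_OF_SENTENCE = "＄"


def to_document_pattern(tokens, stop_words=frozenset()):
    """Build a document pattern string from a token sequence.

    Tokens whose normal form is a stop word are replaced by a wildcard,
    and a run of adjacent stop words yields a single wildcard (assumption).
    """
    parts = []
    for surface, normal in tokens:
        if normal in stop_words:
            if not parts or parts[-1] != WILDCARD:
                parts.append(WILDCARD)
        else:
            parts.append(surface)
    return "".join(parts) + END_OF_SENTENCE


plain_pattern = to_document_pattern(TOKENS)
reduced_pattern = to_document_pattern(TOKENS, STOP_WORDS)
```

Here `plain_pattern` corresponds to the form with only the end-of-sentence symbol appended, and `reduced_pattern` to the form in which stop words are replaced by ＊.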

[0096] When a pattern dictionary such as that shown in FIG. 7 is applied to such a document pattern, the pattern is converted into the form "資料送付希望" ("materials" followed by "sending requested"), and from this the vector ("資料" 1, "送付希望" 1) is obtained.
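The pattern search and second-vector generation can be sketched in Python as follows. FIG. 7 is not reproduced in this text, so the pattern entry and the word list below are hypothetical stand-ins chosen only so that the example yields the result described above; the scanning strategy (try patterns first, then words, otherwise skip one character) is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical pattern dictionary: (compiled pattern, label) pairs.
# The fullwidth ＊ and ＄ are ordinary characters to the regex engine.
PATTERN_DICT = [
    (re.compile("を送付＊ください＊＄"), "送付希望"),
]

# Hypothetical list of words still recognised during the scan.
WORDS = ["資料"]


def pattern_search(doc_pattern):
    """Scan a document pattern and decompose it into patterns and words."""
    items = []
    pos = 0
    while pos < len(doc_pattern):
        hit = None
        for regex, label in PATTERN_DICT:
            match = regex.match(doc_pattern, pos)
            if match:
                hit = (label, match.end())
                break
        if hit is None:
            for word in WORDS:
                if doc_pattern.startswith(word, pos):
                    hit = (word, pos + len(word))
                    break
        if hit is None:
            pos += 1  # character covered by neither dictionary
            continue
        items.append(hit[0])
        pos = hit[1]
    return items


def pattern_vector(items):
    """Count pattern and word frequencies (the second vector)."""
    return Counter(items)


items = pattern_search("資料を送付＊ください＊＄")
second_vector = pattern_vector(items)
```

Joining `items` gives the converted form, and `second_vector` holds one count for each of the two entries.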

[0097] By generating vectors corresponding to several levels of abstraction in this way, a representation at the level of abstraction suited to the purpose can be selected. Then, as before, the similarity is calculated by, for example, the well-known TF-IDF method.

[0098] Similarly, when another pattern dictionary is used, another similarity is calculated.

[0099] The three similarities obtained are converted into a single similarity by a simple product or a weighted sum, and the result is output.
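The two integration options named here, a simple product and a weighted sum, can be written as a brief Python sketch. The function names, the example similarity values, and the weights are assumptions chosen for illustration; the text does not specify any particular weights.

```python
import math


def integrate_product(similarities):
    """Integrate similarities by taking their simple product."""
    return math.prod(similarities)


def integrate_weighted_sum(similarities, weights):
    """Integrate similarities by a weighted sum."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity")
    return sum(w * s for w, s in zip(weights, similarities))


# Illustrative values for the first, second, and third similarities.
sims = [0.8, 0.6, 0.4]
by_product = integrate_product(sims)
by_weighted_sum = integrate_weighted_sum(sims, [0.5, 0.3, 0.2])
```

A product penalises a document that scores poorly on any one dictionary, while a weighted sum lets one level of abstraction be emphasised over the others; which is preferable depends on the application.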

[0100] Note that, in the first embodiment of the present invention, the obtained similarities may be sorted in ascending or descending order.
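Sorting the results by similarity could look like the following Python sketch. The document identifiers and scores are made up, and the function name is an assumption.

```python
def rank_by_similarity(scores, descending=True):
    """Return (document id, similarity) pairs sorted by similarity."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=descending)


# Hypothetical integrated similarities for three stored documents.
scores = {"doc-a": 0.2, "doc-b": 0.9, "doc-c": 0.5}
most_similar_first = rank_by_similarity(scores)
least_similar_first = rank_by_similarity(scores, descending=False)
```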

[0101] Next, a third embodiment of the present invention will be described in detail with reference to the drawings. FIG. 5 is a diagram showing the configuration of the third embodiment of the present invention. Referring to FIG. 5, the third embodiment of the present invention includes a recording medium 5 on which a similarity calculation program is recorded. The medium may be a magnetic disk such as an FD (floppy disk), a semiconductor memory, a CD-ROM, a DVD (digital versatile disk), an MT (magnetic tape), or another recording medium. The present invention may also be implemented by having the data processing device 2 download the similarity calculation program, via communication means, from a storage medium of another data processing device such as a server device; in this case, the medium 5 also includes a communication medium.

[0102] The similarity calculation program is read from the recording medium 5 into the data processing device 2 and controls the operation of the computer. Under the control of the similarity calculation program, the computer executes the same processing as that performed by the data processing device 2 in the first or second embodiment described above, that is, the processing defined by the flowcharts of FIGS. 2 and 4.

[0103] That is, referring to FIG. 1, the configuration includes document input means 1, a data processing device 2 operating under program control, a storage device 3, and similarity output means 4. The data processing device 2 includes similarity integrating means 20, word search means 21, first vector generating means 22, first similarity calculating means 23, document pattern generating means 24, pattern search means 25, second vector generating means 26, and second similarity calculating means 27.

[0104] The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, and pattern dictionary storage means 34.

[0105] Alternatively, referring to FIG. 3, the configuration includes document input means 1, a data processing device 2 operating under program control, a storage device 3, and similarity output means 4. The data processing device 2 includes similarity integrating means 20, word search means 21, first vector generating means 22, first similarity calculating means 23, document pattern generating means 24, first pattern search means 25, second vector generating means 26, second similarity calculating means 27, second pattern search means 28, third vector generating means 29, and third similarity calculating means 2A.

[0106] The storage device 3 includes first vector storage means 31, word dictionary storage means 32, second vector storage means 33, first pattern dictionary storage means 34, third vector storage means 35, and second pattern dictionary storage means 36.

[0107]

[Effects of the Invention] As described above, the present invention provides the following effects.

[0108] The first effect of the present invention is that deeper semantic processing can be performed than with conventional methods.

[0109] The reason is that, in the present invention, patterns for extracting semantic information are provided as a pattern dictionary, so that the similarity obtained by shallow semantic processing using the word dictionary and the similarity obtained by deeper semantic processing using the pattern dictionary can both be used effectively at the same time.

[0110] The second effect of the present invention is that the handling of compound words and collocations is made easier.

[0111] The reason is that, in the present invention, these can be registered as patterns in the pattern dictionary.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the first embodiment of the present invention.

FIG. 3 is a diagram showing the configuration of the second embodiment of the present invention.

FIG. 4 is a flowchart showing the operation of the second embodiment of the present invention.

FIG. 5 is a diagram showing the configuration of the third embodiment of the present invention.

FIG. 6 is a diagram for explaining an embodiment of the present invention, showing a specific example of a word dictionary.

FIG. 7 is a diagram for explaining an embodiment of the present invention, showing a specific example of a pattern dictionary.

FIG. 8 is a diagram for explaining an embodiment of the present invention, showing a specific example of a document pattern.

[Explanation of symbols]

1 Document input means
2 Data processing device
20 Similarity integrating means
21 Word search means
22 First vector generating means
23 First similarity calculating means
24 Document pattern generating means
25 First pattern search means
26 Second vector generating means
27 Second similarity calculating means
28 Second pattern search means
29 Third vector generating means
2A Third similarity calculating means
3 Storage device
31 First vector storage means
32 Word dictionary storage means
33 Second vector storage means
34 Pattern dictionary storage means
4 Similarity output means
5 Recording medium

Claims

[Claims]

1. A similarity search system comprising: first similarity calculating means for calculating a first similarity based on a word dictionary; second similarity calculating means for calculating a second similarity based on a pattern dictionary; and similarity integrating means for calculating a single similarity from the first and second similarities.

2. A similarity search system comprising: first similarity calculating means for calculating a first similarity based on a word dictionary; second similarity calculating means for calculating a second similarity based on a first pattern dictionary; third similarity calculating means for calculating a third similarity based on a second pattern dictionary; and similarity integrating means for calculating a single similarity from the first, second, and third similarities.

3. The similarity search system according to claim 1, comprising a plurality of similarity calculating means based on a plurality of dictionaries, and means for integrating the similarities output from the plurality of similarity calculating means.

4. The similarity search system according to claim 1, further comprising means for sorting in order of similarity.

5. The similarity search system according to claim 1, wherein the first similarity calculating means calculates the first similarity by comparing vectors generated based on word frequency information of a document, and the second similarity calculating means calculates the second similarity by comparing vectors generated based on pattern frequency information of a document pattern, using the pattern dictionary, which stores pattern information for extracting semantic information of the document.

6. The similarity search system according to claim 2, wherein the first similarity calculating means calculates the first similarity by comparing vectors generated based on word frequency information of a document, the second similarity calculating means calculates the second similarity by comparing vectors generated based on pattern frequency information of a document pattern, using the first pattern dictionary, which stores pattern information for extracting semantic information of the document, and the third similarity calculating means calculates the third similarity by comparing vectors generated based on pattern frequency information of the document pattern, using the second pattern dictionary, which stores pattern information for extracting semantic information of the document.

7. A similarity search method comprising: a step of calculating a first similarity based on a word dictionary; a step of calculating a second similarity based on a pattern dictionary; and a step of calculating a single similarity from the first and second similarities.

8. A similarity calculation method comprising: a step of calculating a first similarity based on a word dictionary; a step of calculating a second similarity based on a first pattern dictionary; a step of calculating a third similarity based on a second pattern dictionary; and a step of calculating a single similarity from the first to third similarities.

9. The similarity search method according to claim 7, further comprising a step of sorting in order of similarity.

10. Patterns for extracting semantic information are provided as a pattern dictionary, and a similarity is output that is obtained by integrating a first similarity obtained by semantic processing using a word dictionary and a second similarity obtained by semantic processing using the pattern dictionary.

11. A recording medium on which is recorded a program for causing a computer to execute: a first similarity calculation process of calculating a first similarity based on a word dictionary; a second similarity calculation process of calculating a second similarity based on a pattern dictionary; and a similarity integration process of calculating a single similarity from the first and second similarities.

12. A recording medium on which is recorded a program for causing a computer to execute: a first similarity calculation process of calculating a first similarity based on a word dictionary; a second similarity calculation process of calculating a second similarity based on a first pattern dictionary; a third similarity calculation process of calculating a third similarity based on a second pattern dictionary; and a similarity integration process of calculating a single similarity from the first to third similarities.

13. The recording medium according to claim 11, wherein a program for causing said computer to further execute a process of sorting in order of similarity is recorded.

14. An inter-document similarity calculation apparatus comprising: means for decomposing an input document into a word string based on a word dictionary; means for counting the word frequencies of the decomposed word string and generating a first vector of the input document based on the word dictionary; means for comparing the generated first vector with a first vector, based on the word dictionary, of a previously stored document to calculate a first similarity; means for treating the sequence of decomposed words as a document pattern; means for scanning the document pattern based on the patterns of a pattern dictionary for extracting semantic information of a document and decomposing it into a sequence of patterns and words; means for counting the pattern frequencies of the sequence of patterns and words and generating a second vector of the input document based on the pattern dictionary; means for comparing the generated second vector with a second vector, based on the pattern dictionary, of the previously stored document to calculate a second similarity; and means for integrating the first similarity and the second similarity and outputting the result as a single similarity.

15. An inter-document similarity calculation apparatus comprising: word dictionary storage means for storing a word dictionary; first vector storage means for storing a first vector of a stored document according to the word dictionary; pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document; second vector storage means for storing a second vector of the stored document according to the pattern dictionary; document input means for inputting a document; word search means for decomposing the document input from the document input means into a word string based on the word information stored in the word dictionary storage means; first vector generating means for counting the word frequencies of the decomposed word string and generating a first vector of the input document according to the word dictionary; first similarity calculating means for comparing the first vector generated by the first vector generating means with the first vector stored in the first vector storage means and calculating their similarity; document pattern generating means for treating the sequence of words decomposed by the word search means as a document pattern; pattern search means for scanning the document pattern using the patterns stored in the pattern dictionary storage means and decomposing it into a sequence of patterns and words; second vector generating means for counting the pattern frequencies of the sequence of patterns and words and generating a second vector of the input document according to the pattern dictionary; second similarity calculating means for comparing the second vector generated by the second vector generating means with the second vector stored in the second vector storage means and calculating their similarity; similarity integrating means for integrating the similarity output from the first similarity calculating means and the similarity output from the second similarity calculating means and outputting the result as a single similarity; and means for outputting the similarity output from the similarity integrating means.

16. An inter-document similarity calculation apparatus comprising: word dictionary storage means for storing a word dictionary; first vector storage means for storing a first vector of a stored document according to the word dictionary; first pattern dictionary storage means for storing a first pattern dictionary for extracting semantic information of a document; second vector storage means for storing a second vector of the stored document according to the first pattern dictionary; second pattern dictionary storage means for storing a second pattern dictionary for extracting semantic information of a document; third vector storage means for storing a third vector of the stored document according to the second pattern dictionary; document input means for inputting a document; word search means for decomposing the document input from the document input means into a word string based on the word information stored in the word dictionary storage means; first vector generating means for counting the word frequencies of the decomposed word string and generating a first vector of the input document according to the word dictionary; first similarity calculating means for comparing the first vector generated by the first vector generating means with the first vector stored in the first vector storage means and calculating their similarity; document pattern generating means for treating the sequence of words decomposed by the word search means as a document pattern; first pattern search means for scanning the document pattern using the patterns stored in the first pattern dictionary storage means and decomposing it into a sequence of patterns and words; second vector generating means for counting the pattern frequencies of the sequence of patterns and words and generating a second vector of the input document according to the first pattern dictionary; second similarity calculating means for comparing the second vector generated by the second vector generating means with the second vector stored in the second vector storage means and calculating their similarity; second pattern search means for scanning the document pattern using the patterns stored in the second pattern dictionary storage means and decomposing it into a sequence of patterns and words; third vector generating means for counting the pattern frequencies of the sequence of patterns and words and generating a third vector of the input document according to the second pattern dictionary; third similarity calculating means for comparing the third vector generated by the third vector generating means with the third vector stored in the third vector storage means and calculating their similarity; similarity integrating means for integrating the similarities output from the first to third similarity calculating means and outputting the result as a single similarity; and means for outputting the similarity output from the similarity integrating means.

17. A method for calculating the similarity between documents, comprising: a first step of inputting a document; a second step of performing a word search and replacing the document with a word string; a third step of counting the word frequencies and generating a first vector of the input document based on a word dictionary; a fourth step of calculating a first similarity between the generated first vector and a vector, based on the word dictionary, stored in advance in first vector storage means; a fifth step of regarding the word string as a document pattern, performing a pattern search on the document pattern using a pattern dictionary, and replacing the document with a pattern sequence; a sixth step of counting the pattern frequencies and generating a second vector of the input document based on the pattern dictionary; a seventh step of calculating a second similarity between the generated second vector and a vector, based on the pattern dictionary, stored in advance in second vector storage means; and an eighth step of calculating and outputting an integrated similarity of the first similarity and the second similarity.

18. A method for calculating the similarity between documents, comprising: a first step of inputting a document; a second step of performing a word search and replacing the document with a word string; a third step of counting the word frequencies and generating a first vector of the input document based on a word dictionary; a fourth step of performing a first similarity calculation between the generated first vector and a first vector stored in advance in first vector storage means; a fifth step of regarding the word string as a document pattern, performing a pattern search on the document pattern using a first pattern dictionary, and replacing the document with a pattern sequence; a sixth step of counting the pattern frequencies and generating a second vector; a seventh step of performing a second similarity calculation between the generated second vector and a second vector stored in advance in second vector storage means; an eighth step of performing a pattern search on the document pattern using a second pattern dictionary and replacing the document with a pattern sequence; a ninth step of counting the pattern frequencies and generating a third vector; a tenth step of performing a third similarity calculation between the generated third vector and a vector stored in advance in third vector storage means; and an eleventh step of calculating and outputting an integrated similarity of the first to third similarities.

19. A recording medium on which is recorded a program for an information processing apparatus comprising: a storage device including word dictionary storage means for storing a word dictionary, first vector storage means for storing a first vector of a stored document according to the word dictionary, pattern dictionary storage means for storing a pattern dictionary for extracting semantic information of a document, and second vector storage means for storing a second vector of the stored document according to the pattern dictionary; an input device for inputting a document; an output device for outputting a similarity; and a data processing device controlled by a program, the program causing the data processing device to execute each of the following processes (a) to (h): (a) a word search process of decomposing the input document into a word string based on the word information stored in the word dictionary storage means; (b) a first vector generation process of counting the word frequencies of the decomposed word string and generating a first vector of the input document according to the word dictionary; (c) a first similarity calculation process of comparing the first vector generated in the first vector generation process with the first vector stored in the first vector storage means and calculating their similarity; (d) a document pattern generation process of treating the sequence of words decomposed by the word search process as a document pattern; (e) a pattern search process of scanning the document pattern using the patterns stored in the pattern dictionary storage means and decomposing it into a sequence of patterns and words; (f) a second vector generation process of counting the pattern frequencies of the sequence of patterns and words and generating a second vector of the input document according to the pattern dictionary; (g) a second similarity calculation process of comparing the second vector generated in the second vector generation process with the second vector stored in the second vector storage means and calculating their similarity; and (h) a similarity integration process of integrating the similarity output from the first similarity calculation process and the similarity output from the second similarity calculation process and outputting the result as a single similarity.

20. A recording medium on which is recorded a program for an information processing apparatus comprising: a storage device including word dictionary storage means for storing a word dictionary, first vector storage means for storing a first vector of a stored document according to the word dictionary, first pattern dictionary storage means for storing a first pattern dictionary for extracting semantic information of a document, second vector storage means for storing a second vector of the stored document according to the first pattern dictionary, second pattern dictionary storage means for storing a second pattern dictionary for extracting semantic information of a document, and third vector storage means for storing a third vector of the stored document according to the second pattern dictionary; an input device for inputting a document; an output device for outputting a similarity; and a data processing device controlled by a program, the program causing the data processing device to execute each of the following processes (a) to (k): (a) a word search process of decomposing the input document into a word string based on the word information stored in the word dictionary storage means; (b) a first vector generation process of counting the word frequencies of the decomposed word string and generating a first vector of the input document according to the word dictionary; (c) a first similarity calculation process of comparing the first vector generated in the first vector generation process with the first vector stored in the first vector storage means and calculating their similarity; (d) a document pattern generation process of treating the sequence of words decomposed by the word search process as a document pattern; (e) a first pattern search process of scanning the document pattern using the patterns stored in the first pattern dictionary storage means and decomposing it into a sequence of patterns and words; (f) a second vector generation process of counting the pattern frequencies of the sequence of patterns and words and generating a second vector of the input document according to the first pattern dictionary; (g) a second similarity calculation process of comparing the second vector generated in the second vector generation process with the second vector stored in the second vector storage means and calculating their similarity; (h) a second pattern search process of scanning the document pattern using the patterns stored in the second pattern dictionary storage means and decomposing it into a sequence of patterns and words; (i) a third vector generation process of counting the pattern frequencies of the sequence of patterns and words and generating a third vector of the input document according to the second pattern dictionary; (j) a third similarity calculation process of comparing the third vector generated in the third vector generation process with the third vector stored in the third vector storage means and calculating their similarity; and (k) a similarity integration process of integrating the similarities output from the first to third similarity calculation processes and outputting the result as a single similarity.