JP2003288366A

JP2003288366A - Similar text retrieval device

Info

Publication number: JP2003288366A
Application number: JP2002090099A
Authority: JP
Inventors: Taro Fujimoto; 太郎藤本; Atsushi Arima; 淳有馬
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-03-28
Filing date: 2002-03-28
Publication date: 2003-10-10

Abstract

PROBLEM TO BE SOLVED: To determine at high speed how closely a plurality of texts agree with one another, even when they do not match perfectly.

SOLUTION: The device comprises an input interface means 101 for inputting text, a preprocessing means 103 for preprocessing the input text, an N-gram conversion means 104 for creating N-gram elements from the text, a similarity calculation means 105 for calculating the degree of matching of the N-gram elements of a plurality of texts, and an output interface means 108 for outputting the calculation result of the similarity calculation means 105.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical field of the invention] The present invention relates to a similar text search device, and in particular to one that determines at high speed how similar a plurality of texts are even when they do not match completely.

【０００２】[0002]

[Prior art] Text search devices conventionally perform exact-match keyword search, which outputs only texts that exactly match the search key entered by the searcher. There is, however, a demand for "fuzzy search", which treats strings that do not match exactly, such as 「テキスト」 and 「テクスト」 (two katakana spellings of "text"), as matching. For example, Windows (registered trademark) is written in various ways depending on the person and the version, such as Windows-NT or WIN NT4, and it is often necessary to judge such expressions to be similar.

【０００３】[0003]

[Problem to be solved by the invention] Conventional search devices output only texts that exactly match the search key entered by the searcher. High-precision fuzzy search was believed to be time-consuming, so search systems offering it were not well developed.

[0004] Accordingly, an object of the present invention is to provide a device for searching at high speed for keywords whose notation varies, such as "Windows-NT" and "WIN NT4" above.

【０００５】[0005]

[Means for solving the problem] FIG. 1 shows the principle of the present invention. In FIG. 1, 1 is a similar text search device, 101 is an input interface means, 103 is a preprocessing means, 104 is an N-gram conversion means, 105 is a similarity calculation means, and 108 is an output interface means.

[0006] The above object of the present invention is achieved by the following (1) to (5).

[0007] (1) A similar text search device comprising: an input interface means 101 to which text is input; a preprocessing means 103 that preprocesses the input text; an N-gram conversion means 104 that creates N-gram elements for the text; a similarity calculation means 105 that calculates the degree of matching of the N-gram elements of a plurality of texts; and an output interface means 108 that outputs the calculation result of the similarity calculation means 105.

[0008] (2) A similar text search device comprising: an input interface means to which text is input; a text database means that holds the texts subject to similarity calculation; a preprocessing means that preprocesses text; an N-gram conversion means that creates N-gram elements for text; a similarity calculation means that calculates the degree of matching of the N-gram elements of a plurality of texts; a sorting means that outputs the calculation results in descending order of the degree of matching; and an output interface means that outputs the sorting result of the sorting means.

[0009] (3) A similar text search device comprising: an input interface means to which text is input; a text database means that holds the texts subject to similarity calculation; a preprocessing means that preprocesses text; a plurality of N-gram conversion means that create a plurality of types of N-gram elements with different values of N for text; a similarity calculation means that calculates, for each of the different types of N-gram elements, a similarity from the frequency of the respective N-gram elements; a similarity addition means that adds the similarity values calculated from the frequencies of the respective N-gram elements; a sorting means that outputs the outputs of the similarity addition means in descending order; and an output interface means that outputs the sorting result of the sorting means.

[0010] (4) A similar text search device comprising: an input interface means to which text is input; a text database means that holds the texts subject to similarity calculation; a preprocessing means that preprocesses text; an N-gram conversion means that creates N-gram elements for text; an index database means that holds, as an index, the N-gram elements created for the texts held in the text database means; an access means for the index database means; a similarity calculation means that calculates the degree of matching of the N-gram elements of a plurality of texts; a sorting means that outputs the calculation results in descending order of the degree of matching; and an output interface means that outputs the sorting result of the sorting means.

[0011] (5) A similar text search device comprising: an input interface means to which text is input; a text database means that holds the texts subject to similarity calculation; a preprocessing means that preprocesses text; a plurality of N-gram conversion means that create a plurality of types of N-gram elements with different values of N for text; an index database means that holds, as an index, the different types of N-gram elements created for the texts held in the text database means; an access means for the index database means; a similarity calculation means that calculates, for each of the different types of N-gram elements, a similarity from the frequency of the respective N-gram elements; a similarity addition means that adds the similarity values calculated from the frequencies of the respective N-gram elements; a sorting means that outputs the outputs of the similarity addition means in descending order; and an output interface means that outputs the sorting result of the sorting means.

[0012] This provides the following effects.

[0013] (1) N-gram elements are created for each text and matched against each other, so texts can be matched in a way that absorbs variations in expression, and fuzzy search can be performed accurately.

[0014] (2) One side of the texts to be compared is held in the text database in advance, so there is no need to input all the texts to be compared at every search, and similar texts can be found at high speed.

[0015] (3) One side of the texts is held in the database, a plurality of types of N-gram elements with different values of N are created, and the similarity is calculated from their frequencies. For texts whose apparent similarity rises with N = 2 (2-gram elements) merely because particles match, the 3-gram elements suppress this effect, so both the speed and the accuracy of the similarity judgment can be improved.

[0016] (4) The N-gram elements of the texts held in the text database are created in advance and held in the index database. When texts are compared, the similarity to the input text can therefore be calculated at high speed using the N-gram elements stored in the index database.

[0017] (5) A plurality of types of N-gram elements with different values of N are created from the input text, and a plurality of types of N-gram elements with different values of N are likewise created for the texts held in the text database and held in the index database. The similarity based on the frequency of N-gram elements can therefore be calculated at high speed, and the accuracy of that similarity is also improved.

【００１８】[0018]

[Embodiments of the invention] Embodiments of the present invention will now be described.

[0019] A. First embodiment of the present invention

The first embodiment of the present invention is described with reference to FIG. 1. FIG. 1(A) shows the first embodiment, and FIG. 1(B) illustrates the operation of its N-gram conversion means.

[0020] In FIG. 1, the similar text search device 1 comprises an input interface means 101, a preprocessing means 103, an N-gram conversion means 104, a similarity calculation means 105, and an output interface means 108.

[0021] The input interface means 101 receives the texts whose similarity is to be determined. For example, to find the similarity between the texts "Win NT4" and "Windows-NT", these texts are input here. To perform this input, a terminal device such as a personal computer is connected.

[0022] The preprocessing means 103 normalizes the text: for alphabetic text it converts all letters to uppercase, and it removes spaces, hyphens, punctuation marks, parentheses and the like, and converts half-width characters to full-width characters.
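The normalization steps listed above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the `to_fullwidth` flag, and the choice to treat every Unicode punctuation category as removable are assumptions made for the example.

```python
import unicodedata


def preprocess(text: str, to_fullwidth: bool = False) -> str:
    """Normalize a keyword before N-gram conversion (hypothetical helper)."""
    cleaned = []
    for ch in text.upper():
        # Drop whitespace and anything Unicode classifies as punctuation
        # (hyphens, brackets, commas, full stops and so on).
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        cleaned.append(ch)
    result = "".join(cleaned)
    if to_fullwidth:
        # Map printable half-width ASCII (U+0021..U+007E) onto the
        # full-width forms block (U+FF01..U+FF5E) by a fixed offset.
        result = "".join(
            chr(ord(ch) + 0xFEE0) if 0x21 <= ord(ch) <= 0x7E else ch
            for ch in result
        )
    return result


print(preprocess("Win NT4"))     # WINNT4
print(preprocess("Windows-NT"))  # WINDOWSNT
```

The full-width conversion is left optional here only so that the later examples can stay in plain ASCII; the description itself lists it as one of the preprocessing steps.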

[0023] The N-gram conversion means 104 converts text into N-grams, that is, splits it into elements of N characters each. As an example, the operation of converting WINNT4 into 2-grams (splitting it into two-character elements) is described with reference to FIG. 1(B).

[0024] First, the first two characters of WINNT4, WI, are taken.

[0025] Next, the window is shifted by one character from the start and two characters are taken. This yields IN.

[0026] The window is then shifted by one more character and two characters are taken. This yields NN.

[0027] By repeating this process, WINNT4 is split into the 2-gram elements (bigram elements) WI, IN, NN, NT, T4, as shown in FIG. 1(B).
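The sliding-window procedure described in the steps above can be written compactly. The following is a small sketch; the function name `ngrams` is chosen for illustration and does not come from the source.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Slide a window of n characters over text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(ngrams("WINNT4"))     # ['WI', 'IN', 'NN', 'NT', 'T4']
print(ngrams("WINNT4", 3))  # ['WIN', 'INN', 'NNT', 'NT4']
```

A text shorter than N characters produces no elements, so callers may want to handle the empty list explicitly.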

[0028] The similarity calculation means 105 calculates the similarity by comparing whether, for example, the bigram elements of two texts match. For instance, the two texts to be compared are each split into bigram elements, and the number of elements in their intersection is divided by the number of elements in their union. As shown in FIG. 1(B), for WINNT4 and WINDOWSNT (registered trademark), the intersection has 3 elements (WI, IN, NT) and the union has 10 elements (WI, IN, NN, NT, T4, ND, DO, OW, WS, SN), so the similarity of the two texts is 0.3.
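The intersection-over-union ratio described above is the measure commonly known as the Jaccard coefficient (a name the source does not use). A minimal sketch, assuming the texts have already been preprocessed and that the helper names `bigram_set` and `jaccard` are illustrative:

```python
def bigram_set(text: str) -> set[str]:
    """Distinct two-character elements of a preprocessed text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Size of the bigram intersection divided by the size of the union."""
    x, y = bigram_set(a), bigram_set(b)
    union = x | y
    if not union:
        return 0.0  # both texts too short to yield any bigram
    return len(x & y) / len(union)


print(jaccard("WINNT4", "WINDOWSNT"))  # 0.3
```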

[0029] When the two texts differ greatly in length, another method is to divide the number of elements in the intersection by the number of bigram elements of the shorter text. In this example, the intersection has 3 elements and the shorter text, WINNT4, has 5 bigram elements, so the similarity is 0.6.
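The alternative measure for texts of very different length can be sketched in the same style. The function name `shorter_text_similarity` is hypothetical, and counting the shorter text's bigram elements after removing duplicates is an assumption; in the WINNT4 example both ways of counting give 5.

```python
def bigram_set(text: str) -> set[str]:
    """Distinct two-character elements of a preprocessed text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def shorter_text_similarity(a: str, b: str) -> float:
    """Intersection size divided by the bigram count of the shorter text."""
    x, y = bigram_set(a), bigram_set(b)
    shorter = x if len(a) <= len(b) else y
    if not shorter:
        return 0.0  # the shorter text yields no bigram at all
    return len(x & y) / len(shorter)


print(shorter_text_similarity("WINNT4", "WINDOWSNT"))  # 0.6
```

Because the denominator ignores the longer text, a short keyword fully contained in a long text scores 1.0, which is the behaviour this variant is meant to provide.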

[0030] The output interface means 108 outputs the similarity calculated by the similarity calculation means 105 to, for example, the terminal device from which the texts were input to the input interface means 101.

[0031] The operation of FIG. 1 is described for the case of finding the similarity between the texts "WinNT4" and "Windows-NT".

[0032] (1) First, the user inputs the texts "WinNT4" and "Windows-NT" from a terminal device (not shown). The preprocessing means 103 preprocesses these texts into WINNT4 and WINDOWSNT and sends them to the N-gram conversion means 104.

[0033] (2) In this example the N-gram conversion means 104 operates as a 2-gram conversion means. It first splits WINNT4 into 2-gram elements as shown in FIG. 1(B) and sends the resulting WI, IN, NN, NT, T4 to the similarity calculation means 105. It then splits WINDOWSNT into 2-gram elements as shown in FIG. 1(B) and sends the resulting WI, IN, ND, DO, OW, WS, SN, NT to the similarity calculation means 105. If the same 2-gram element occurs more than once within a single text, only one copy is kept and the duplicates are deleted. The same applies to N-gram conversion with N ≠ 2.

[0034] (3) From these 2-gram elements, the similarity calculation means 105 obtains the intersection (WI, IN, NT) and the union (WI, IN, NN, NT, T4, ND, DO, OW, WS, SN). It divides the number of elements in the intersection, 3, by the number of elements in the union, 10, and outputs the resulting value, 0.3, via the output interface 108 to the terminal device from which the texts were input, where it is displayed. The user thereby learns that the similarity of the two texts is 0.3.

[0035] B. Second embodiment of the present invention

The second embodiment of the present invention is described with reference to FIGS. 2 and 3. FIG. 2 shows the second embodiment, and FIG. 3 illustrates its operation.

[0036] In FIG. 2, the similar text search device 2 creates from the text a plurality of types of N-gram elements with different values of N, adds together the similarity values calculated from the frequency of each type of N-gram, and judges the similarity from the sum. For example, two types of elements, 2-gram elements and 3-gram elements, are created from the text; the similarity calculated from the 2-gram elements and the similarity calculated from the 3-gram elements are added, and the similarity is judged from the sum.

[0037] When the text is a sentence, for example, matching two-character particles such as 「から」 (kara), 「より」 (yori) and 「では」 (dewa) raise the similarity under 2-gram elements. With 3-gram elements such matches are prevented, so the similarity can be calculated accurately. The following description covers the case of 2-gram and 3-gram conversion.

[0038] (1) First, the user inputs the texts "WinNT4" and "Windows-NT" from a terminal device (not shown). These texts are input to the preprocessing means 103 via the input interface means 101. The preprocessing means 103 preprocesses them into WINNT4 and WINDOWSNT and sends them to the N-gram conversion means 104.

[0039] (2) The N-gram conversion means 104 first splits these texts into 2-gram elements and outputs the result, shown in FIG. 3(B), to the similarity calculation means 105, which calculates the similarity for the 2-gram elements (3/10 in this example). It then splits them into 3-gram elements and outputs the result, shown in FIG. 3(C), to the similarity calculation means 105, which calculates the similarity for the 3-gram elements (1/10 in this example).

[0040] (3) The 2-gram similarity and the 3-gram similarity are then sent to the similarity addition means 106, which computes their sum. The sum is output via the output interface means 108 to the terminal device from which the texts were input, where it is displayed.

[0041] The similarity varies with the value of N and with the combination of N values used, so the criterion separating similar from dissimilar differs from case to case.

[0042] In the above case, the N-gram conversion means and the similarity calculation means may each be provided in as many units as there are types of N, or only one of each may be provided. Providing several increases speed; providing one saves cost.

[0043] C. Third embodiment of the present invention

The third embodiment of the present invention is described with reference to FIG. 4.

[0044] In FIG. 4, the similar text search device 3 stores in advance, in the text database 109, the plurality of texts against which similarity is to be calculated. When a similarity is calculated, there is therefore no need to input a plurality of texts from the terminal device (not shown); only one text has to be input, which greatly reduces the input cost and allows the similarity to be calculated at high speed. Because the text database 109 holds a plurality of texts, the results are output via the sorting means 107 in descending order of similarity.

[0045] (1) First, a plurality of texts T₁, T₂, T₃, ... for which similarity is to be calculated are stored in advance in the text database 109 from an input means (not shown).

[0046] (2) The user inputs a text T₀ from a terminal device (not shown). The text T₀ is sent via the input interface means 102 to the preprocessing means 103, where the preprocessing described above is performed. The N-gram conversion means 104 then converts it into N-grams, for example splitting it into 2-gram elements, and the result is sent to the similarity calculation means 105.

[0047] (3) From the text database 109, the text T₁ is read out first and preprocessed by the preprocessing means 103. The N-gram conversion means 104 splits it into 2-gram elements in the same way as the text T₀, the similarity calculation means 105 calculates the similarity S₁ between the texts T₀ and T₁, and this value is sent to the sorting means 107.

[0048] (4) Next, the text T₂ is read from the text database 109, and in the same way the similarity calculation means 105 calculates the similarity S₂ between the texts T₀ and T₂, which is sent to the sorting means 107. The texts T₃, T₄, ... are read from the text database 109 in the same manner, and their similarities S₃, S₄, ... to the text T₀ are calculated and sent to the sorting means 107.

[0049] (5) Once similarities have been calculated for all the texts in the text database 109 in this way, the sorting means 107 sorts them in descending order of similarity and outputs either all of them or a predetermined number of them, via the output interface means 108, to the terminal device of the user who input the text T₀.

[0050] The description above covered an example in which the preprocessing means and N-gram conversion means used for the user-input text T₀ are separate from those used for the texts T₁, T₂, ... read from the text database 109, but the same ones may of course be used for both.

[0051] D. Fourth embodiment of the present invention

FIG. 5 shows the fourth embodiment of the present invention. In FIG. 5, the similar text search device 4 creates, for both the text T₀ input by the user and the texts T₁, T₂, ... stored in the text database 109, a plurality of types of N-gram elements with different values of N, calculates the similarities, and adds them to judge the degree of similarity.

[0052] The operation of FIG. 5 is described below.

[0053] (1) First, a plurality of texts T₁, T₂, T₃, ... for which similarity is to be calculated are stored in advance in the text database 109 from an input means (not shown).

[0054] (2) The user inputs a text T₀ from a terminal device (not shown). The text T₀ is sent via the input interface means 102 to the preprocessing means 103, where the preprocessing described above is performed. The N-gram conversion means 104, 104 then split it into a plurality of types of N-gram elements with different values of N, for example 2-gram elements and 3-gram elements, which are sent to the respective similarity calculation means 105, 105.

[0055] (3) From the text database 109, the text T₁ is read out first and preprocessed by the preprocessing means 103. The N-gram conversion means 104, 104 split it into 2-gram elements and 3-gram elements in the same way as the text T₀ and send them to the similarity calculation means 105, 105. The similarity calculation means 105, 105 calculate the similarities S₁₂ and S₁₃ between the texts T₀ and T₁ based on the 2-gram elements and the 3-gram elements respectively. These are sent to the similarity addition means 106, which obtains their sum (S₁₂ + S₁₃) and sends it to the sorting means 107.

[0056] (4) Next, the text T₂ is read from the text database 109, and in the same way the similarity calculation means 105, 105 calculate the similarities S₂₂ and S₂₃ between the texts T₀ and T₂ based on the 2-gram elements and the 3-gram elements. These are sent to the similarity addition means 106, which obtains their sum (S₂₂ + S₂₃) and sends it to the sorting means 107.

[0057] (5) In the same way, the texts T₃, T₄, ... are read from the text database 109, and the similarities (S₃₂, S₃₃), (S₄₂, S₄₃), ... to the text T₀ based on the 2-gram and 3-gram elements are calculated. These are sent to the similarity addition means 106 to obtain the sums (S₃₂ + S₃₃), (S₄₂ + S₄₃), ..., which are sent to the sorting means 107.

[0058] (6) Once the sum of the 2-gram and 3-gram similarities has been calculated for all the texts in the text database 109 in this way, the sorting means 107 sorts the texts in descending order of this sum and outputs either all of them or a predetermined number of them, via the output interface means 108, to the terminal device of the user who input the text T₀.
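The flow in steps (1) to (6) can be illustrated by ranking stored texts on the sum of their 2-gram and 3-gram similarities. This is a sketch under assumptions: the sample texts are borrowed from the fifth embodiment's example, the texts are already preprocessed, and the resulting scores are computed by this example rather than quoted from the source. All function names are hypothetical.

```python
from fractions import Fraction


def ngram_set(text: str, n: int) -> set[str]:
    """Distinct n-character elements of a preprocessed text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, n: int) -> Fraction:
    """Intersection size over union size for the n-gram sets of a and b."""
    x, y = ngram_set(a, n), ngram_set(b, n)
    union = x | y
    return Fraction(len(x & y), len(union)) if union else Fraction(0)


def rank_by_sum(query: str, database: list[str], ns=(2, 3)):
    """Sort stored texts by the sum of their per-N similarities, descending."""
    scored = [
        (sum((similarity(query, text, n) for n in ns), Fraction(0)), text)
        for text in database
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


database = ["WINDOWSNT", "WINDOWS2000", "WINME"]  # assumed sample texts
for total, text in rank_by_sum("WINNT4", database):
    print(f"{text}: {total}")
```

On these sample texts the summed score places WINME ahead of WINDOWSNT, whereas 2-grams alone rank WINDOWSNT first, which shows how the choice of N values can change the ordering.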

[0059] The description above covered an example in which the preprocessing means and N-gram conversion means used for the user-input text T₀ are separate from those used for the texts T₁, T₂, ... read from the text database 109, but the same ones may be used for both. A single N-gram conversion means and a single similarity calculation means may likewise be shared.

[0060] E. Fifth embodiment of the present invention

The fifth embodiment of the present invention is described with reference to FIGS. 6 to 9. FIG. 6 shows the fifth embodiment, FIG. 7 illustrates its index database, FIG. 8 illustrates the operation when data is registered in the index database, and FIG. 9 illustrates the operation during search and similarity calculation.

[0061] In FIG. 6, in the similar text search device 5, the index database 111 stores the relationship between the N-gram elements of the search target texts T₁, T₂, ... and the texts themselves, in table form as shown for example in FIG. 7(D).

[0062] When the search target texts T₁, T₂, T₃ are WindowsNT, Windows2000 and WinME, as shown in FIG. 7(A), they are converted into, for example, 2-grams as shown in FIG. 7(B), and, as shown in FIG. 7(D), each 2-gram element WI, IN, ... is stored together with the names T₁, T₂, T₃ of the texts to which it belongs. In the text T₂ the 2-gram element 00 is produced twice, but only one copy of an identical element is kept, which gives the table in FIG. 7(D). Here t₁ is the number of 2-gram elements of the text T₁ (t₁ = 8 in this case), t₂ is the number of 2-gram elements of the text T₂ (t₂ = 9 in this case), and t₃ is the number of 2-gram elements of the text T₃ (t₃ = 4 in this case).

【００６３】When the similarity with the text T₀ (WinNT4) shown in FIG. 7(A) is to be obtained, the 2-gram elements of text T₀ are collated with the 2-gram elements of the texts T₁, T₂, T₃ stored in the index database 111, and the number of matches is obtained. At this time, the 2-gram elements that exist only in text T₀, namely NN and T4, are registered in the index database 111 in preparation for collation with other texts. As a result, similarities of 3/10 for T₀ and T₁, 2/12 for T₀ and T₂, and 2/7 for T₀ and T₃ are obtained. Because the 2-gram elements of the texts T₁, T₂, T₃ and their counts (also called frequencies) t₁, t₂, t₃ are registered in this way, the similarity can be calculated at high speed.
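The figures quoted in this paragraph can be checked with a short Python sketch. It is an illustration only, not the patented implementation: the function names `preprocess`, `ngram_set`, and `match_ratio` are hypothetical, and the denominator (the number of distinct 2-gram elements appearing in either text) is an assumption inferred from the values 3/10, 2/12, and 2/7, with which it is consistent.

```python
def preprocess(text: str) -> str:
    """Uppercase the text and delete blanks and hyphens (keyword cleaning)."""
    return text.upper().replace(" ", "").replace("-", "")


def ngram_set(text: str, n: int = 2) -> set[str]:
    """Return the distinct N-gram elements of the preprocessed text."""
    cleaned = preprocess(text)
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def match_ratio(a: str, b: str, n: int = 2) -> tuple[int, int]:
    """Return (matching elements, distinct elements in either text).

    Assumption: the denominator is the size of the union of the two
    element sets, which reproduces the ratios quoted in the text.
    """
    ga, gb = ngram_set(a, n), ngram_set(b, n)
    return len(ga & gb), len(ga | gb)


if __name__ == "__main__":
    for name in ("WindowsNT", "Windows2000", "WinME"):
        matches, total = match_ratio("WinNT4", name)
        print(f"WinNT4 vs {name}: {matches}/{total}")
```

The ratio is returned as an unreduced pair so that 2/12 is shown as written in the text rather than as 1/6.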

【００６４】Alternatively, the index database 111 may store the N-gram elements (2-gram elements in this example) of the texts T₁, T₂, T₃, ... in the form shown in FIG. 7(C).

【００６５】The operation of FIG. 6 will be described for the case of registering data in the index database 111, shown in FIG. 8, and for the case of similarity calculation, shown in FIG. 9.

【００６６】(1) Data registration
S1. First, a plurality of texts T₁, T₂, T₃, ... for which similarity is to be calculated, as shown in FIG. 7(A), are stored in advance in the text database 109 from input means (not shown). Then the first text T₁ obtained from the list is sent to the preprocessing means 103, where preprocessing such as converting lowercase letters to uppercase and deleting blanks and hyphens is performed to clean the keyword.

【００６７】S2. The text preprocessed in this way is sent to the N-gram conversion means 104 and fragmented into, for example, 2-gram elements.

【００６８】S3. When the same 2-gram element occurs more than once in the same text, the duplicates are deleted so that only one remains.

【００６９】S4. The 2-gram elements of text T₁ obtained in this way are stored in the index database 111 by the database access interface means 110, together with the text name T₁ and the 2-gram element count t₁. The same processing is performed for the other texts T₂ and T₃ obtained from the list of the text database 109, and the index database 111 is created as shown in FIG. 7(D).
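Registration steps S1 to S4 can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory dictionaries stand in for the index database 111, and the names `ngram_set`, `build_index`, `postings`, and `counts` are hypothetical, not taken from the specification.

```python
from collections import defaultdict


def ngram_set(text: str, n: int = 2) -> set[str]:
    """S1-S3: clean the keyword, fragment it, and keep one copy per element."""
    cleaned = text.upper().replace(" ", "").replace("-", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def build_index(texts: dict[str, str], n: int = 2):
    """S4: store each element with the names of the texts it belongs to.

    Returns (postings, counts), where postings maps an N-gram element to
    the set of text names containing it, and counts maps a text name to
    its number of distinct elements (t1, t2, ... in the description).
    """
    postings: dict[str, set[str]] = defaultdict(set)
    counts: dict[str, int] = {}
    for name, text in texts.items():
        grams = ngram_set(text, n)
        counts[name] = len(grams)
        for gram in grams:
            postings[gram].add(name)
    return dict(postings), counts


if __name__ == "__main__":
    postings, counts = build_index(
        {"T1": "WindowsNT", "T2": "Windows2000", "T3": "WinME"}
    )
    print(counts)
    print(sorted(postings["WI"]))
```

Storing the element counts alongside the element-to-text table is what lets a later query compute a similarity without re-reading the original texts.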

【００７０】(2) Similarity calculation
S10. The user inputs the text T₀ shown in FIG. 7(A) from a terminal device (not shown). The input text T₀ (the input keyword) is sent via the input interface means 102 to the preprocessing means 103, where it is preprocessed and the input keyword is cleaned.

【００７１】S11. The text T₀ preprocessed in this way is fragmented into 2-gram elements by the N-gram conversion means 104, as shown in FIG. 7(B), in the same way as the texts T₁, T₂, T₃.

【００７２】S12. If any of the 2-gram elements are duplicated, only one of each is kept and the other duplicates are deleted.

【００７３】S13. When the 2-gram elements of the input text T₀ obtained in this way are input from the N-gram conversion means 104 to the similarity calculation means 105, the similarity calculation means 105 acquires the record for text T₁, shown in FIG. 7(D), from the index database 111 via the database access interface means 110.

【００７４】S14. The similarity is then calculated. This gives a similarity S₁ = 3/10 between the texts T₀ and T₁.

【００７５】S15. In the same way, the similarities S₂ and S₃ between text T₀ and all the other texts T₂ and T₃ entered in the list are calculated, giving S₂ = 2/12 and S₃ = 2/7.

【００７６】S16. These similarities S₁, S₂, S₃ are sent to the sorting means 107. The sorting means 107 sorts them in descending order and outputs them, via the output interface means 108, to the terminal device of the user who input the text T₀.
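Steps S10 to S16 can be sketched in Python as below. This is a sketch under assumptions rather than the device itself: the dictionaries stand in for the index database 111, the names `build_index` and `search` are hypothetical, and the similarity is taken to be the number of matching elements divided by the number of distinct elements in either text, which agrees with the values S₁ = 3/10, S₂ = 2/12, and S₃ = 2/7 given above.

```python
from collections import Counter, defaultdict


def ngram_set(text: str, n: int = 2) -> set[str]:
    """Clean the keyword, fragment it, and drop duplicate elements."""
    cleaned = text.upper().replace(" ", "").replace("-", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def build_index(texts: dict[str, str], n: int = 2):
    """Build the element-to-text table and the per-text element counts."""
    postings: dict[str, set[str]] = defaultdict(set)
    counts: dict[str, int] = {}
    for name, text in texts.items():
        grams = ngram_set(text, n)
        counts[name] = len(grams)
        for gram in grams:
            postings[gram].add(name)
    return dict(postings), counts


def search(query: str, postings, counts, n: int = 2):
    """Return (text name, matches, total) tuples, highest similarity first."""
    query_grams = ngram_set(query, n)            # S10-S12
    matches: Counter = Counter()
    for gram in query_grams:                     # S13: look up the index
        for name in postings.get(gram, ()):
            matches[name] += 1
    results = []
    for name, t in counts.items():               # S14-S15
        m = matches[name]
        total = len(query_grams) + t - m         # distinct elements in either text
        results.append((name, m, total))
    # S16: sort by similarity, guarding against an empty denominator.
    results.sort(key=lambda r: r[1] / r[2] if r[2] else 0.0, reverse=True)
    return results


if __name__ == "__main__":
    postings, counts = build_index(
        {"T1": "WindowsNT", "T2": "Windows2000", "T3": "WinME"}
    )
    for name, m, total in search("WinNT4", postings, counts):
        print(f"{name}: {m}/{total}")
```

Because the stored count t and the match count m are enough to derive the denominator, the query touches only the index entries for its own elements.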

【００７７】In the above description, an example was given in which the preprocessing means and N-gram conversion means for the text T₀ input by the user are separate from the preprocessing means and N-gram conversion means for the texts T₁, T₂, ... read from the text database 109, but the same means may be used for both.

【００７８】F. Sixth Embodiment of the Invention
A sixth embodiment of the present invention will be described with reference to FIG. 10. In FIG. 10, the similar text search device 6 creates a plurality of types of N-gram elements with different values of N for both the text T₀ input by the user and the texts T₁, T₂, ... stored in the text database 109, calculates a similarity for each type, and adds the similarities to determine the degree of similarity. A plurality of index databases 111 are also provided, one for each type of N-gram element with a different value of N.

【００７９】(1) First, a plurality of texts T₁, T₂, T₃ for which similarity is to be calculated, as shown in FIG. 7(A), are stored in advance in the text database 109 from input means (not shown). Then the first text T₁ is sent to the preprocessing means 103 and preprocessed, after which the N-gram conversion means 104, 104 fragment it into a plurality of types of N-gram elements with different values of N, for example 2-gram elements and 3-gram elements. These are stored, via the database access interface means 110, in the respective index databases 111, 111 for the different values of N. The same processing is performed for the texts T₂ and T₃, which are stored as shown in FIG. 7(D).

【００８０】(2) The user inputs the text T₀ from a terminal device (not shown). This text T₀ is sent via the input interface means 102 to the preprocessing means 103, where the preprocessing described above is performed. The N-gram conversion means 104, 104 then fragment it into a plurality of types of N-gram elements with different values of N, for example 2-gram elements and 3-gram elements, which are sent to the respective similarity calculation means 105, 105.

【００８１】(3) First, the 2-gram elements and 3-gram elements for text T₁ are read out from the index databases 111, 111. The similarity calculation means 105, 105 compare them with the 2-gram elements and 3-gram elements of text T₀, 2-gram against 2-gram and 3-gram against 3-gram, yielding the 2-gram similarity S₁₂ and the 3-gram similarity S₁₃ between texts T₀ and T₁. These are sent to the similarity addition means 106, where their sum (S₁₂ + S₁₃) is obtained and sent to the sorting means 107.

【００８２】(4) Next, the 2-gram elements and 3-gram elements for text T₂ are read out from the index databases 111, 111, and in the same way the similarity calculation means 105, 105 calculate the similarities S₂₂ and S₂₃ from the 2-gram elements and 3-gram elements of texts T₀ and T₂. These are sent to the similarity addition means 106, where their sum (S₂₂ + S₂₃) is obtained and sent to the sorting means 107. Likewise, the similarities S₃₂ and S₃₃ are calculated from the 2-gram elements and 3-gram elements of texts T₀ and T₃, and their sum (S₃₂ + S₃₃) is obtained by the similarity addition means 106 and sent to the sorting means 107.

【００８３】(5) The sorting means 107 sorts the similarity sums (S₁₂ + S₁₃), (S₂₂ + S₂₃), (S₃₂ + S₃₃) sent from the similarity addition means 106 in descending order of value and outputs them in that order, via the output interface means 108, to the terminal device of the user who input the text T₀.
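Steps (1) to (5) of the sixth embodiment can be illustrated with the following Python sketch, which sums a 2-gram similarity and a 3-gram similarity per text and sorts by the sum. The per-N similarity formula is an assumption: it reuses the ratio of matching elements to distinct elements in either text that is consistent with the fifth embodiment's figures, since this passage does not restate the formula. All function names are hypothetical.

```python
from fractions import Fraction


def ngram_set(text: str, n: int) -> set[str]:
    """Clean the text and return its distinct N-gram elements."""
    cleaned = text.upper().replace(" ", "").replace("-", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def similarity(a: str, b: str, n: int) -> Fraction:
    """Assumed per-N similarity: matching elements over distinct elements."""
    ga, gb = ngram_set(a, n), ngram_set(b, n)
    union = ga | gb
    return Fraction(len(ga & gb), len(union)) if union else Fraction(0)


def summed_similarity(a: str, b: str, ns=(2, 3)) -> Fraction:
    """Role of the similarity addition means: add the per-N similarities."""
    return sum((similarity(a, b, n) for n in ns), Fraction(0))


def rank(query: str, texts: dict[str, str], ns=(2, 3)):
    """Role of the sorting means: order texts by summed similarity."""
    scored = [(name, summed_similarity(query, text, ns))
              for name, text in texts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    texts = {"T1": "WindowsNT", "T2": "Windows2000", "T3": "WinME"}
    for name, score in rank("WinNT4", texts):
        print(name, score, float(score))
```

Under these assumptions the ordering for WinNT4 differs from the 2-gram-only ranking, which shows how adding a second value of N can change which candidate comes first.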

【００８４】In the above description, an example was given in which the preprocessing means and N-gram conversion means for the text T₀ input by the user are separate from those for the texts T₁, T₂, ... read from the text database 109, but the same means may be used. Likewise, a single N-gram conversion means, similarity calculation means, and index database may be used. Components having the same function may be shared.

【００８５】In the above description, examples in which the texts are program names and the like were described, but the present invention is not limited to this. It can be applied to Japanese documents in the same way. For example, in a book search system, even if the correct title of a book has been forgotten, candidate titles can be obtained by inputting a well-known passage that forms part of it.

【００８６】By performing a fuzzy search on a specific search target, a search that absorbs notational variation and input errors can be performed. By storing preprocessed data for the search targets in the index database, the response speed after the searcher enters a query can be increased.

【００８７】When the lengths of the two texts differ greatly, the similarity can be obtained by dividing the number of common N-gram elements by the number of N-gram elements of the shorter text. This allows the similarity to be calculated faster than the conventional similarity calculation based on the length of the longest subsequence.
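The variant for texts of very different length can be sketched as follows. This is an illustration only: the function name `shorter_text_similarity` is hypothetical, and the smaller of the two distinct-element counts is used as a stand-in for "the number of N-gram elements of the shorter text".

```python
def ngram_set(text: str, n: int = 2) -> set[str]:
    """Clean the text and return its distinct N-gram elements."""
    cleaned = text.upper().replace(" ", "").replace("-", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def shorter_text_similarity(a: str, b: str, n: int = 2) -> float:
    """Divide the common element count by the smaller element count.

    Assumption: the text with fewer distinct elements is treated as the
    shorter text. Returns 0.0 when either text yields no elements.
    """
    ga, gb = ngram_set(a, n), ngram_set(b, n)
    smaller = min(len(ga), len(gb))
    return len(ga & gb) / smaller if smaller else 0.0


if __name__ == "__main__":
    print(shorter_text_similarity("WinME", "WindowsNT"))
    print(shorter_text_similarity("NT", "WindowsNT"))
```

With this denominator, a short query that is wholly contained in a long text scores 1.0 instead of being penalized for the long text's extra elements.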

【００８８】To further increase speed when the present invention is applied to a search engine, the texts held in advance as data can be preprocessed, converted to N-grams, and stripped of duplicate N-gram elements, and the results stored.

【００８９】Embodiments of the present invention are additionally noted below.

【００９０】(Supplementary Note 1) A similar text search device comprising: input interface means for inputting text; preprocessing means for preprocessing the input text; N-gram conversion means for creating N-gram elements for the text; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; and output interface means for outputting the calculation result of the similarity calculation means.

【００９１】(Supplementary Note 2) A similar text search device comprising: input interface means for inputting text; preprocessing means for preprocessing the input text; N-gram conversion means for creating N-gram elements with different values of N for the text; similarity calculation means for calculating, from the N-gram frequencies, the degree of coincidence of N-gram elements for a plurality of texts; and output interface means for outputting the calculation result of the similarity calculation means.

【００９２】(Supplementary Note 3) A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; N-gram conversion means for creating N-gram elements for text; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; sorting means for outputting the calculation results in descending order of the degree of coincidence; and output interface means for outputting the sorting result of the sorting means.

【００９３】(Supplementary Note 4) A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; a plurality of N-gram conversion means for creating a plurality of types of N-gram elements with different values of N for text; similarity calculation means for calculating, for each of the different types of N-gram elements, a similarity from the frequency of those N-gram elements; similarity addition means for adding the similarity values calculated from the frequencies of the respective N-gram elements; sorting means for outputting the outputs of the similarity addition means in descending order; and output interface means for outputting the sorting result of the sorting means.

【００９４】(Supplementary Note 5) A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; N-gram conversion means for creating N-gram elements for text; index database means for holding, as an index, the N-gram elements created for the texts held in the text database means; access means for accessing the index database means; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; sorting means for outputting the calculation results in descending order of the degree of coincidence; and output interface means for outputting the sorting result of the sorting means.

【００９５】(Supplementary Note 6) A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; a plurality of N-gram conversion means for creating a plurality of types of N-gram elements with different values of N for text; index database means for holding, as an index, the different types of N-gram elements created for the texts held in the text database means; access means for accessing the index database means; similarity calculation means for calculating, for each of the different types of N-gram elements, a similarity from the frequency of those N-gram elements; similarity addition means for adding the similarity values calculated from the frequencies of the respective N-gram elements; sorting means for outputting the outputs of the similarity addition means in descending order; and output interface means for outputting the sorting result of the sorting means.

【００９６】[0096]

[Effects of the Invention] According to the present invention, the following effects can be obtained.

【００９７】(1) Because N-gram elements are created for each text and matched against each other, text matches can be found in a way that absorbs variation in expression, so fuzzy search can be performed accurately.

【００９８】(2) Because one of the texts to be compared is held in the text database in advance, there is no need to input all the texts to be compared at each search, and similar texts can be found at high speed.

【００９９】(3) One of the texts is held in the database, a plurality of types of N-gram elements with different values of N are created, and the similarity is calculated from their frequencies. Therefore, for a text whose apparent similarity rises with N = 2 (2-gram elements) merely because particles match, the 3-gram elements suppress this effect, and the speed and accuracy of the similarity determination can be improved.

【０１００】(4) Because N-gram elements are created in advance for the texts held in the text database and are held in the index database, the similarity to an input text can be calculated at high speed at comparison time using the N-gram elements stored in the index database.

【０１０１】(5) Because a plurality of types of N-gram elements with different values of N are created for the input text, and a plurality of types of N-gram elements with different values of N are likewise created for the texts held in the text database and held in the index database, the similarity based on N-gram element frequency can be calculated at high speed, and the accuracy of the similarity can be improved.

[Brief description of drawings]

FIG. 1 shows a first embodiment of the present invention.

FIG. 2 shows a second embodiment of the present invention.

FIG. 3 is an explanatory diagram of the operation of the second embodiment of the present invention.

FIG. 4 shows a third embodiment of the present invention.

FIG. 5 shows a fourth embodiment of the present invention.

FIG. 6 shows a fifth embodiment of the present invention.

FIG. 7 is an explanatory diagram of the index database.

FIG. 8 is an explanatory diagram of the operation when data is registered in the index database.

FIG. 9 is an explanatory diagram of the operation during search and similarity calculation.

FIG. 10 shows a sixth embodiment of the present invention.

[Explanation of symbols]

1 to 6: similar text search device
101, 102: input interface means
103: preprocessing means
104: N-gram conversion means
105: similarity calculation means
106: similarity addition means
107: sorting means
108: output interface means
109: text database
110: database access interface means
111: index database

Claims

[Claims]

1. A similar text search device comprising: input interface means for inputting text; preprocessing means for preprocessing the input text; N-gram conversion means for creating N-gram elements for the text; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; and output interface means for outputting the calculation result of the similarity calculation means.

2. A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; N-gram conversion means for creating N-gram elements for text; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; sorting means for outputting the calculation results in descending order of the degree of coincidence; and output interface means for outputting the sorting result of the sorting means.

3. A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; a plurality of N-gram conversion means for creating a plurality of types of N-gram elements with different values of N for text; similarity calculation means for calculating, for each of the different types of N-gram elements, a similarity from the frequency of those N-gram elements; similarity addition means for adding the similarity values calculated from the frequencies of the respective N-gram elements; sorting means for outputting the outputs of the similarity addition means in descending order; and output interface means for outputting the sorting result of the sorting means.

4. A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; N-gram conversion means for creating N-gram elements for text; index database means for holding, as an index, the N-gram elements created for the texts held in the text database means; access means for accessing the index database means; similarity calculation means for calculating the degree of coincidence of N-gram elements for a plurality of texts; sorting means for outputting the calculation results in descending order of the degree of coincidence; and output interface means for outputting the sorting result of the sorting means.

5. A similar text search device comprising: input interface means for inputting text; text database means for holding the texts subject to similarity calculation; preprocessing means for preprocessing text; a plurality of N-gram conversion means for creating a plurality of types of N-gram elements with different values of N for text; index database means for holding, as an index, the different types of N-gram elements created for the texts held in the text database means; access means for accessing the index database means; similarity calculation means for calculating, for each of the different types of N-gram elements, a similarity from the frequency of those N-gram elements; similarity addition means for adding the similarity values calculated from the frequencies of the respective N-gram elements; sorting means for outputting the outputs of the similarity addition means in descending order; and output interface means for outputting the sorting result of the sorting means.