JP2001067378A - Calculation method and device for similarity of character string and recording medium

Info

Publication number: JP2001067378A
Application number: JP2000188490A
Authority: JP
Inventors: Kyoji Umemura; 恭司梅村
Original assignee: Sumitomo Electric Industries Ltd
Current assignee: Sumitomo Electric Industries Ltd
Priority date: 1999-06-23
Filing date: 2000-06-22
Publication date: 2001-03-16

Abstract

PROBLEM TO BE SOLVED: To calculate the similarity of character strings with emphasis on words, and to retrieve documents without morphological analysis. SOLUTION: An input character string and a document from a document database are treated as two character strings, and a similarity calculation part 14 calculates their similarity. A matching character string similarity calculation part 21 of the part 14 calculates a character string score for a partial character string common to both strings and adds this score to the similarity of the remaining partial strings. An arbitrary character string similarity calculation part 22 shifts the correspondence between the two strings to look for a larger similarity, and a maximum value selection part 23 selects the larger similarity. Repeating these processes totals the scores of the partial character strings that are common to both strings and consistent with the order of each, which yields the final similarity. A retrieval result output part 13 selects documents with high similarity from the document database as the retrieval result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to the field of information retrieval and is particularly suitable for judging the similarity between an input character string and documents registered in a database.

[0002]

[Prior Art] Information retrieval, which extracts a desired document from a database of documents written in natural language, is widely practiced. In such retrieval, language is treated as a character string formed by combining words, each consisting of several characters. Retrieval is performed by comparing character strings and calculating their similarity, either to measure how well a string fits the desired content, or to pick out the single most similar string or several highly similar strings. In the following, unless otherwise stated, a similarity value is 0 or greater, and a larger value indicates a higher similarity.

[0003] Methods for calculating the similarity between character strings fall broadly into two types: methods using morphological analysis and methods using n-grams.

[0004] A method using morphological analysis is disclosed, for example, in Gerard Salton and Christopher Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24, pp.513-523, 1988. To obtain the similarity between two character strings with this method, both strings are first decomposed into sequences of words (word strings) by morphological analysis using a dictionary and grammatical knowledge. Next, the two word strings are compared to find the matching words. A weight is then assigned to each matching word, and the weights are added over all matching words. The resulting sum is the similarity based on morphological analysis.
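As a rough illustration of the weighted word-match sum described in [0004], the following Python sketch assumes the two strings have already been split into words by some morphological analyzer (not shown). The function name `word_match_similarity`, the `weights` table, and the default weight of 1.0 are hypothetical choices for illustration and are not taken from the cited paper.

```python
from typing import Dict, Sequence


def word_match_similarity(
    words_a: Sequence[str],
    words_b: Sequence[str],
    weights: Dict[str, float],
    default_weight: float = 1.0,  # assumption: weight for words missing from the table
) -> float:
    """Sum the weights of the words that occur in both word strings."""
    # Assumption: each distinct matching word is counted once.
    matching = set(words_a) & set(words_b)
    return float(sum(weights.get(word, default_weight) for word in matching))


# Hypothetical pre-segmented inputs and weights.
query = ["string", "similarity", "method"]
document = ["method", "for", "string", "search"]
weights = {"string": 2.0, "method": 0.5}
print(word_match_similarity(query, document, weights))
```

Because the sum runs over a set of matching words, reordering the words of either input leaves the result unchanged.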

[0005] A method using n-grams is disclosed, for example, in Yasushi Ogawa and Toru Matsuda, Overlapping statistical word indexing: A new indexing method for Japanese text, In Proceedings of SIGIR'97, Philadelphia PA, USA, pp.226-234, 1997. This method does not use morphological analysis. To obtain the similarity between character strings with this method, the matching parts of the two strings are first found. A weight is then assigned to each matching part, and the weights are added over all matching parts. The resulting sum is the similarity based on n-grams.
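The n-gram approach of [0005] can be sketched in Python as below. This is only a minimal illustration under stated assumptions: the matching parts are taken to be character n-grams of a fixed length `n` (2 by default), each distinct shared n-gram is counted once, and the uniform weight of 1.0 is a placeholder. The names `char_ngrams` and `ngram_similarity` are hypothetical and do not come from the cited paper.

```python
from typing import Callable, Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of all character n-grams of text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(
    a: str,
    b: str,
    n: int = 2,  # assumption: bigrams
    weight: Callable[[str], float] = lambda gram: 1.0,  # assumption: uniform weight
) -> float:
    """Sum the weights of the n-grams shared by both strings."""
    shared = char_ngrams(a, n) & char_ngrams(b, n)
    return float(sum(weight(gram) for gram in shared))


print(ngram_similarity("abcd", "bcde"))  # shares "bc" and "cd"
print(ngram_similarity("abcd", "cdab"))  # shares "ab" and "cd", in a different order
```

Both calls return the same value even though the shared parts appear in a different order in the second pair.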

[0006] Both the morphological-analysis method and the n-gram method described above amount to a sum over the keywords that make up a document, which is the common approach in information retrieval. In these sums, however, the order in which the keywords appear is not reflected in the similarity.

[0007] In Japanese, the first step of information retrieval is to decompose a document into a word string, but doing this correctly is not easy. A word segmentation error means that the extraction of keywords for effective retrieval has failed, and it degrades retrieval performance.

[0008] For patterns that have no word boundaries at all, identifying the parts useful for information retrieval is even harder. A character string can be regarded as a pattern in which characters are connected one-dimensionally. Even for such a one-dimensional pattern, however, the difficulty of dividing the pattern has been one reason why information retrieval methodology was hard to apply.

[0009] Against this background, there are methods that obtain the similarity as a sum computed by dynamic programming (hereinafter also written DP). Two such methods exist: one using simple addition and one that assigns weights to characters.

[0010] The simple-addition method is disclosed, for example, in Robert R. Korfhage, Information Storage and Retrieval, pp.300-304, Wiley Computer Publishing, 1997. The similarity between character strings under this method (hereinafter, the additive DP similarity) is obtained as follows. Let α and β be character strings, and let x and y be distinct character strings of length 1 (for convenience, a character is also called a character string of length 1). "" denotes the character string of length 0 (hereinafter, the empty string). The function MAX takes real-valued arguments and returns the largest of them (unless otherwise stated, the same function MAX is used for the other similarities as well).

[0011] The additive DP similarity SIM₁ is obtained by recursively applying the following equations according to how the leading characters of the arguments match. When both arguments are empty strings:

SIM₁("", "") = 0.0   (1)

When the arguments are distinct character strings of length 1 or less:

SIM₁(x, y) = 0.0   (2)

When the first characters are the same:

SIM₁(xα, xβ) = MAX(SIM₁(α, xβ), SIM₁(xα, β), 1.0 + SIM₁(α, β))   (3)

When the first characters differ:

SIM₁(xα, yβ) = MAX(SIM₁(α, yβ), SIM₁(xα, β), SIM₁(α, β))   (4)

Equations (3) and (4) are evaluated efficiently by a known DP technique.
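Equations (1) to (4) can be evaluated bottom-up with a table, as in the following Python sketch. It is one possible reading rather than the cited reference's own implementation: the function name `sim1` is hypothetical, and the sketch assumes that the similarity is 0.0 whenever either argument is empty, a case the equations above do not spell out.

```python
def sim1(a: str, b: str) -> float:
    """Additive DP similarity in the spirit of equations (1) to (4)."""
    n, m = len(a), len(b)
    # table[i][j] holds SIM1(a[i:], b[j:]).
    # Assumption: the last row and column (one argument empty) stay 0.0.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            # Equation (3) adds 1.0 on a match; equation (4) adds nothing.
            gain = 1.0 if a[i] == b[j] else 0.0
            table[i][j] = max(
                table[i + 1][j],             # drop the first character of a
                table[i][j + 1],             # drop the first character of b
                gain + table[i + 1][j + 1],  # drop the first character of both
            )
    return table[0][0]


print(sim1("retrieval", "retrieve"))
```

The table has (len(a) + 1) × (len(b) + 1) entries and each entry is filled in constant time, which is what makes the DP evaluation efficient compared with naive recursion.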

[0012] The method that assigns weights to characters is disclosed, for example, in H. Berghel and D. Roach, An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm, ACM Transactions on Information Systems, Vol.14, No.1, pp.94-106, January 1996. The similarity between character strings under this method (hereinafter, the character weight DP similarity) is obtained as follows. Let α and β be character strings, and let x and y be distinct character strings of length 1. "" denotes the empty string. Score(x) is a function that returns a positive real value for the character string x.

[0013] The character weight DP similarity SIM₂ is obtained by recursively applying the following equations according to how the leading characters of the arguments match. When both arguments are empty strings:

SIM₂("", "") = 0.0   (5)

When the arguments are distinct character strings of length 1:

SIM₂(x, y) = 0.0   (6)

When the first characters are the same:

SIM₂(xα, xβ) = MAX(SIM₂(α, xβ), SIM₂(xα, β), Score(x) + SIM₂(α, β))   (7)

When the first characters differ:

SIM₂(xα, yβ) = MAX(SIM₂(α, yβ), SIM₂(xα, β), SIM₂(α, β))   (8)

As with the additive DP similarity, equations (7) and (8) can be evaluated efficiently by a known DP technique.

[0014] The Score(x) function used here gives the weight of a character string x of length 1 (a character). As a comparison of equations (3) and (7) shows, the distinguishing feature of the character weight DP similarity is that it weights each character through the Score(x) function.
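The only change relative to the additive case is that a matched character contributes Score(x) instead of 1.0. A minimal Python sketch of equations (5) to (8) follows; the name `sim2`, the example weight table, and the treatment of a single empty argument as similarity 0.0 are assumptions made for illustration.

```python
from typing import Callable


def sim2(a: str, b: str, score: Callable[[str], float]) -> float:
    """Character weight DP similarity in the spirit of equations (5) to (8)."""
    n, m = len(a), len(b)
    # table[i][j] holds SIM2(a[i:], b[j:]).
    # Assumption: entries with one empty argument stay 0.0.
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            # Equation (7) adds Score(x) on a match; equation (8) adds nothing.
            gain = score(a[i]) if a[i] == b[j] else 0.0
            table[i][j] = max(
                table[i + 1][j],
                table[i][j + 1],
                gain + table[i + 1][j + 1],
            )
    return table[0][0]


# Hypothetical per-character weights; unlisted characters weigh 1.0.
char_weights = {"a": 2.0, "b": 0.5}


def example_score(ch: str) -> float:
    return char_weights.get(ch, 1.0)


print(sim2("ab", "ab", example_score))  # both characters match in order
print(sim2("ab", "ba", example_score))  # only one character can match; the heavier one wins
```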

[0015]

[Problems to be Solved by the Invention] The method using morphological analysis has the problem that retrieval does not work well when frequently used words are used. Such words occur commonly in many character strings, so their mere presence or absence is not decisive in selecting a document. A further problem is that the order in which words appear is not reflected in the retrieval.

[0016] Moreover, because the method relies on morphological analysis, it has the inherent limitation that retrieval fails whenever the morphological analysis fails. Raising the accuracy of morphological analysis inevitably requires large word dictionaries and grammar rules, which makes it hard to use information retrieval easily. In addition, for documents containing buzzwords, coined words, or technical terms used only in narrow fields, maintaining the word dictionary becomes a heavy burden.

[0017] The n-gram method, by contrast, requires no morphological analysis and is therefore easy to use. However, because the search places no emphasis on words, even a slight mismatch between character strings, such as a difference in the inflected form of a word, prevents a match, and it is difficult to perform a search that accurately captures the content.

[0018] Of the two dynamic programming methods mentioned above, the simple-addition method compares the character strings one character at a time and adds 1 to the score whenever a character matches. The character-weighting method likewise compares the strings one character at a time and adds to the score on each match, but it adds a weight that depends on the character rather than 1. This method can be regarded as an extension of the simple-addition method.

[0019] In both methods, a value is added to the similarity each time a single character matches, without regard to whether the neighboring characters also match. The continuity of the matched characters is therefore not taken into account, and in some cases a completely different character string may be judged to be a perfectly matching one.
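The effect described in [0019] can be reproduced with a direct recursive transcription of the simple-addition recurrences. The sketch below is illustrative only: the function name `additive_similarity` and the example strings are made up, and an empty argument is assumed to give similarity 0.0.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def additive_similarity(a: str, b: str) -> float:
    """Simple-addition similarity: 1.0 per matched character, order preserved."""
    # Assumption: an empty argument yields 0.0.
    if not a or not b:
        return 0.0
    gain = 1.0 if a[0] == b[0] else 0.0
    return max(
        additive_similarity(a[1:], b),
        additive_similarity(a, b[1:]),
        gain + additive_similarity(a[1:], b[1:]),
    )


contiguous = additive_similarity("search", "search")
scattered = additive_similarity("search", "s-e-a-r-c-h")
print(contiguous, scattered)
```

Both pairs receive the same score, because each of the six characters is matched individually and nothing rewards the fact that they are adjacent in the first pair.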

[0020] The present invention was made to solve these problems, and its object is to solve them by assigning similarity to character strings rather than to individual characters.

[0021]

[Means for Solving the Problems] To achieve this object, the first invention is a character string similarity calculation method for calculating the similarity between two character strings, characterized by obtaining a plurality of partial character strings that are common to the two character strings and consistent with the order in each of them, determining a weight for each of the partial character strings thus obtained, and calculating the similarity by summing the weights.

[0022] The second invention is the character string similarity calculation method of the first invention, characterized in that the weight of a partial character string may be heavier than the sum of the weights of the partial character strings obtained by dividing it into two or more parts.

[0023] The third invention is the character string similarity calculation method of the first or second invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the partial character string in the document database.

[0024] The fourth invention is the character string similarity calculation method of any one of the first to third inventions, characterized in that the weight corresponds to the amount of information of the partial character string in the document database and to the degree of concentration of its occurrences.

[0025] The fifth invention is the character string similarity calculation method of any one of the first to fourth inventions, characterized in that the plurality of partial character strings are obtained, without being divided, so that the similarity becomes highest.

[0026] The sixth invention is a character string similarity calculation method for calculating the similarity between two character strings, characterized by obtaining a plurality of partial character strings that are consistent with the order in each of the two character strings and are contained in elements of a synonym dictionary, determining a weight for each synonym dictionary element corresponding to the partial character strings thus obtained, and calculating the similarity by summing the weights.

[0027] The seventh invention is the character string similarity calculation method of the sixth invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the synonym dictionary element in the document database.

[0028] The eighth invention is the character string similarity calculation method of the sixth or seventh invention, characterized in that the two character strings are written in different languages and the elements of the synonym dictionary include synonyms in those different languages.

[0029] The ninth invention is the character string similarity calculation method of any one of the first to fourth and sixth to eighth inventions, characterized in that the plurality of partial character strings are obtained, with division of partial character strings permitted, so that the similarity becomes highest.

[0030] The tenth invention is a character string similarity calculation device for calculating the similarity between two character strings, characterized by comprising: a character string score calculation unit that determines the weight of a partial character string common to the two character strings; a matching character string similarity calculation unit that obtains a similarity by adding the similarity of the remaining partial character strings to the weight; an arbitrary character string similarity calculation unit that obtains the highest of the similarities between the character strings obtained by removing one character from either or both of the two character strings; and a selection unit that selects the highest of the similarities thus obtained.

[0031] The eleventh invention is the character string similarity calculation device of the tenth invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the partial character string in the document database.

[0032] The twelfth invention is the character string similarity calculation device of the eleventh invention, characterized in that the weight corresponds to the amount of information of the partial character string in the document database and to the degree of concentration of its occurrences.

[0033] The thirteenth invention is a character string similarity calculation device for calculating the similarity between two character strings, characterized by comprising: a synonym score calculation unit that determines the weight of a synonym dictionary element corresponding to a partial character string of the two character strings that is contained in an element of the synonym dictionary; a matching character string similarity calculation unit that obtains a similarity by adding the similarity of the remaining partial character strings to the weight; an arbitrary character string similarity calculation unit that obtains the highest of the similarities between the character strings obtained by removing one character from either or both of the two character strings; and a selection unit that selects the highest of the similarities thus obtained.

[0034] The fourteenth invention is the character string similarity calculation device of the thirteenth invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the synonym dictionary element in the document database.

[0035] The fifteenth invention is a document search device that selects, from a document database, documents similar to a search text, characterized in that it treats the search text and a document in the document database as the two character strings, obtains their similarity with the character string similarity calculation device of any one of the tenth to fourteenth inventions, and selects from the document database the documents for which the similarity thus obtained is high.
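The retrieval step of [0035] amounts to scoring every document against the search text and keeping the highest-scoring ones. The Python sketch below shows that ranking loop only. The function name `search`, the `top_k` parameter, and the stand-in similarity `shared_characters` are all hypothetical; any similarity function taking two strings could be passed in its place.

```python
from typing import Callable, List, Sequence


def search(
    query: str,
    documents: Sequence[str],
    similarity: Callable[[str, str], float],
    top_k: int = 3,  # assumption: how many documents to return
) -> List[str]:
    """Return the top_k documents ordered by descending similarity to the query."""
    ranked = sorted(documents, key=lambda doc: similarity(query, doc), reverse=True)
    return ranked[:top_k]


def shared_characters(a: str, b: str) -> float:
    """Stand-in similarity for the demo only: number of distinct shared characters."""
    return float(len(set(a) & set(b)))


print(search("abc", ["xyz", "abx", "abc"], shared_characters, top_k=2))
```

Keeping the similarity as a parameter separates the ranking logic from the choice of similarity measure, so the same loop works with any of the measures discussed in this document.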

[0036] The sixteenth invention is a computer-readable recording medium recording a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to execute: a comparison setting step of successively setting the positions at which the two character strings are compared; a common partial character string identification step of identifying a partial character string that starts at the compared position and is common to the two character strings; a character string score setting step of determining the weight of the identified partial character string; a character string score addition step of adding the weight to the similarity; and a progression step of advancing these steps so that the similarity becomes high.

[0037] The seventeenth invention is a computer-readable recording medium recording the character string similarity calculation program of the sixteenth invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the partial character string in the document database.

[0038] The eighteenth invention is a computer-readable recording medium recording the character string similarity calculation program of the seventeenth invention, characterized in that the weight corresponds to the amount of information of the partial character string in the document database and to the degree of concentration of its occurrences.

[0039] The nineteenth invention is a computer-readable recording medium recording the character string similarity calculation program of the seventeenth or eighteenth invention, characterized in that the character string score setting step obtains the amount of information using a suffix file.

[0040] The twentieth invention is a computer-readable recording medium recording a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to execute: a comparison setting step of successively setting the positions at which the two character strings are compared; a synonym identification step of identifying a partial character string that starts at the compared position and is contained in an element of a synonym dictionary; a synonym score setting step of determining the weight of the synonym dictionary element corresponding to the identified partial character string; a synonym score addition step of adding the weight to the similarity; and a progression step of advancing these steps so that the similarity becomes high.

[0041] The twenty-first invention is a computer-readable recording medium recording the character string similarity calculation program of the twentieth invention, characterized in that one of the two character strings is selected from a document database and the weight corresponds to the amount of information of the synonym dictionary element in the document database.

[0042] The twenty-second invention is a computer-readable recording medium recording the character string similarity calculation program of the twenty-first invention, characterized in that the synonym score setting step obtains the amount of information using a suffix file.

[0043] The twenty-third invention is a computer-readable recording medium recording the character string similarity calculation program of any one of the sixteenth to twenty-second inventions, characterized in that the progression step uses a dynamic programming technique.

[0044] The twenty-fourth invention is a computer-readable recording medium recording a document search program for selecting, from a document database, documents similar to a search text, the program causing a computer to treat the search text and a document in the document database as the two character strings, obtain their similarity with the character string similarity calculation program of any one of the sixteenth to twenty-third inventions, and select from the document database the documents for which the similarity thus obtained is high.

[0045]

[Embodiments of the Invention] (First Embodiment)

[0046] First, the equations for obtaining the first similarity according to the present invention (hereinafter, the character string weight DP similarity) are described.

[0047] Let α, β, γ, and δ be character strings of length 0 or more, let ξ be a character string of length 1 or more, and let "" be the empty string. Let x, y, and z be character strings of length 1. A character string (for example, ξ) formed by concatenating several character strings (for example, γ and δ) is written by placing the symbols of the component strings side by side (for example, ξ = γδ).

[0048] The character string weight DP similarity SIM₃ is obtained by recursively applying the following equations according to how the leading parts of the arguments match. When both arguments are empty strings:

SIM₃("", "") = Score("")   (9)

Otherwise:

SIM₃(α, β) = MAX(SIM_3s(α, β), SIM_3g(α, β))   (10)

[0049] Here, taking ξ to be the longest matching character string, SIM_3s is determined so that

SIM_3s(ξα, ξβ) = MAX(Score(γ) + SIM₃(δα, δβ))   (11)

holds over all γ and δ satisfying ξ = γδ. When no such character string ξ exists:

SIM_3s(α, β) = 0.0   (12)

[0050] SIM_3g is defined for arbitrary character strings by

SIM_3g(xα, yβ) = MAX(SIM₃(α, yβ), SIM₃(xα, β), SIM₃(α, β))   (13)

This equation means that, among the similarities between the character strings left after removing one character from either or both of the two character strings xα and yβ (that is, the pairs α and yβ, xα and β, and α and β), the highest similarity is adopted.

【００５１】By applying the above equations recursively, multiple common substrings that respect the order of each of the two character strings are found, and the similarity is maximized. As equation (11) shows, a common substring may itself be split further. Compared with the character weight DP similarity described as prior art, the distinguishing feature of the character string weight DP similarity of this embodiment is that the domain of the Score function, which gives the weight of a character string (hereinafter, the character string score), is not single characters (strings of length 1) but character strings of length 1 or more.
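Equations (9) to (13) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `sim3`, the memoization over suffix positions, and the placeholder weight used in the usage note are assumptions made here. Two further assumptions fill in cases the equations leave implicit: γ in equation (11) is taken to be non-empty (an empty γ would make the recursion refer to itself), and the similarity is 0.0 as soon as either string is exhausted, since nothing can match from that point.

```python
from functools import lru_cache
from typing import Callable


def sim3(a: str, b: str, score: Callable[[str], float]) -> float:
    """Character string weight DP similarity, following equations (9)-(13)."""

    @lru_cache(maxsize=None)
    def sim(i: int, j: int) -> float:
        # Assumption: once either string is exhausted, nothing more can match.
        if i == len(a) or j == len(b):
            return 0.0
        # Length k of the longest common prefix (xi) of a[i:] and b[j:].
        k = 0
        while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
            k += 1
        # Equations (11) and (12): try every non-empty prefix gamma of xi.
        best_s = 0.0
        for n in range(1, k + 1):
            best_s = max(best_s, score(a[i:i + n]) + sim(i + n, j + n))
        # Equation (13): drop one character from either or both strings.
        best_g = max(sim(i + 1, j), sim(i, j + 1), sim(i + 1, j + 1))
        # Equation (10).
        return max(best_s, best_g)

    return sim(0, 0)
```

With the placeholder weight `lambda s: float(len(s) ** 2)`, the call `sim3("abxcd", "abycd", ...)` evaluates to 8.0: the common runs "ab" and "cd" each contribute 4.0. Python's recursion limit restricts this sketch to short inputs.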

【００５２】Let N be the total number of documents, df(ξ) the number of documents that contain the character string ξ, and L the length of ξ. The function Score(ξ), which assigns a weight to a character string, can be implemented as follows. When ξ is the empty string,

$$\mathrm{Score}(\texttt{""}) = 0.0 \tag{14}$$

and when ξ is not the empty string,

$$\mathrm{Score}(\xi) = L \times \mathrm{idf}(\xi) = L \times \log_2(N / \mathrm{df}(\xi)) \tag{15}$$

【００５３】Here, idf(ξ) corresponds to what information theory calls the amount of information of ξ. That is, it corresponds to log₂(N／df(ξ)), the number of bits needed to single out df(ξ) documents from a total of N documents. This reflects the property that a character string appearing in many documents is of little use for retrieval, whereas a character string appearing in few documents is useful.
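As a rough illustration of equations (14) and (15), the sketch below computes the score over a small in-memory list of documents. The names `doc_freq` and `idf_score`, the use of substring containment to decide whether a document contains ξ, and the choice to return 0.0 when df(ξ) = 0 (a case equation (15) does not define) are assumptions made for this example.

```python
import math
from typing import Sequence


def doc_freq(docs: Sequence[str], s: str) -> int:
    """df(s): number of documents that contain s as a substring."""
    return sum(1 for d in docs if s in d)


def idf_score(docs: Sequence[str], s: str) -> float:
    """Score(s) = L * log2(N / df(s)), with Score("") = 0.0."""
    if s == "":
        return 0.0                      # equation (14)
    n = doc_freq(docs, s)
    if n == 0:
        return 0.0                      # assumption: unseen strings carry no weight
    return len(s) * math.log2(len(docs) / n)   # equation (15)
```

For the four documents `["abc", "abd", "xyz", "ab"]`, the string "abc" occurs in one document, so its score is 3 × log₂(4/1) = 6.0, while "ab" occurs in three and scores only 2 × log₂(4/3).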

【００５４】The Score function is not limited to the above equation. Any non-negative function of a character string can be used, provided that its value tends to grow faster than simple addition as the string gets longer. That is, for the string zγ formed by concatenating two strings z and γ, it suffices that

$$\mathrm{Score}(z\gamma) > \mathrm{Score}(z) + \mathrm{Score}(\gamma) \tag{16}$$

holds. In other words, the weight of a substring (for example, Score(zγ)) should be heavier than the sum of the weights of the substrings obtained by splitting it into two or more parts (for example, Score(z) + Score(γ) when it is split in two). Because inequality (16) holds, the character string weight DP similarity has the property that the longer the matching strings, the higher the similarity.

【００５５】More precisely, the Score function is evaluated many times while a character string weight DP similarity is computed. The invention can be implemented as long as equation (16) holds as an overall property of the Score values that contribute to the final character string weight DP similarity. Individual cases in which inequality (16) does not hold may therefore be included.

【００５６】For example, the short strings z and γ generally appear in documents more often than the long string zγ, so equation (16) holds. However, when z, γ, and zγ all appear in documents the same number of times,

$$\mathrm{Score}(z\gamma) = \mathrm{Score}(z) + \mathrm{Score}(\gamma) \tag{17}$$
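As a check on the relation between equations (16) and (17), the gap between the two sides can be written out under the assumptions that Score is given by equation (15), that z has length 1 and γ has length m ≥ 1 (the symbol m is introduced here only for illustration), and that df(zγ) ≥ 1:

$$\begin{aligned}
\mathrm{Score}(z\gamma) - \mathrm{Score}(z) - \mathrm{Score}(\gamma)
&= (1+m)\log_2\frac{N}{\mathrm{df}(z\gamma)} - \log_2\frac{N}{\mathrm{df}(z)} - m\log_2\frac{N}{\mathrm{df}(\gamma)} \\
&= \log_2\frac{\mathrm{df}(z)}{\mathrm{df}(z\gamma)} + m\log_2\frac{\mathrm{df}(\gamma)}{\mathrm{df}(z\gamma)}
\end{aligned}$$

Every document that contains zγ also contains z and γ, so df(zγ) ≤ df(z) and df(zγ) ≤ df(γ), and both terms are non-negative. The gap is zero exactly when df(z) = df(γ) = df(zγ), which is the situation of equation (17).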

【００５７】Also, when a long string zγ appears only rarely in the documents (for example, once), it may not be a word and may not help retrieval. In such a case, setting Score(zγ) = 0.0 is beneficial for retrieval. If the strings z and γ nevertheless appear frequently in the documents, then

$$\mathrm{Score}(z\gamma) < \mathrm{Score}(z) + \mathrm{Score}(\gamma) \tag{18}$$

【００５８】Thus, even though cases in which equation (17) or (18) holds are included, equation (16) holds in general, so the invention can be implemented with equation (15).

【００５９】The Score function may also be made to depend on both the amount of information and the appearance concentration of a character string. For example, as a value indicating how strongly a character string concentrates in particular documents of the database, the appearance concentration W(ξ) is defined as

$$W(\xi) = \frac{\mathrm{df}_2(\xi) / \mathrm{df}_1(\xi)}{\mathrm{df}_1(\xi) / N} \tag{36}$$

except that when df₁(ξ) = 0,

$$W(\xi) = 0.0 \tag{37}$$

Then, in place of Score(ξ) of equation (15), the invention can also be implemented by setting, when W(ξ) ≥ K,

$$\mathrm{Score}(\xi) = \mathrm{idf}(\xi) \tag{38}$$

and, when W(ξ) < K,

$$\mathrm{Score}(\xi) = 0.0 \tag{39}$$

Here, K is a positive real constant (for example, 2.0), df₁(ξ) is, like df(ξ), the number of documents that contain the character string ξ, and df₂(ξ) is the number of documents that contain ξ two or more times.

【００６０】The appearance concentration expresses how strongly a given character string concentrates in the documents in which it appears. For an ordinary character string, the value of equation (36) is close to 1.0. By contrast, a character string that is effective for retrieval (a keyword, a string expressing the subject matter, or a technical term in the case of a technical paper) has a high probability of appearing two or more times in a document once it has appeared there at all. The value of equation (36) for such a string is therefore considerably larger than 1.0. Accordingly, by switching between equations (38) and (39) according to the appearance concentration, strings that are more effective for retrieval contribute more to Score, and retrieval accuracy can be improved. This is an example in which the Score function is obtained by adding the amount of information according to the appearance concentration of a substring in the document database.
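Equations (36) to (39) can be tried out on a toy corpus with the sketch below. The function names, the use of `str.count` (which counts non-overlapping occurrences) to decide whether a document contains ξ two or more times, and the treatment of the empty string as scoring 0.0 are assumptions made for illustration.

```python
import math
from typing import Sequence


def concentration(docs: Sequence[str], s: str) -> float:
    """Appearance concentration W(s) of equations (36) and (37); s is non-empty."""
    df1 = sum(1 for d in docs if s in d)            # documents containing s
    if df1 == 0:
        return 0.0                                  # equation (37)
    # Assumption: repeated occurrences are counted without overlap.
    df2 = sum(1 for d in docs if d.count(s) >= 2)   # documents containing s twice or more
    return (df2 / df1) / (df1 / len(docs))          # equation (36)


def score_by_concentration(docs: Sequence[str], s: str, k: float = 2.0) -> float:
    """Score(s) = idf(s) if W(s) >= K, else 0.0 (equations (38) and (39))."""
    if s == "" or concentration(docs, s) < k:
        return 0.0
    df1 = sum(1 for d in docs if s in d)
    return math.log2(len(docs) / df1)
```

In the corpus `["kw kw", "a", "b", "c"]` the string "kw" appears in one document and twice within it, so W = (1/1)/(1/4) = 4.0 and the score is idf = log₂(4) = 2.0. In `["dp dp", "dp", "x", "y"]` the string "dp" gives W = (1/2)/(2/4) = 1.0, which is below K = 2.0, so the score is 0.0.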

【００６１】The way in which the amount of information and the appearance concentration are mapped to the Score function is not limited to the above example, and various modifications are possible. As another embodiment that uses the appearance concentration, equations (38) and (39) can be replaced as follows. When W(ξ) ≥ 1,

$$\mathrm{Score}(\xi) = \log_2(W(\xi)) \times \mathrm{idf}(\xi) \tag{40}$$

and when W(ξ) < 1,

$$\mathrm{Score}(\xi) = 0.0 \tag{41}$$

In this way the Score function is determined by the values of idf(ξ) and W(ξ). As a result, strings that are more effective for retrieval contribute more to Score, and retrieval accuracy can be improved.
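The variant of equations (40) and (41) is a pure function of W(ξ) and idf(ξ), which the following sketch makes explicit. The function name and the decision to take the two values as already-computed arguments are assumptions made for this example.

```python
import math


def score_log_concentration(w: float, idf: float) -> float:
    """Score = log2(W) * idf when W >= 1, else 0.0 (equations (40) and (41))."""
    if w < 1.0:
        return 0.0                 # equation (41)
    return math.log2(w) * idf      # equation (40)
```

Unlike the threshold rule of equations (38) and (39), this form grows smoothly with W: a string with W = 4.0 and idf = 2.0 scores log₂(4) × 2.0 = 4.0, while a string with W = 1.0 scores 0.0.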

【００６２】The description so far assumes that the value indicating similarity is 0 or more and that a larger value means a higher similarity. If the correspondence between the similarity value and the degree of similarity is changed, the Score function is no longer restricted to non-negative functions. For example, if a smaller value is taken to mean a higher similarity, the invention can be implemented with a Score function obtained by inverting the sign of equation (15). In that case the inequality sign in equation (16) is reversed, and the function MIN, which returns the smallest of its arguments, is used in place of the function MAX. Although the mode of implementation thus depends on the nature of the similarity value, the following description continues to use, as the representative case, a similarity value that is 0 or more and for which a larger value means a higher similarity.
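The sign-inverted variant can be illustrated by parameterizing the recursion of equations (9) to (13) over the selection function. The sketch below is an assumption-laden illustration rather than the patent's implementation: the name `dp_similarity`, the `pick` parameter, the requirement that γ be non-empty, the return of 0.0 once either string is exhausted, and the placeholder weights in the usage note are all introduced here.

```python
from functools import lru_cache
from typing import Callable, Iterable


def dp_similarity(a: str, b: str,
                  score: Callable[[str], float],
                  pick: Callable[[Iterable[float]], float]) -> float:
    """Recursion of equations (9)-(13) with MAX or MIN supplied as `pick`."""

    @lru_cache(maxsize=None)
    def sim(i: int, j: int) -> float:
        if i == len(a) or j == len(b):
            return 0.0                     # assumption: nothing left to match
        k = 0
        while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
            k += 1
        # Matched branch: every non-empty split point, or 0.0 if no match (12).
        matched = [score(a[i:i + n]) + sim(i + n, j + n) for n in range(1, k + 1)]
        if not matched:
            matched = [0.0]
        # Skip branch of equation (13).
        skipped = [sim(i + 1, j), sim(i, j + 1), sim(i + 1, j + 1)]
        return pick(matched + skipped)

    return sim(0, 0)
```

Calling it with a non-negative weight and `max` reproduces the similarity described above; calling it with the negated weight and `min` yields the same magnitude with the opposite sign, so smaller values then mean more similar strings.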

【００６３】Next, FIG. 1 shows an embodiment of a document search apparatus that uses the character string weight DP similarity to retrieve the document most similar to an input character string. This document search apparatus comprises a document database 10, a character string input unit 11, a search control unit 12, a search result output unit 13, a similarity calculation unit 14, and a recursive execution control unit 15.

【００６４】A plurality of documents 10a, 10b, ..., 10c to be searched are registered in the document database 10. To perform a search, a keyword, word, phrase, sentence, passage, or the like is input (hereinafter collectively called the search text). The character string input unit 11 passes the search text to the similarity calculation unit 14 as character string a. The search control unit 12 takes out the documents 10a, 10b, ..., 10c one at a time and passes each to the similarity calculation unit 14 as character string b. The search result output unit 13 selects the document with the highest similarity obtained from the similarity calculation unit 14 and outputs it as the search result. The search result output unit 13 need not output only the single most similar document; it may instead output all documents whose similarity is at or above a predetermined value, or a predetermined number of documents chosen in descending order of similarity.
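The three output policies described for the search result output unit 13 (best document only, all documents above a threshold, or a fixed number of top documents) can be sketched as below. The function `search`, its parameters, and the stand-in similarity `shared_chars` are hypothetical names used only for this example; `shared_chars` is a deliberately trivial placeholder and is not the similarity defined in this document.

```python
from typing import Callable, List, Optional, Tuple


def search(query: str,
           docs: List[str],
           sim: Callable[[str, str], float],
           threshold: Optional[float] = None,
           top_k: Optional[int] = None) -> List[Tuple[float, str]]:
    """Score every document against the query and select the results."""
    # Each document in turn is paired with the query and scored.
    ranked = sorted(((sim(query, d), d) for d in docs),
                    key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        return [p for p in ranked if p[0] >= threshold]   # all above a value
    if top_k is not None:
        return ranked[:top_k]                              # fixed number of hits
    return ranked[:1]                                      # single best document


def shared_chars(a: str, b: str) -> float:
    """Placeholder similarity: number of distinct characters in common."""
    return float(len(set(a) & set(b)))
```

Any function of two strings that returns a larger value for more similar strings can be passed as `sim`, so the selection logic stays independent of how the similarity is computed.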

【００６５】The similarity calculation unit 14 calculates the similarity based on equation (9) or (10). During this calculation, similarities must be obtained in the same way for portions of the character strings. This is done by having the recursive execution control unit 15 apply the similarity calculation unit 14 repeatedly. Because the description follows the notation of equations (10) to (13), FIG. 1 in places reassigns different symbols to the same character string. For example, the character strings a and b that the similarity calculation unit 14 first receives are denoted α and β inside the similarity calculation unit 14 (the same applies below). As the character strings are split, the similarity calculation unit 14 is applied repeatedly, and the character strings α and β it receives on those occasions differ from the character strings a and b it received first.

【００６６】The similarity calculation unit 14 comprises a matching character string similarity calculation unit 21, an arbitrary character string similarity calculation unit 22, and a maximum value selection unit 23. The matching character string similarity calculation unit 21 calculates SIM_3s(α, β) of equation (10). The arbitrary character string similarity calculation unit 22 calculates SIM_3g(α, β) of equation (10). The maximum value selection unit 23 applies the function MAX to these two values to obtain SIM₃(α, β) of equation (10). When both character strings α and β received by the similarity calculation unit 14 are empty, the recursive execution control unit 15 sets SIM₃(α, β) = 0.0. In that case the matching character string similarity calculation unit 21, the arbitrary character string similarity calculation unit 22, and the maximum value selection unit 23 are not operated. Needless to say, this value SIM₃(α, β) = 0.0 implements equations (9) and (14).

【００６７】The matching character string similarity calculation unit 21 comprises a character string separation control unit 31, a character string separation similarity calculation unit 32, and a maximum value selection unit 33, and calculates SIM_3s(ξα, ξβ) of equation (11). The character string separation similarity calculation unit 32 comprises a character string separation unit 41, a character string score calculation unit 42, a similarity calculation unit 43, and an addition unit 44.

【００６８】First, when the character strings ξα and ξβ received by the matching character string similarity calculation unit 21 have no matching character string ξ, that is, when the matching string ξ is the empty string, the character string separation control unit 31 sets SIM_3s(ξα, ξβ) = 0.0 in accordance with equation (12). In this case the character string separation similarity calculation unit 32 and the maximum value selection unit 33 are not operated.

【００６９】Next, when the character strings ξα and ξβ received by the matching character string similarity calculation unit 21 do have a matching character string ξ, the character string separation control unit 31 operates the character string separation similarity calculation unit 32 for every γ and δ such that ξ = γδ, causing it to compute Score(γ) + SIM₃(δα, δβ) of equation (11). The maximum value selection unit 33 then selects the largest of these values. This yields SIM_3s(ξα, ξβ) of equation (11).

【００７０】The character string separation unit 41 splits the character string ξ into γ and δ, passes γ to the character string score calculation unit 42, and passes δα and δβ to the similarity calculation unit 43. The character string score calculation unit 42 calculates Score(γ) of equation (11) based on equation (15), referring to the document database 10. The similarity calculation unit 43 calculates SIM₃(δα, δβ) of equation (11). In practice, the similarity calculation unit 43 is realized by having the recursive execution control unit 15 apply the similarity calculation unit 14 to δα and δβ. The addition unit 44 performs the addition in equation (11).

【００７１】The arbitrary character string similarity calculation unit 22 comprises similarity calculation units 51 to 53 and a maximum value selection unit 54, and calculates SIM_3g(xα, yβ) of equation (13). Corresponding to the cases in which the leading characters x and y of the received character strings xα and yβ are kept or dropped, the similarity calculation units 51, 52, and 53 obtain SIM₃(α, yβ), SIM₃(xα, β), and SIM₃(α, β) of equation (13), respectively. In practice, the similarity calculation units 51 to 53 are realized by having the recursive execution control unit 15 apply the similarity calculation unit 14 to each of the pairs α and yβ, xα and β, and α and β. The maximum value selection unit 54 implements the function MAX of equation (13).

【００７２】The invention can also be implemented by having the character string score calculation unit 42 calculate Score(γ) using equations (36) to (39), or equation (36) together with equations (40) and (41), in place of equation (15).

【００７３】(Second Embodiment)

【００７４】The first embodiment uses the function MAX, which finds the maximum value, but it can be modified to use some other function that selects a representative value. For example, if Score(ξ) has the property that the contribution of the longest character string determines the overall value, the invention can be implemented by recursively applying the equations below to obtain the similarity. The Score function of equation (15) has this property and can therefore also be used in this embodiment. Equation (21) below omits the function MAX by letting Score(ξ) for the longest matching character string ξ stand in for Score(γ) of equation (11). The feature of this embodiment is that, when there is a run of consecutively matching characters ξ, the weight Score(ξ) is assigned without splitting ξ any further. This embodiment therefore requires less computation than the first embodiment, and the similarity SIM_3a is an approximation of SIM₃ of the first embodiment. In other words, compared with SIM₃, the similarity SIM_3a involves some error but has the advantage of greatly reducing the amount of computation.

【００７５】First, when both arguments are the empty string,

$$\mathrm{SIM}_{3a}(\texttt{""}, \texttt{""}) = \mathrm{Score}(\texttt{""}) \tag{19}$$

Otherwise,

$$\mathrm{SIM}_{3a}(\alpha, \beta) = \mathrm{MAX}(\mathrm{SIM}_{3as}(\alpha, \beta),\ \mathrm{SIM}_{3ag}(\alpha, \beta)) \tag{20}$$

【００７６】Here, with ξ denoting the longest matching character string, SIM_3as is obtained as

$$\mathrm{SIM}_{3as}(\xi\alpha, \xi\beta) = \mathrm{Score}(\xi) + \mathrm{SIM}_{3a}(\alpha, \beta) \tag{21}$$

and when no such character string ξ exists,

$$\mathrm{SIM}_{3as}(\alpha, \beta) = 0.0 \tag{22}$$

【００７７】SIM_3ag is defined for arbitrary character strings as

$$\mathrm{SIM}_{3ag}(x\alpha, y\beta) = \mathrm{MAX}(\mathrm{SIM}_{3a}(\alpha, y\beta),\ \mathrm{SIM}_{3a}(x\alpha, \beta),\ \mathrm{SIM}_{3a}(\alpha, \beta)) \tag{23}$$

By applying the above equations recursively, multiple common substrings that respect the order of each of the two character strings are found, and the similarity is maximized. Unlike the first embodiment, however, a common substring is never split further in this embodiment, as equation (21) shows.
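Equations (19) to (23) differ from the first embodiment only in the matched branch, which the following sketch shows. As before, this is an illustration under stated assumptions: the name `sim3a`, the memoization over suffix positions, the return of 0.0 once either string is exhausted, and the placeholder weight in the usage note are introduced here and are not part of the source.

```python
from functools import lru_cache
from typing import Callable


def sim3a(a: str, b: str, score: Callable[[str], float]) -> float:
    """Approximate similarity SIM_3a, following equations (19)-(23)."""

    @lru_cache(maxsize=None)
    def sim(i: int, j: int) -> float:
        if i == len(a) or j == len(b):
            return 0.0                     # assumption: nothing left to match
        # Longest common prefix (xi) of a[i:] and b[j:].
        k = 0
        while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
            k += 1
        # Equations (21) and (22): score xi as a whole, with no further split.
        matched = score(a[i:i + k]) + sim(i + k, j + k) if k > 0 else 0.0
        # Equation (23): drop one character from either or both strings.
        skipped = max(sim(i + 1, j), sim(i, j + 1), sim(i + 1, j + 1))
        return max(matched, skipped)       # equation (20)

    return sim(0, 0)
```

The matched branch evaluates a single candidate per position instead of one candidate per split point, which is where the reduction in computation comes from. With the placeholder weight `lambda s: float(len(s) ** 2)`, `sim3a("abxcd", "abycd", ...)` evaluates to 8.0.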

【００７８】FIG. 2 shows an embodiment of a document search apparatus that uses the similarity SIM_3a to retrieve the document most similar to an input character string. This document search apparatus comprises a document database 10, a character string input unit 11, a search control unit 12, a search result output unit 13, a similarity calculation unit 16, and a recursive execution control unit 17. The parts labeled 10 to 13 in the figure have the same functions and configurations as the parts with the same reference numerals in FIG. 1, so their description is omitted.

【００７９】The similarity calculation unit 16 calculates the similarity based on equation (19) or (20). During this calculation, similarities must be obtained in the same way for portions of the character strings. This is done by having the recursive execution control unit 17 apply the similarity calculation unit 16 repeatedly. Because the description follows the notation of equations (20) to (23), FIG. 2, like FIG. 1, in places reassigns different symbols to the same character string.

【００８０】The similarity calculation unit 16 comprises a matching character string similarity calculation unit 24, an arbitrary character string similarity calculation unit 25, and a maximum value selection unit 26. The matching character string similarity calculation unit 24 calculates SIM_3as(α, β) of equation (20). The arbitrary character string similarity calculation unit 25 calculates SIM_3ag(α, β) of equation (20). The maximum value selection unit 26 applies the function MAX to these two values to obtain SIM_3a(α, β) of equation (20). When both character strings α and β received by the similarity calculation unit 16 are empty, the recursive execution control unit 17 sets SIM_3a(α, β) = 0.0. In that case the matching character string similarity calculation unit 24, the arbitrary character string similarity calculation unit 25, and the maximum value selection unit 26 are not operated. Needless to say, this value SIM_3a(α, β) = 0.0 implements equations (19) and (14).

【００８１】The matching character string similarity calculation unit 24 comprises a matching character string determination unit 34, a character string score calculation unit 45, a similarity calculation unit 46, and an addition unit 44.

【００８２】First, when the character strings ξα and ξβ received by the matching character string similarity calculation unit 24 have no matching character string ξ, that is, when the matching string ξ is the empty string, the matching character string determination unit 34 sets SIM_3as(ξα, ξβ) = 0.0 in accordance with equation (22). In this case the character string score calculation unit 45, the similarity calculation unit 46, and the addition unit 47 are not operated.

【００８３】Next, when the character strings ξα and ξβ received by the matching character string similarity calculation unit 24 do have a matching character string ξ, the matching character string determination unit 34 operates the character string score calculation unit 45, the similarity calculation unit 46, and the addition unit 47 to compute Score(ξ) + SIM_3a(α, β) of equation (21). Specifically, the character string score calculation unit 45 calculates Score(ξ) of equation (21) based on equation (15), referring to the document database 10. In practice, the similarity calculation unit 46 is realized by having the recursive execution control unit 17 apply the similarity calculation unit 16 to α and β. The addition unit performs the addition in equation (21). This yields SIM_3as(ξα, ξβ) of equation (21).

【００８４】The arbitrary character string similarity calculation unit 25 comprises similarity calculation units 55 to 57 and a maximum value selection unit 58, and calculates SIM_3ag(xα, yβ) of equation (23). Corresponding to the cases in which the leading characters x and y of the received character strings xα and yβ are kept or dropped, the similarity calculation units 55, 56, and 57 obtain SIM_3a(α, yβ), SIM_3a(xα, β), and SIM_3a(α, β) of equation (23), respectively. In practice, the similarity calculation units 55 to 57 are realized by having the recursive execution control unit 17 apply the similarity calculation unit 16 to each of the pairs α and yβ, xα and β, and α and β. The maximum value selection unit 58 implements the function MAX of equation (23).

【００８５】The invention can also be implemented by having the character string score calculation unit 45 calculate Score(ξ) using equations (36) to (39), or equation (36) together with equations (40) and (41), in place of equation (15).

【００８６】(Third Embodiment)

【００８７】Next, an embodiment of the invention that uses synonym dictionary information is described. Character strings that belong to the same set of character strings listed in the synonym dictionary are regarded as synonymous. As notation, let ξ, η, ..., ζ be synonymous character strings of length 1 or more. That is, the synonym dictionary D has as one of its elements a set T of character strings, and ξ, η, ..., ζ are elements of T. Expressed as a formula,

$$D = \{\{\xi, \eta, \ldots, \zeta\}, \ldots\} \tag{24}$$

【００８８】Let α and β be character strings and "" the empty string. Let x and y be character strings of length 1. Let T be a set of character strings that is an element of the synonym dictionary D.

【００８９】The similarity of this embodiment, SIM₄ (hereinafter, the character string weight DP similarity with synonym dictionary information), is obtained by recursively applying the following equations.

【００９０】First, when both arguments are the empty string,

$$\mathrm{SIM}_4(\texttt{""}, \texttt{""}) = 0.0 \tag{25}$$

Otherwise,

$$\mathrm{SIM}_4(\alpha, \beta) = \mathrm{MAX}(\mathrm{SIM}_{4s}(\alpha, \beta),\ \mathrm{SIM}_{4g}(\alpha, \beta)) \tag{26}$$

【００９１】Here, when synonymous words ξ and η stand at the heads of the two character strings, SIM_4s(α, β) is obtained as

$$\mathrm{SIM}_{4s}(\alpha, \beta) = \mathrm{MAX}(\mathrm{SynonymScore}(T) + \mathrm{SIM}_4(\gamma, \delta)) \tag{27}$$

where the maximum is taken over all ξ and η such that α = ξγ, β = ηδ, ξ ∈ T, and η ∈ T. When there are no such ξ and η,

$$\mathrm{SIM}_{4s}(\alpha, \beta) = 0.0 \tag{28}$$

【００９２】SIM_4g(xα, yβ) is defined for arbitrary character strings as

$$\mathrm{SIM}_{4g}(x\alpha, y\beta) = \mathrm{MAX}(\mathrm{SIM}_4(\alpha, y\beta),\ \mathrm{SIM}_4(x\alpha, \beta),\ \mathrm{SIM}_4(\alpha, \beta)) \tag{29}$$

By applying the above equations recursively, multiple substrings that respect the order of each of the two character strings and that are synonyms of each other are found, and the similarity is maximized.
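Equations (25) to (29) can be sketched along the same lines as the earlier embodiments. The code below is an illustration only: the name `sim4`, the representation of the synonym dictionary D as a list of `frozenset` objects, and the `synonym_score` callback are assumptions. The right-hand side of equation (29) is read here as recursing into SIM₄, by analogy with equations (13) and (23), and the similarity is taken to be 0.0 once either string is exhausted.

```python
from functools import lru_cache
from typing import Callable, FrozenSet, Sequence


def sim4(a: str, b: str,
         synsets: Sequence[FrozenSet[str]],
         synonym_score: Callable[[FrozenSet[str]], float]) -> float:
    """Similarity with synonym dictionary, following equations (25)-(29)."""

    @lru_cache(maxsize=None)
    def sim(i: int, j: int) -> float:
        if i == len(a) or j == len(b):
            return 0.0                      # assumption: nothing left to match
        # Equations (27) and (28): a[i:] starts with xi, b[j:] starts with eta,
        # and both belong to the same synonym set T.
        best_s = 0.0
        for t in synsets:
            for xi in t:
                if not xi or not a.startswith(xi, i):
                    continue
                for eta in t:
                    if eta and b.startswith(eta, j):
                        cand = synonym_score(t) + sim(i + len(xi), j + len(eta))
                        best_s = max(best_s, cand)
        # Equation (29), read as recursing into SIM_4.
        best_g = max(sim(i + 1, j), sim(i, j + 1), sim(i + 1, j + 1))
        return max(best_s, best_g)          # equation (26)

    return sim(0, 0)
```

Note that, as the equations are written, only strings registered in the dictionary contribute to the score: two identical strings that belong to no synonym set add nothing.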

【００９３】The function SynonymScore, which gives a weight that takes the synonym dictionary into account (hereinafter, the synonym score), can be implemented as

$$\mathrm{SynonymScore}(T) = L \times \mathrm{idf}(\xi, \eta, \ldots, \zeta) = L \times \log_2(N / \mathrm{df}(\xi, \eta, \ldots, \zeta)) \tag{30}$$

where df(ξ, η, ..., ζ) is the number of documents that contain ξ, η, ..., or ζ, the character strings ξ, η, ..., ζ are the elements of T, and L is the length of the character string concerned. It can also be implemented with other functions of a set of character strings.
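Equation (30) can be illustrated on a small corpus as follows. The names `synonym_df` and `synonym_score` are hypothetical. Because the text does not pin down which string's length L refers to when the synonyms differ in length, the sketch takes the length as an explicit argument; returning 0.0 when no document contains any member of T is a further assumption.

```python
import math
from typing import Iterable, Sequence


def synonym_df(docs: Sequence[str], synset: Iterable[str]) -> int:
    """Number of documents that contain at least one member of the set."""
    members = list(synset)
    return sum(1 for d in docs if any(s in d for s in members))


def synonym_score(docs: Sequence[str], synset: Iterable[str], length: int) -> float:
    """SynonymScore(T) = L * log2(N / df(T)) of equation (30)."""
    n = synonym_df(docs, synset)
    if n == 0:
        return 0.0          # assumption: equation (30) is undefined when df is 0
    return length * math.log2(len(docs) / n)
```

For the documents `["US trade", "America first", "x", "y"]` and the set {"US", "America"}, two of the four documents contain a member of the set, so with L = 2 the score is 2 × log₂(4/2) = 2.0.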

【００９４】Using a synonym dictionary in this way makes it possible to perform searches that take into account correspondences between character codes, such as full-width and half-width forms, and searches for information that has readings of names or aliases.

【００９５】Next, FIG. 3 shows an embodiment of a document search apparatus that uses the character string weight DP similarity with synonym dictionary information, SIM₄, to retrieve the document most similar to an input character string. This document search apparatus comprises a document database 10, a character string input unit 11, a search control unit 12, a search result output unit 13, a similarity calculation unit 18, a recursive execution control unit 19, and a synonym dictionary 20. The parts labeled 10 to 13 in the figure have the same functions and configurations as the parts with the same reference numerals in FIG. 1, so their description is omitted.

【００９６】類似度算出部１８は、式(25)又は(26)に基
づいて、類似度を算出する。この類似度を算出する途中
で、一部分の文字列について同様に類似度を求める必要
がある。これは、再帰実行制御部１９により、類似度算
出部１８を繰り返し用いることで実施する。式の記号の
表記に従って説明するため、図３で同じ文字列に対して
異なる記号を割り当て直している部分があるのは、図１
と同様である。The similarity calculation unit 18 calculates the similarity based on equation (25) or (26). In the course of this calculation, the similarity must likewise be obtained for parts of the character strings. This is done by having the recursive execution control unit 19 use the similarity calculation unit 18 repeatedly. As in FIG. 1, some parts of FIG. 3 reassign different symbols to the same character string so that the explanation follows the notation of the equations.

【００９７】同義語辞書２０は、同義語を登録してお
く。例えば、日本語では「米国」「合衆国」「アメリ
カ」「アメリカ合衆国」「メリケン」は同義なので、日
本語で検索するならこれらを同義語辞書２０の要素にT
＝{米国,合衆国,アメリカ,アメリカ合衆国,メリケン}を
登録しておく。表記のバラツキやよくある誤りに対応す
るために、読みを表す平仮名の「べいこく」や誤記「合
州国」を要素に加えておいても良い。同様に例えば、英
語なら「United States」「United States of Americ
a」「America」「U.S.A.」「U.S.」は同義なので、英語
で検索するなら同義語辞書２０の要素にT＝{United Sta
tes, United States of America, America, U.S.A., U.
S.}を登録しておく。文字コードの相違に対応するに
は、例えば、１バイト文字（半角文字）で表した「U.S.
A.」に加えて、２バイト文字（全角文字）で表した
「Ｕ．Ｓ．Ａ．」を要素に加えておくと好ましい。The synonym dictionary 20 stores synonyms. For example, in Japanese, 「米国」, 「合衆国」, 「アメリカ」, 「アメリカ合衆国」, and 「メリケン」 are synonymous, so for searching in Japanese, T = {米国, 合衆国, アメリカ, アメリカ合衆国, メリケン} is registered as an element of the synonym dictionary 20. To handle variations in notation and common errors, the hiragana reading 「べいこく」 or the misspelling 「合州国」 may also be added to the element. Similarly, in English, "United States", "United States of America", "America", "U.S.A.", and "U.S." are synonymous, so for searching in English, T = {United States, United States of America, America, U.S.A., U.S.} is registered as an element of the synonym dictionary 20. To handle differences in character codes, it is preferable to add, for example, 「Ｕ．Ｓ．Ａ．」 written in two-byte (full-width) characters in addition to "U.S.A." written in one-byte (half-width) characters.

【００９８】検索の目的にもよるが、語の意味の広狭も
許容することもできる。例えば、「帝都」「東京」「京
浜」「関東」は広狭はあるが、いずれも東京地方を示す
語である。この場合、同義語辞書２０の要素にT＝{帝
都,東京,京浜,関東}を登録すれば良い。この様に、同義
語辞書２０に、検索する手がかりとして有益な語（キー
ワード）の同義語を数多く登録しておくと、検索の漏れ
が少なくなって好適である。Depending on the purpose of the search, differences in the breadth of a word's meaning may also be tolerated. For example, 「帝都」, 「東京」, 「京浜」, and 「関東」 differ in breadth, but all denote the Tokyo region. In this case, T = {帝都, 東京, 京浜, 関東} may be registered as an element of the synonym dictionary 20. Registering many synonyms of words that are useful as search clues (keywords) in the synonym dictionary 20 in this way is preferable because it reduces missed results.

【００９９】類似度算出部１８は、同義語類似度算出部
２８、任意文字列類似度算出部２９、最大値選択部３０
により実施されている。同義語類似度算出部２８は、式
(26)のSIM_4s(α,β)を算出する。任意文字列類似度算出
部２９は、式(26)のSIM_4g(α,β)を算出する。最大値選
択部３０、これらに対して関数MAXを実施することで、
式(26)のSIM₄(α,β)を算出する。なお、類似度算出部
１８の受け取った文字列α、βの両方が空文字のとき
は、再帰実行制御部１９によりSIM₄(α,β) ＝0.0とす
ることで、式(25)を実施する。この際、同義語類似度算
出部２８、任意文字列類似度算出部２９、最大値選択部
３０は動作させない。The similarity calculation unit 18 is implemented by a synonym similarity calculation unit 28, an arbitrary character string similarity calculation unit 29, and a maximum value selection unit 30. The synonym similarity calculation unit 28 calculates SIM_4s(α, β) of equation (26). The arbitrary character string similarity calculation unit 29 calculates SIM_4g(α, β) of equation (26). The maximum value selection unit 30 applies the function MAX to these to calculate SIM₄(α, β) of equation (26). When both character strings α and β received by the similarity calculation unit 18 are empty, the recursive execution control unit 19 sets SIM₄(α, β) = 0.0, thereby implementing equation (25). In that case, the synonym similarity calculation unit 28, the arbitrary character string similarity calculation unit 29, and the maximum value selection unit 30 are not operated.

【０１００】同義語類似度算出部２８は、同義語分離制
御部６１、同義語分離類似度算出部６２、最大値選択部
６３により実施されており、式(26)のSIM_4s(α,β)を算
出する。同義語分離類似度算出部６２は、同義語分離部
７１、同義語スコア算出部７２、類似度算出部７３、加
算部７４により実施されている。The synonym similarity calculation unit 28 is implemented by a synonym separation control unit 61, a synonym separation similarity calculation unit 62, and a maximum value selection unit 63, and calculates SIM_4s(α, β) of equation (26). The synonym separation similarity calculation unit 62 is implemented by a synonym separation unit 71, a synonym score calculation unit 72, a similarity calculation unit 73, and an addition unit 74.

【０１０１】まず、同義語分離部７１は、同義語辞書２
０を参照し、その要素Ｔに含まれるξとηを用いて、同
義語類似度算出部２８が受け取った文字列α、βをα＝
ξγ、β＝ηδに分離することを試みる。かかる分離が
できない場合、同義語分離制御部６１は、式(28)に従い
SIM_4s(α,β)＝0.0とする。この場合、同義語分離制御
部６１は、文字列分離類似度算出部３２、最大値選択部
３３をそれ以上動作させない。First, the synonym separation unit 71 refers to the synonym dictionary 20 and, using ξ and η contained in one of its elements T, attempts to separate the character strings α and β received by the synonym similarity calculation unit 28 into α = ξγ and β = ηδ. If no such separation is possible, the synonym separation control unit 61 sets SIM_4s(α, β) = 0.0 in accordance with equation (28). In this case, the synonym separation control unit 61 does not operate the character string separation similarity calculation unit 32 or the maximum value selection unit 33 any further.

【０１０２】一方、かかる分離ができる場合は、同義語
分離制御部６１は、α＝ξγ、β＝ηδとなる全てのξ
とηに関して、同義語分離類似度算出部６２を動作させ
て、式(27)に含まれるSynonymScore(T)＋SIM₄(γ,δ)を
計算させる。そして、最も大きな値を最大値選択部６３
により選択する。このことにより、式(27)に示すSIM
_4s(α,β)が求まる。On the other hand, if such a separation is possible, the synonym separation control unit 61 operates the synonym separation similarity calculation unit 62 for every ξ and η such that α = ξγ and β = ηδ, and has it calculate SynonymScore(T) + SIM₄(γ, δ) in equation (27). The maximum value selection unit 63 then selects the largest value. This yields SIM_4s(α, β) of equation (27).

【０１０３】同義語分離部７１は、文字列αとβを、
ξ、γ、η、δに分離して、ξとγが属する同義語辞書
Dの要素Tを同義語スコア算出部７２に与え、γとδを類
似度算出部７３に与える。同義語スコア算出部７２は、
式(30)に基づき、文書データベース１０を参照して式(2
7)のSynonymScore(T)を算出する。類似度算出部７３
は、式(27)のSIM₄(γ,δ)を算出する。類似度算出部７
３は、実際には、再帰実行制御部１９により、類似度算
出部１８をγとδに対して適用することで、実施する。
加算部７４は、式(27)の加算を行う。The synonym separation unit 71 separates the character strings α and β into ξ, γ, η, and δ, passes the element T of the synonym dictionary D to which ξ and η belong to the synonym score calculation unit 72, and passes γ and δ to the similarity calculation unit 73. The synonym score calculation unit 72 calculates SynonymScore(T) of equation (27) based on equation (30), referring to the document database 10. The similarity calculation unit 73 calculates SIM₄(γ, δ) of equation (27). In practice, the similarity calculation unit 73 is implemented by the recursive execution control unit 19 applying the similarity calculation unit 18 to γ and δ. The addition unit 74 performs the addition of equation (27).

【０１０４】任意文字列類似度算出部２９は、類似度算
出部８１〜８３、最大値選択部８４により実施されてお
り、式(29)のSIM_4g(xα,yβ)を算出する。受け取った文
字列xα、yβの先頭の１文字x、yの有無に関する各場合
に対応して、類似度算出部８１、８２、８３は、それぞ
れ式(29)のSIM₄(α,yβ)、SIM₄(xα,β)、SIM₄(α,β)
を求める。類似度算出部８１〜８３は、実際には、再帰
実行制御部１９により、類似度算出部１８を、αとy
β、xαとβ、αとβのそれぞれに対して適用すること
で、実施する。最大値選択部８４は、式(29)の関数MAX
を実施する。The arbitrary character string similarity calculation unit 29 is implemented by similarity calculation units 81 to 83 and a maximum value selection unit 84, and calculates SIM_4g(xα, yβ) of equation (29). Corresponding to each case of keeping or dropping the leading characters x and y of the received character strings xα and yβ, the similarity calculation units 81, 82, and 83 obtain SIM₄(α, yβ), SIM₄(xα, β), and SIM₄(α, β) of equation (29), respectively. In practice, the similarity calculation units 81 to 83 are implemented by the recursive execution control unit 19 applying the similarity calculation unit 18 to α and yβ, to xα and β, and to α and β, respectively. The maximum value selection unit 84 performs the function MAX of equation (29).
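The recursion of equations (25) to (29), as divided among units 28 to 30, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the corpus `DOCS` and the dictionary `SYNONYM_DICT` are hypothetical, L in equation (30) is taken to be the length of ξ, a df of zero scores 0.0, and when only one string is empty the sketch simply drops characters from the other one (the text states equation (29) only for two non-empty strings). Memoization with `lru_cache` stands in for the repeated invocation by the recursive execution control unit 19.

```python
from functools import lru_cache
from math import log2

# Hypothetical toy data, for illustration only.
DOCS = ["米国の大統領", "アメリカの首都", "東京の天気", "関東の地図"]
SYNONYM_DICT = [frozenset({"米国", "アメリカ", "合衆国"}),
                frozenset({"東京", "関東"})]


def synonym_score(t, length):
    """Equation (30), with L assumed to be the length of the matched prefix."""
    df = sum(1 for doc in DOCS if any(word in doc for word in t))
    return length * log2(len(DOCS) / df) if df else 0.0


@lru_cache(maxsize=None)
def sim4(a, b):
    if not a and not b:
        return 0.0                         # equation (25)
    return max(sim4s(a, b), sim4g(a, b))   # equation (26)


def sim4s(a, b):
    """Equations (27) and (28): best split a = xi + gamma, b = eta + delta."""
    best = 0.0                             # equation (28): no split possible
    for t in SYNONYM_DICT:
        for xi in t:
            if not a.startswith(xi):
                continue
            for eta in t:
                if b.startswith(eta):
                    rest = sim4(a[len(xi):], b[len(eta):])
                    best = max(best, synonym_score(t, len(xi)) + rest)
    return best


def sim4g(a, b):
    """Equation (29): drop the leading character of a, of b, or of both."""
    candidates = []
    if a:
        candidates.append(sim4(a[1:], b))
    if b:
        candidates.append(sim4(a, b[1:]))
    if a and b:
        candidates.append(sim4(a[1:], b[1:]))
    return max(candidates)


print(sim4("米国", "アメリカ"))
```

The plain recursion revisits the same suffix pairs many times, so caching `sim4` on its two arguments keeps the number of distinct evaluations proportional to the product of the two string lengths.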

【０１０５】（第４実施例）(Fourth embodiment)

【０１０６】同義語辞書情報のある文字列重みのあるＤ
Ｐ類似度において、同義語辞書として対訳辞書を用いれ
ば、異なる言語の文書の類似度を求めることができる。
このことにより、検索文章と異なる言語の文書の検索、
いわゆる言語横断情報検索（Cross Lingual Informatio
n Retrieval）を行う装置の実施例を図４に示す。In the DP similarity with character string weights and synonym dictionary information, using a bilingual dictionary as the synonym dictionary makes it possible to obtain the similarity between documents in different languages. FIG. 4 shows an embodiment of an apparatus that uses this to search for documents written in a language different from that of the search text, that is, to perform so-called cross-lingual information retrieval.

【０１０７】この言語横断情報検索装置は、文書データ
ベース１０、文字列入力部１１、検索制御部１２、検索
結果出力部１３、類似度算出部１８、再帰実行制御部１
９、対訳辞書９１から構成されている。図中１０から１
３までの符号を付した部分は図１の同符号を付した部分
と同じ機能・構成を有し、図中１８と１９の符号を付し
た部分は図３の同符号を付した部分と同じ機能・構成を
有するので、説明を省略する。This cross-lingual information retrieval apparatus comprises a document database 10, a character string input unit 11, a search control unit 12, a search result output unit 13, a similarity calculation unit 18, a recursive execution control unit 19, and a bilingual dictionary 91. The parts numbered 10 to 13 in the figure have the same functions and configurations as the parts with the same numbers in FIG. 1, and the parts numbered 18 and 19 have the same functions and configurations as the parts with the same numbers in FIG. 3, so their description is omitted.

【０１０８】この言語横断情報検索装置は、文字列入力
部１１から入力された検索文章と、最も類似度の高い文
書を文書データベース１０から検索して出力する。検索
文章と、文書データベース１０の文書は異なる言語で記
載する。例えば、前者は日本語で、後者は英語で記載す
る。すると、日本語の文書を文字列入力部１１から入力
することで、最も類似した英語の文書を文書データベー
ス１０から検索することが実施できる。This cross-language information search device searches the document database 10 for a document having the highest similarity to the search sentence input from the character string input unit 11 and outputs it. The search text and the document in the document database 10 are described in different languages. For example, the former is written in Japanese and the latter is written in English. Then, by inputting a Japanese document from the character string input unit 11, the most similar English document can be searched from the document database 10.

【０１０９】対訳辞書９１は、図３に示す同義語辞書２
０と同様の機能を有するが、登録されている同義語が異
なる言語の組合せになっている点で異なる。それぞれの
言語は、文字列入力部１１から入力する検索文章の言語
と、文書データベース１０の文書の言語に一致させるの
は、言うまでもない。例えば、日本語の「アメリカ人」
は、英語の「American」と同義なので、T={アメリカ人,
American}を対訳辞書の要素として登録しておく。もち
ろん、複数の同義語を含めて、例えば、T={アメリカ人,
米国人,American}としておくと、検索の漏れが少なくな
ってより好ましい。The bilingual dictionary 91 has the same function as the synonym dictionary 20 shown in FIG. 3, but differs in that the registered synonyms are combinations of words from different languages. Needless to say, these languages are made to match the language of the search text input from the character string input unit 11 and the language of the documents in the document database 10. For example, the Japanese 「アメリカ人」 is synonymous with the English "American", so T = {アメリカ人, American} is registered as an element of the bilingual dictionary. Including several synonyms, for example T = {アメリカ人, 米国人, American}, is even better because it reduces missed results.

【０１１０】（第５実施例）(Fifth Embodiment)

【０１１１】次に本発明による文章検索をソフトウエア
により実施する例を説明する。図５に、ソフトウエアの
実行に用いる計算機システムの一例を示す。この計算機
システムは、ディスプレイ１０１、プリンタ１０２、キ
ーボード１０３、フロッピー（登録商標）ディスク装置
１０４、ＣＤ−ＲＯＭ（Compact Disk− Read Only Mem
ory）装置１０５、読み出し専用メモリ（Read Only Mem
ory。以下、ＲＯＭ）１０６、読み書き可能なランダム
アクセスメモリ（Random Access Memory。以下、ＲＡ
Ｍ）１０７、磁気ディスク装置１０８、中央処理装置
（Central Processing Unit。以下、ＣＰＵ）、通信イ
ンターフェイス１１０、及び、これらを接続するバス１
１１から構成されている。フロッピーディスク装置１０
４はフロッピーディスク１１２の読み書きを行い、ＣＤ
−ＲＯＭ装置１０４はＣＤ−ＲＯＭ１１３の読み出しを
行う。また、通信インターフェイス１１０により、本計
算機システムは通信ネットワーク１１４に接続されてい
る。Next, an example in which the text search according to the present invention is implemented in software will be described. FIG. 5 shows an example of a computer system used to run the software. This computer system comprises a display 101, a printer 102, a keyboard 103, a floppy (registered trademark) disk device 104, a CD-ROM (Compact Disk-Read Only Memory) device 105, a read-only memory (hereinafter, ROM) 106, a readable and writable random access memory (hereinafter, RAM) 107, a magnetic disk device 108, a central processing unit (hereinafter, CPU) 109, a communication interface 110, and a bus 111 connecting them. The floppy disk device 104 reads and writes a floppy disk 112, and the CD-ROM device 105 reads a CD-ROM 113. The computer system is connected to a communication network 114 through the communication interface 110.

【０１１２】本発明を実施する文章検索プログラムは、
ＲＯＭ１０６に記憶しておく。あるいは、フロッピーデ
ィスク１１２、ＣＲ−ＲＯＭ１１３、又は、磁気ディス
ク装置１０８に文章検索プログラムを記憶しておき、Ｒ
ＡＭ１０７に転送した後、ＣＰＵ１０９が実行するので
も良い。ＣＰＵ１０９は、ＲＡＭ１０７を作業領域に使
って文章検索プログラムを実行する。必要に応じて、磁
気ディスク装置１０８を作業領域に使っても良い。文章
検索プログラムの実行の指示はキーボード１０３から行
い、実行結果は、ディスプレイ１０１、又は、プリンタ
１０２に出力する。文章検索プログラムの実行を、フロ
ッピーディスク１１２から指示することや、実行結果を
フロッピーディスク１１２に書き込んでも良いのは言う
までもない。The text search program that implements the present invention is stored in the ROM 106. Alternatively, the text search program may be stored on the floppy disk 112, the CD-ROM 113, or the magnetic disk device 108, transferred to the RAM 107, and then executed by the CPU 109. The CPU 109 executes the text search program using the RAM 107 as a work area. If necessary, the magnetic disk device 108 may also be used as a work area. Execution of the text search program is instructed from the keyboard 103, and the results are output to the display 101 or the printer 102. Needless to say, execution may instead be instructed from the floppy disk 112, and the results may be written to the floppy disk 112.

【０１１３】文書データベースと、同義語辞書（必要な
場合）は、フロッピーディスク１１２、ＣＤ−ＲＯＭ１
１３、又は、磁気ディスク１０８に蓄えておく。高速な
アクセスのためにＲＡＭ１０７に転送しておくのでも良
い。ＲＡＭ１０７に転送する際に、容易に処理できる形
式に変換するのも良い。また、文章検索プログラム、文
書データベース、同義語辞書（必要な場合）、又は、実
行の指示を、ネットワーク１１４経由で本計算機システ
ムに入力したり、実行の結果をネットワーク１１４経由
で本計算機システムから出力したりしても良いことは、
もちろんである。The document database and the synonym dictionary (if needed) are stored on the floppy disk 112, the CD-ROM 113, or the magnetic disk 108. They may be transferred to the RAM 107 for fast access, and may be converted into an easily processed format when transferred. Of course, the text search program, the document database, the synonym dictionary (if needed), or the execution instruction may be input to the computer system via the network 114, and the execution results may be output from the computer system via the network 114.

【０１１４】また、図に示されたものに限らず、各種の
記録媒体、入力手段、出力手段を用いて、本計算機シス
テムへの入力と出力を行うなど各種の実施態様への変形
が可能なことは言うまでもない。これらの、記録媒体、
入力手段、出力手段は本計算機システムが直接アクセス
するものの他、通信ネットワークを経由してアクセスす
るものであっても良いのはもちろんである。Needless to say, the invention is not limited to what is shown in the figures and can be modified into various embodiments, for example by using various recording media, input means, and output means for input to and output from the computer system. Of course, these recording media, input means, and output means may be accessed directly by the computer system or accessed via a communication network.

【０１１５】文字列重みのあるＤＰ類似度SIM₃による文
書検索プログラムの処理フローを図６から図１０に示
す。本プログラムは、検索文章を入力し、文章データベ
ースを検索し、類似度の高い複数の文書を出力する。FIGS. 6 to 10 show the processing flow of a document search program based on SIM₃, the DP similarity with character string weights. The program takes a search text as input, searches the document database, and outputs multiple documents with high similarity.

【０１１６】図６は、検索文章に基づいて文書データベ
ースを検索し、類似度の高い文書を選び出して出力する
処理のフローを示す。まず、ステップＳ１１（以下、Ｓ
１１と略記）で、ある文字列の出現回数を効率よく計算
する準備のために、文書データベースに含まれる全文書
を総合してサフィックスファイル（Suffix File）を作
成する。サフィックスファイルの作り方と利用の仕方
は、例えばMikio Yamamoto and Kenneth W.Church, Usi
ng Suffix Arrays to Compute Term Frequency and Doc
ument Frequency for All Substrings in a Corpus, In
proceeding of 6th Workshop on Very Large Corpora,
Ed. Eugene Charniak, Motreal, pp28-37, 1998に開示
されている。FIG. 6 shows the flow of processing that searches the document database based on a search text and selects and outputs documents with high similarity. First, in step S11 (hereinafter abbreviated as S11), a suffix file is created from all documents in the document database combined, in preparation for efficiently counting the occurrences of a character string. How to create and use a suffix file is disclosed in, for example, Mikio Yamamoto and Kenneth W. Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, In Proceedings of the 6th Workshop on Very Large Corpora, Ed. Eugene Charniak, Montreal, pp. 28-37, 1998.

【０１１７】サフィックスファイルを使うと、ある文字
列が文書に出現する回数を高速に求めることができる。
サフィックスファイルは、全ての文書において生じうる
部分文字列を、文字コードに従ってソートして（文字コ
ードの順に並べ替えて）、通し番号（サフィックス）を
付けておくことで実施する。文字列が文書に出現する回
数は、その文字列と一致する文字列がサフィックスファ
イルの中にいくつあるかを算出することで求められる。By using a suffix file, the number of times a character string appears in a document can be obtained at high speed.
The suffix file is implemented by sorting partial character strings that can occur in all documents according to character codes (sorting them in the order of character codes) and adding serial numbers (suffixes). The number of times a character string appears in a document can be obtained by calculating the number of character strings that match the character string in the suffix file.

【０１１８】具体的には、まず、一致する文字列をサフ
ィックスファイルの先頭から探してそのサフィックスmi
nを求める。もし、一致する文字列がないなら、文書に
出現する回数は0である。次に、一致する文字列をサフ
ィックスファイルの末尾から探してそのサフィックスma
xを求める。すると、文字列が文書に出現する回数tf
は、 tf＝max−min＋1 (31) で求められる。なお、文書データベースの文書は、文書
番号（Document ID）によって互いに区別するものとす
る。サフィックスファイルに登録する部分文字列には、
この文書番号を付けておき、その部分文字列がどの文書
に由来するかが分かるようにしておく。Specifically, the suffix file is first searched from its beginning for a matching character string, and the suffix min of that entry is obtained. If there is no matching character string, the number of occurrences in the documents is 0. Next, the suffix file is searched from its end for a matching character string, and the suffix max of that entry is obtained. The number of times tf that the character string appears in the documents is then given by tf = max − min + 1 (31). The documents in the document database are distinguished from one another by a document number (Document ID). Each partial character string registered in the suffix file is tagged with this document number so that the document it came from can be identified.
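The counting scheme of equation (31) can be sketched in Python as below. It is an illustrative sketch, not the patent's code: all function names are hypothetical, the suffix file is held in memory as a sorted list of (suffix, document number) pairs, and binary search with `bisect` is swapped in for the searches from the beginning and from the end of the file. Rebuilding the prefix list on every query is done only for brevity.

```python
from bisect import bisect_left, bisect_right


def build_suffix_file(docs):
    """Every suffix of every document, sorted, tagged with its document number."""
    return sorted((doc[i:], doc_id)
                  for doc_id, doc in enumerate(docs)
                  for i in range(len(doc)))


def find_range(suffix_file, a):
    """Return (min, max) positions of the entries that start with a, or None."""
    # Truncating sorted suffixes to len(a) characters keeps them sorted.
    prefixes = [suffix[:len(a)] for suffix, _ in suffix_file]
    lo = bisect_left(prefixes, a)
    hi = bisect_right(prefixes, a) - 1
    return (lo, hi) if lo <= hi else None


def term_frequency(suffix_file, a):
    found = find_range(suffix_file, a)
    if found is None:
        return 0                         # a does not occur in any document
    return found[1] - found[0] + 1       # equation (31): tf = max - min + 1


def document_frequency(suffix_file, a):
    found = find_range(suffix_file, a)
    if found is None:
        return 0
    # Count the distinct document numbers attached to the matching entries.
    return len({doc_id for _, doc_id in suffix_file[found[0]:found[1] + 1]})


suffix_file = build_suffix_file(["abab", "bab"])
print(term_frequency(suffix_file, "ab"), document_frequency(suffix_file, "ab"))
```

Because every occurrence of a character string is the prefix of exactly one suffix, the matching entries form one contiguous run in the sorted file, which is why the two end positions are enough to count them.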

【０１１９】次にＳ１２で、検索文章を文字列Ｘに読み
込む。Ｓ１３では、文書データベースの文書から、ある
一つの文書を選んで文字列Yに読み込む。次にＳ１４
で、文字列Xと文字列Yの類似度を計算する。Ｓ１４で行
う処理は、図７を用いて後述する。Ｓ１５では、求めた
類似度と文書番号を組として文書管理テーブルに登録す
る。Ｓ１６では、文書データベースに含まれる全ての文
書について類似度を計算したかどうか判定する。もし、
まだ全ての文書について計算していなければ、計算を未
だ行っていない文書をＳ１３で選んで文字列Yに読み込
み、Ｓ１５までの処理を繰り返す。もし、全ての文書に
ついて計算していれば、Ｓ１７で、登録したテーブルを
類似度の高い順に並び替える。Ｓ１８では類似の高い文
書を出力する処理を行う。出力する文書は、一つだけに
する、あるいは、所定の複数にする、あるいは、所定の
類似度以上の全ての文書にする、など種々の態様が可能
である。Next, in S12, the search text is read into the character string X. In S13, one document is selected from the document database and read into the character string Y. In S14, the similarity between X and Y is calculated; the processing in S14 is described later with reference to FIG. 7. In S15, the obtained similarity and the document number are registered as a pair in the document management table. In S16, it is determined whether the similarity has been calculated for every document in the document database. If not, a document not yet processed is selected in S13 and read into Y, and the processing up to S15 is repeated. If all documents have been processed, the registered table is sorted in descending order of similarity in S17. In S18, the documents with high similarity are output. Various output modes are possible: a single document, a predetermined number of documents, or all documents whose similarity is at or above a predetermined value.
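The loop of S12 to S18 can be sketched in Python as follows. The names `search` and `shared_chars` are hypothetical; `shared_chars` is only a stand-in so that the example runs, and any similarity function of two strings can be passed in its place. Returning the top `top_k` rows corresponds to the "predetermined number of documents" output mode.

```python
def search(query, documents, similarity, top_k=3):
    """Rank documents by similarity to the query, following S12 to S18."""
    table = []                                          # document management table
    for doc_id, text in enumerate(documents):           # S13, S16: visit every document
        table.append((similarity(query, text), doc_id)) # S14, S15
    table.sort(key=lambda row: row[0], reverse=True)    # S17: highest similarity first
    return table[:top_k]                                # S18


def shared_chars(x, y):
    """Hypothetical stand-in similarity: number of distinct shared characters."""
    return float(len(set(x) & set(y)))


print(search("abc", ["xyz", "abz", "abc"], shared_chars, top_k=2))
```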

【０１２０】類似度は、式(15)に示す文字列スコアScor
e(ξ)を足し合わせることによって求める。この足し合
わせの途中の値を以下では、スコアと呼ぶ。式(11)と(1
3)の関数MAXから分かるように、文字列スコアを加算し
たスコアや長さを１文字ずつ変えた部分文字列に対する
スコアを計算し、それらの最大値を選択する必要があ
る。その計算と選択を効率よく進めるために、図１１に
示すスコア表を利用する。このスコア表は、横方向が文
字列Xに、縦方向が文字列Yに対応している。以下では、
文字列Ｘの長さをx_lenとし、文字列Ｙの長さをy_lenと
する。このスコア表に対して、左上から右下に向かって
値が増えるようにスコアを埋めることで、式(10)、(1
1)、(13)の関数MAXを実施する。The similarity is obtained by summing the character string scores Score(ξ) given by equation (15). An intermediate value of this summation is called a score below. As the function MAX in equations (11) and (13) shows, it is necessary to compute the scores obtained by adding character string scores and the scores for partial character strings whose length is varied one character at a time, and to select their maximum. To carry out this computation and selection efficiently, the score table shown in FIG. 11 is used. In this score table, the horizontal direction corresponds to the character string X and the vertical direction to the character string Y. Below, the length of X is x_len and the length of Y is y_len. Filling in the scores of this table so that the values increase from the upper left to the lower right implements the function MAX of equations (10), (11), and (13).

【０１２１】具体的には、文字列Xのi文字目までと、文
字列Yのj文字目までに対応するスコアをscore[i][j]と
する。文字列Ｘと文字列Ｙにおいて、一致する部分の文
字列γが、文字列Ｘのi文字目からi＋m文字目までと文
字列Ｙのj文字目からj＋n文字目までであり、score[i]
[j]に部分文字列γの文字列スコアを加算した値をvalue
とする。そして、 score[i＋m＋1][j＋n＋1]≧value (32) という関係を成り立たせるため、score[i＋m＋1][j＋n
＋1]がvalueより小さければ、valueをscore[i＋m＋1][j
＋n＋1]に書き込む。この処理を全ての部分文字列γに
関して行えば、式(11)の関数MAXと、式(10)のSIM_3s(α,
β)に関する関数MAXの実施となる。Specifically, let score[i][j] be the score corresponding to the character string X up to its i-th character and the character string Y up to its j-th character. Suppose a matching partial character string γ spans the i-th to (i+m)-th characters of X and the j-th to (j+n)-th characters of Y, and let value be score[i][j] plus the character string score of γ. To make the relation score[i+m+1][j+n+1] ≥ value (32) hold, value is written to score[i+m+1][j+n+1] whenever score[i+m+1][j+n+1] is smaller than value. Doing this for every partial character string γ implements the function MAX of equation (11) and the function MAX over SIM_3s(α, β) in equation (10).

【０１２２】また、 score[i＋1][j]≧score[i][j] (33) score[i][j＋1]≧score[i][j] (34) score[i＋1][j＋1]≧score[i][j] (35) という関係を成り立たせるため、score[i＋1][j]、scor
e[i][j＋1]、score[i＋1][j＋1]がscore[i][j]よりも小
さければ、それぞれscore[i][j]を書き込む。iとjを逐
次増やしながらこの処理を行えば、式(13)の関数MAX
と、式(10)のSIM_3g(α,β)に関する関数MAXの実施とな
る。In addition, to make the relations score[i+1][j] ≥ score[i][j] (33), score[i][j+1] ≥ score[i][j] (34), and score[i+1][j+1] ≥ score[i][j] (35) hold, score[i][j] is written to each of score[i+1][j], score[i][j+1], and score[i+1][j+1] that is smaller than score[i][j]. Performing this while successively increasing i and j implements the function MAX of equation (13) and the function MAX over SIM_3g(α, β) in equation (10).

【０１２３】これらの比較と書きこみを、図７に示すフ
ローに従って順に処理を進め、スコア表を完成させる。
すると、図１１の右下のscore[x_len][y_len]がスコア
表の中で最も大きな値となり、これが求めるべき類似度
SIM₃(X,Y)となる。なお、このスコア表により類似度を
求める手順は、ＤＰ手法の実施例となっている。These comparisons and writes are processed in order according to the flow shown in FIG. 7 until the score table is complete. Then score[x_len][y_len], at the lower right of FIG. 11, is the largest value in the score table and is the similarity SIM₃(X, Y) to be obtained. This procedure for obtaining the similarity from the score table is an embodiment of the DP method.

【０１２４】図７は、検索文章を読み込んだ文字列Ｘ
と、文書データベースの文書を読み込んだ文字列Ｙの類
似度を求める処理のフローを示す。以下では、文字列Ｘ
のi文字目からm文字の部分文字列をX[i,m]で、文字列Ｙ
のj文字目からn文字の部分文字列をY[j,n]で示す。例え
ば、文字列Xの先頭の２文字からなる部分文字列は、X
[0,2]である。FIG. 7 shows the flow of processing for obtaining the similarity between the character string X, into which the search text has been read, and the character string Y, into which a document from the document database has been read. Below, the partial character string of m characters starting at the i-th character of X is written X[i,m], and the partial character string of n characters starting at the j-th character of Y is written Y[j,n]. For example, the partial character string consisting of the first two characters of X is X[0,2].

【０１２５】まず始めに、図７に示すＳ２１で、文字列
Xと文字列Yのスコア表score[i][j]の全ての値を０に初
期化する。次に、Ｓ２２では、文字列Xから部分文字列
を取り出す先頭を示すiを0に設定する。Ｓ２３では、文
字列Ｙから部分文字列を取り出す先頭を示すjを0に設定
する。Ｓ２４では、スコア表において現在着目している
スコアscore[i][j]を、処理の便宜のために、一時的な
変数currentに記憶する。First, in S21 shown in FIG. 7, all values of the score table score[i][j] for the character strings X and Y are initialized to 0. Next, in S22, i, the starting position for extracting a partial character string from X, is set to 0. In S23, j, the starting position for extracting a partial character string from Y, is set to 0. In S24, the score currently of interest in the score table, score[i][j], is stored in a temporary variable current for convenience of processing.

【０１２６】Ｓ２５では、文字列Xと文字列Yの一致して
いる部分の長さを示すkを0に設定する。Ｓ２６では、文
字列Xにおいて一致を判定する部分を示すi＋kが文字列X
の長さx_len未満であり、かつ、文字列X、Yそれぞれの
１文字X[i＋k,1]、Y[i＋k,1]が一致しているかを判定し
ている。もし、この条件が成り立つならＳ２７へ進み、
この条件が成り立たなければＳ３２へ進む。In S25, k, the length of the matching portion of X and Y, is set to 0. In S26, it is determined whether i+k, the position in X being tested for a match, is less than the length x_len of X, and whether the single characters X[i+k,1] and Y[j+k,1] of X and Y match. If this condition holds, processing proceeds to S27; otherwise it proceeds to S32.

【０１２７】Ｓ２７からＳ３１は、一致している部分に
関して文字列スコアをスコア表に加味しながら、一致し
ている部分の長さを逐次増やす処理である。Ｓ２６から
Ｓ２７に進んだときは、部分文字列X[i,k＋1]と部分文
字列Y[j,k＋1]は一致している。まず、Ｓ２６で、部分
文字列X[i,k＋1]の文字列スコアを求める。これは式(1
5)のScore(ξ)を求める処理であり、図８から図１０を
用いて後述する。この文字列スコアをtmp_scoreとす
る。次に、Ｓ２８で、tmp_scoreを現在着目しているス
コア(current)に加算して、この値を一時的な変数value
に記憶する。Ｓ２９とＳ３０は、式(32)を成り立たせる
ための、比較と書き込みである。Ｓ３１では、一致して
いる部分を長くしてＳ２６からＳ３０を繰り返すために
kを1増やす。S27 to S31 successively lengthen the matching portion while reflecting its character string score in the score table. When processing proceeds from S26 to S27, the partial character strings X[i,k+1] and Y[j,k+1] match. First, in S27, the character string score of X[i,k+1] is obtained. This is the processing that obtains Score(ξ) of equation (15) and is described later with reference to FIGS. 8 to 10. Let this character string score be tmp_score. Next, in S28, tmp_score is added to the score currently of interest (current), and the result is stored in a temporary variable value. S29 and S30 are the comparison and write that make relation (32) hold. In S31, k is incremented by 1 so that S26 to S30 are repeated with a longer matching portion.

【０１２８】Ｓ３２とＳ３３、Ｓ３４とＳ３５、Ｓ３６
とＳ３７は、それぞれ、式(33)、式(34)、式(35)を成り
立たせるための、比較と書き込みである。S32 and S33, S34 and S35, and S36 and S37 are the comparisons and writes that make relations (33), (34), and (35) hold, respectively.

【０１２９】Ｓ３８では、文字列Ｙから部分文字列を取
り出す先頭を示すjが文字列Ｙの長さy_len未満であるか
を判定している。もし、この条件が成り立つなら、Ｓ３
９でjを1増やしてＳ２４からＳ３７までの処理を繰り返
す。文字列Yの末尾まで処理が終われば、Ｓ３８からＳ
４０へ進む。In S38, it is determined whether j, the starting position for extracting a partial character string from Y, is less than the length y_len of Y. If so, j is incremented by 1 in S39 and the processing from S24 to S37 is repeated. Once the end of Y has been processed, processing proceeds from S38 to S40.

【０１３０】Ｓ４０では、文字列Xから部分文字列を取
り出す先頭を示すiが文字列Xの長さx_len未満であるか
を判定している。もし、この条件が成り立つ場合は、Ｓ
４１でiを1増やして、Ｓ２３からＳ３８までの処理を繰
り返す。文字列Xの末尾まで処理が終われば、Ｓ４０か
らＳ４２に進む。In S40, it is determined whether i, the starting position for extracting a partial character string from X, is less than the length x_len of X. If so, i is incremented by 1 in S41 and the processing from S23 to S38 is repeated. Once the end of X has been processed, processing proceeds from S40 to S42.

【０１３１】もし、文字列Xの末尾まで処理が終われ
ば、スコア表は完成している。そこで、Ｓ４２で、scor
e[x_len][y_len]を文字列Ｘと文字列Ｙの類似度として
返す処理を行う。Once the end of X has been processed, the score table is complete. In S42, score[x_len][y_len] is therefore returned as the similarity between X and Y.
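The flow of FIG. 7 (S21 to S42) can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patent's program: `string_score` is a caller-supplied stand-in for Score(ξ) of equation (15), the example uses the substring length as a hypothetical score, and the match test in S26 is assumed to compare X[i+k] with Y[j+k] and to check the bound on Y as well as on X.

```python
def dp_similarity(x, y, string_score):
    """Fill the score table of FIG. 11 and return score[x_len][y_len]."""
    x_len, y_len = len(x), len(y)
    score = [[0.0] * (y_len + 1) for _ in range(x_len + 1)]        # S21
    for i in range(x_len + 1):                                     # S22, S40, S41
        for j in range(y_len + 1):                                 # S23, S38, S39
            current = score[i][j]                                  # S24
            k = 0                                                  # S25
            while (i + k < x_len and j + k < y_len
                   and x[i + k] == y[j + k]):                      # S26
                value = current + string_score(x[i:i + k + 1])     # S27, S28
                if score[i + k + 1][j + k + 1] < value:            # S29
                    score[i + k + 1][j + k + 1] = value            # S30, relation (32)
                k += 1                                             # S31
            if i < x_len and score[i + 1][j] < current:            # S32
                score[i + 1][j] = current                          # S33, relation (33)
            if j < y_len and score[i][j + 1] < current:            # S34
                score[i][j + 1] = current                          # S35, relation (34)
            if i < x_len and j < y_len and score[i + 1][j + 1] < current:  # S36
                score[i + 1][j + 1] = current                      # S37, relation (35)
    return score[x_len][y_len]                                     # S42


# Hypothetical score: one point per matched character.
print(dp_similarity("abcd", "abxd", lambda s: float(len(s))))
```

Visiting the cells row by row works because every write goes to a cell with a larger row or column index, so score[i][j] is final by the time it is read into `current`.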

【０１３２】図８に、サフィックスファイルを利用して
文字列aの文字列スコアscを算出する処理のフローを示
す。文字列aには、図７のＳ２７で、部分文字列X[i,k＋
1]が与えられている。まず、図８のＳ５１で、文字列a
の出現する文書の数dfを求める。Ｓ５１の具体的な処理
は、図９を用いて後述する。次にＳ５２で、全文書にお
いて文字列aの出現する回数tfを求める。Ｓ５２の具体
的な処理は、図１０を用いて後述する。FIG. 8 shows the flow of processing that calculates the character string score sc of a character string a using the suffix file. In S27 of FIG. 7, the partial character string X[i,k+1] is passed in as a. First, in S51 of FIG. 8, the number of documents df in which a appears is obtained; the details of S51 are described later with reference to FIG. 9. Next, in S52, the number of times tf that a appears across all documents is obtained; the details of S52 are described later with reference to FIG. 10.

【０１３３】Ｓ５３では、文字列の出現回数tfが２回未
満かどうかを判定し、２回未満ならＳ５４で一時的な変
数scに0.0を与える。これは、文書中に稀にしか現れな
い文字列は、意味のある語ではない、すなわち、情報検
索に有益ではないと考えられるからである。文字列の出
現回数tfが２回以上の場合は、Ｓ５５の判定を行う。In S53, it is determined whether the number of occurrences tf is less than 2; if so, the temporary variable sc is set to 0.0 in S54. This is because a character string that appears only rarely in the documents is considered not to be a meaningful word, that is, not useful for information retrieval. If tf is 2 or more, the determination in S55 is made.

【０１３４】Ｓ５５は、全文書の数Nに対する、文字列a
の出現する文書の数dfの比が、５％より大きいかを判定
する処理である。５％よりも大きい場合には、Ｓ５４で
スコアscに0.0を与える。これは、文書中に頻出する文
字列は、区切りの働きをする語（ストップワード）であ
って有用な文字列ではない、すなわち、情報検索に有益
ではないと考えられるからである。S55 determines whether the ratio of df, the number of documents in which a appears, to N, the total number of documents, is greater than 5%. If it is, the score sc is set to 0.0 in S54. This is because a character string that appears frequently in the documents is considered to be a word that acts as a delimiter (a stop word) rather than a useful character string, that is, not useful for information retrieval.

【０１３５】Ｎとdfの比が５％以下の場合にはＳ５６
で、式(15)によりスコアscを求める計算を行う。最後
に、Ｓ５７でscを文字列スコアとして返す。If the ratio of df to N is 5% or less, the score sc is calculated by equation (15) in S56. Finally, sc is returned as the character string score in S57.
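The two filters of FIG. 8 (S53 to S57) can be sketched in Python as follows. Equation (15) is not reproduced in this part of the description, so the sketch takes it as a caller-supplied function `score_eq15`; the stand-in `idf_stand_in` is purely hypothetical and should not be read as the patent's formula.

```python
from math import log2


def string_score(a, tf, df, n_docs, score_eq15):
    """Return the character string score sc of a, following S53 to S57."""
    if tf < 2:
        return 0.0                      # S53 -> S54: too rare to be a meaningful word
    if df / n_docs > 0.05:
        return 0.0                      # S55 -> S54: too frequent, treated as a stop word
    return score_eq15(a, df, n_docs)    # S56, S57


def idf_stand_in(a, df, n_docs):
    """Hypothetical placeholder for equation (15)."""
    return len(a) * log2(n_docs / df)


print(string_score("ab", tf=4, df=4, n_docs=128, score_eq15=idf_stand_in))
```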

【０１３６】図９に、文字列aの出現する文書の数dfを
求める処理のフローを示す。この処理では、同一の文字
列に対する処理時間を短縮するために、文字列aと計算
したdfを、文書の数を記憶するためのハッシュテーブル
（以下、文書数ハッシュテーブル）に登録することで、
再度の計算を不要としている。まず、Ｓ６１で、文字列
aが文書数ハッシュテーブルに登録されているかを判定
し、もし、登録済みならＳ６２で登録されているdfを求
める。一方、文書数ハッシュテーブルに登録されてない
なら、Ｓ６３からＳ６８においてdfを算出する。FIG. 9 shows the flow of processing that obtains df, the number of documents in which the character string a appears. To shorten processing for repeated character strings, a and its calculated df are registered in a hash table that stores document counts (hereinafter, the document count hash table), which makes recalculation unnecessary. First, in S61, it is determined whether a is registered in the document count hash table; if it is, the registered df is retrieved in S62. If it is not, df is calculated in S63 to S68.

【０１３７】S63 is a process that searches the suffix file for the character string a in order from the beginning and sets min to the suffix found. If min cannot be obtained, that is, if the suffix file does not contain the character string a, then the character string a does not appear in any document. This case is detected in S64, and df is set to 0 in S65.

【０１３８】S66 is a process that searches the suffix file for the character string a in order from the end and sets max to the suffix found. In the suffix file, the suffixes in the range from min to max are the character strings that match the character string a. S67 is a process that counts the distinct document numbers attached to these character strings. This count is df, the number of documents in which the character string a appears.

【０１３９】In S68, the character string a and the document count df are registered in the document count hash table. S69 is a process that returns the document count df.
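The flow of FIG. 9 can be sketched as below. The sketch assumes that the suffix file is held in memory as a sorted list of (suffix, document number) pairs and that the document count hash table is an ordinary dictionary; the names `build_suffix_file` and `document_frequency` are hypothetical. A simple linear scan stands in for the searches from the beginning and from the end of the file.

```python
from typing import Dict, List, Tuple

SuffixFile = List[Tuple[str, int]]  # (suffix text, document number), sorted


def build_suffix_file(docs: List[str]) -> SuffixFile:
    """Collect every suffix of every document, tagged with its document number."""
    entries = [(doc[p:], doc_no)
               for doc_no, doc in enumerate(docs)
               for p in range(len(doc))]
    entries.sort()  # suffixes sharing a prefix become contiguous
    return entries


def document_frequency(suffix_file: SuffixFile, a: str, cache: Dict[str, int]) -> int:
    if a in cache:                       # S61: already registered?
        return cache[a]                  # S62
    hits = [k for k, (s, _) in enumerate(suffix_file) if s.startswith(a)]
    if not hits:                         # S64: a does not appear in any document
        df = 0                           # S65
    else:
        lo, hi = hits[0], hits[-1]       # S63 (min) and S66 (max)
        # S67: count the distinct document numbers in the range min..max
        df = len({doc_no for _, doc_no in suffix_file[lo:hi + 1]})
    cache[a] = df                        # S68: register for later calls
    return df                            # S69
```

For example, with the documents "abcab", "xbc", and "zzz", the string "bc" is found in two documents and "ab" in one, and a second call for either string is answered from the cache.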

【０１４０】FIG. 10 shows the flow of the process for obtaining tf, the number of times the character string a appears across all documents. S71 is a process that searches the suffix file for the character string a in order from the beginning and sets min to the suffix found. If min cannot be obtained, that is, if the suffix file does not contain the character string a, then the character string a does not appear in any document. This case is detected in S72, and tf is set to 0 in S73.

【０１４１】S74 is a process that searches the suffix file for the character string a in order from the end and sets max to the suffix found. In S75, equation (31) is evaluated. S76 is a process that returns the number of occurrences tf.
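A minimal sketch of the flow of FIG. 10 follows. Equation (31) is not reproduced in this part of the description; the sketch assumes it amounts to the size of the range from min to max, that is, max - min + 1. The function name `term_frequency` and the plain sorted list used as the suffix file are likewise assumptions for illustration.

```python
from typing import List


def term_frequency(suffixes: List[str], a: str) -> int:
    """Count occurrences of a from the range of matching suffixes."""
    # S71: scan from the beginning for the first suffix that starts with a
    lo = next((k for k, s in enumerate(suffixes) if s.startswith(a)), None)
    if lo is None:                       # S72: a does not appear
        return 0                         # S73
    # S74: scan from the end for the last suffix that starts with a
    hi = next(k for k in range(len(suffixes) - 1, -1, -1)
              if suffixes[k].startswith(a))
    return hi - lo + 1                   # S75 (assumed form of equation (31)), S76


docs = ["abcab", "xbc"]
suffixes = sorted(doc[p:] for doc in docs for p in range(len(doc)))
```

Because the suffix list is sorted, every suffix that begins with a lies between the two positions found, so the size of the range equals the number of occurrences. A binary search could replace the two linear scans.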

【０１４２】When equations (36) to (39) are used instead of equation (15), the flows shown in FIGS. 12 and 13 may be used in place of the flows shown in FIGS. 8 and 9, respectively. The same flows can also be used with equation (36) and equations (40) to (41). The processes S82, S83, S84, S85, and S89 in FIG. 12 are equivalent to S52, S53, S54, S55, and S57 in FIG. 8, respectively, so their description is omitted. S81 in FIG. 12 obtains df₁, the number of documents in which the character string a appears, and df₂, the number of documents in which the character string a appears two or more times. The details of S81 are described later with reference to FIG. 13. S86 in FIG. 12 obtains the occurrence concentration given by equation (36). S87 is a process that compares the occurrence concentration with a positive real constant K (for example, 2.0) and proceeds to S88 or S84 depending on the result. S88 is a process that obtains the character string score according to equation (38), and S84 is a process that sets the character string score to 0.0 according to equation (39).

【０１４３】FIG. 13 shows the flow of the process for obtaining df₁, the number of documents in which the character string a appears, and df₂, the number of documents in which the character string a appears two or more times. Like the process shown in FIG. 9, this process uses the document count hash table to shorten the processing time for the same character string, but it differs in that df₂ is registered together with the character string a and df₁. First, in S91, it is determined whether the character string a is registered in the document count hash table; if it is, the registered df₁ and df₂ are retrieved in S92. If it is not registered, df₁ and df₂ are calculated in S93 to S99. S93, S94, S96, and S97 in FIG. 13 are the same as S63, S64, S66, and S67 in FIG. 9, respectively, so their description is omitted. S95 in FIG. 13 is a process that sets df₁ and df₂ to 0 when the character string a does not appear in any document. The range from min to max found in S93 and S96 covers the character strings in the suffix file that match the character string a. S98 refers to the document numbers attached to these character strings and counts the document numbers that occur two or more times. This count is df₂, the number of documents in which the character string a appears two or more times. In S99, the character string a and the document counts df₁ and df₂ are registered in the document count hash table. S100 is a process that returns df₁ and df₂ as the number of documents in which the character string a appears and the number of documents in which it appears two or more times, respectively.
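The counting in S97 and S98 can be illustrated with a short sketch. It assumes that the document numbers attached to the suffixes in the range from min to max have already been collected into a list; the function name `document_counts` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def document_counts(doc_numbers: Iterable[int]) -> Tuple[int, int]:
    """Return (df1, df2) from the document numbers found in the range min..max."""
    per_doc = Counter(doc_numbers)        # occurrences of a in each document
    df1 = len(per_doc)                    # S97: distinct document numbers
    df2 = sum(1 for c in per_doc.values() if c >= 2)  # S98: numbers seen twice or more
    return df1, df2                       # an empty range gives (0, 0), as in S95
```

For the document numbers 0, 0, 1, 2, 2, 2 the character string appears in three documents, two of which contain it at least twice, so the result is (3, 2).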

【０１４４】(Sixth Embodiment)

【０１４５】Next, an example in which document retrieval using the similarity SIM_3a is implemented in software will be described. The computer system used to run the software is the one shown in FIG. 5, described above. As in the fifth embodiment, the document retrieval program and related data can be recorded on a recording medium. The flow of the document retrieval program is shown in FIGS. 6, 14, and 8 to 10. As in the fifth embodiment, the flows shown in FIGS. 12 and 13 may be used in place of the flows shown in FIGS. 8 and 9, respectively. This embodiment differs from the one for the string-weighted DP similarity SIM₃ in that the flow shown in FIG. 14 is used in place of the flow shown in FIG. 7. The common description is therefore omitted, and only the flow of FIG. 14 is described below.

【０１４６】In this embodiment, FIG. 14 shows the flow of the process, performed in S14 of FIG. 6, for calculating the similarity between the character string X and the character string Y. S121 to S142 in FIG. 14 correspond to S21 to S42 in FIG. 7, respectively. The similarity SIM_3a differs from the similarity SIM₃ in that the character string score is calculated after the maximum length of the matching character string has been found. The flow in FIG. 14 differs from the flow in FIG. 7 in this respect. The flow shown in FIG. 14 is described below in order.

【０１４７】First, in S121, all values of the score table score[i][j] for the character strings X and Y are initialized to 0. Next, in S122, i, which indicates the start position for extracting a substring from the character string X, is set to 0. In S123, j, which indicates the start position for extracting a substring from the character string Y, is set to 0. In S124, the score score[i][j] currently under consideration in the score table is stored in the temporary variable current for convenience of processing.

【０１４８】In S125, k, which indicates the length of the matching portion of the character strings X and Y, is set to 0. In S126, it is determined whether i+k, which indicates the position in the character string X being tested for a match, is less than the length x_len of the character string X, and whether the single characters X[i+k,1] and Y[j+k,1] of the character strings X and Y match. If this condition holds, the process proceeds to S131 and then performs S126 again. This determines how far the match extends. If the characters do not match, the process proceeds to S120.

【０１４９】In S120, it is determined whether there is a matching portion. If k > 0, that is, if there is a matching portion, the process proceeds to S127. If there is no matching portion, the process proceeds to S132.

【０１５０】S127 to S130 are processes that incorporate the character string score for the matching portion into the score table. When the process proceeds from S120 to S127, the substring X[i,k+1] and the substring Y[j,k+1] match. First, in S127, the character string score of the substring X[i,k+1] is obtained. This is the process for obtaining Score(ξ) of equation (15), described above with reference to FIGS. 8 to 10. Alternatively, it is the process for calculating Score(ξ) using equations (36) to (39), which is implemented by the flows shown in FIGS. 12, 13, and 10. The Score function given by equation (36) and equations (40) to (41) may also be used. The character string score obtained in this way is denoted tmp_score. Next, in S128, tmp_score is added to the score currently under consideration (current), and the result is stored in the temporary variable value. S129 and S130 are the comparison and the write that make equation (32) hold. The process then proceeds to S132.

【０１５１】S132 and S133, S134 and S135, and S136 and S137 are the comparisons and writes that make equations (33), (34), and (35) hold, respectively.

【０１５２】In S138, it is determined whether j, which indicates the start position for extracting a substring from the character string Y, is less than the length y_len of the character string Y. If this condition holds, j is incremented by 1 in S139 and the processes from S124 to S137 are repeated. When processing has reached the end of the character string Y, the process proceeds from S138 to S140.

【０１５３】In S140, it is determined whether i, which indicates the start position for extracting a substring from the character string X, is less than the length x_len of the character string X. If this condition holds, i is incremented by 1 in S141 and the processes from S123 to S138 are repeated. When processing has reached the end of the character string X, the process proceeds from S140 to S142.

【０１５４】Once processing has reached the end of the character string X, the score table is complete. In S142, score[x_len][y_len] is therefore returned as the similarity between the character string X and the character string Y.
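The loop structure of FIG. 14 can be sketched as follows. Equations (32) to (35) are not reproduced in this part of the description, so the sketch assumes that each of them keeps the larger of the stored value and a candidate value: equation (32) for the cell reached by skipping the matched run, and equations (33) to (35) for the three neighboring cells. The sketch also scores the run of k matched characters as a whole and checks the end of Y as well as the end of X. The function name `sim_3a` and the `score_fn` parameter (standing in for the Score function) are hypothetical.

```python
from typing import Callable, List


def sim_3a(x: str, y: str, score_fn: Callable[[str], float]) -> float:
    x_len, y_len = len(x), len(y)
    # S121: score table with one extra row and column, all zeros
    score: List[List[float]] = [[0.0] * (y_len + 1) for _ in range(x_len + 1)]
    for i in range(x_len + 1):                 # S122, S140, S141
        for j in range(y_len + 1):             # S123, S138, S139
            current = score[i][j]              # S124
            k = 0                              # S125
            # S126: extend the match one character at a time
            while i + k < x_len and j + k < y_len and x[i + k] == y[j + k]:
                k += 1
            if k > 0:                          # S120: is there a matching portion?
                value = current + score_fn(x[i:i + k])    # S127, S128
                if value > score[i + k][j + k]:           # S129 (assumed eq. (32))
                    score[i + k][j + k] = value           # S130
            # S132 to S137 (assumed eqs. (33)-(35)): carry current to the neighbors
            for di, dj in ((1, 0), (0, 1), (1, 1)):
                ni, nj = i + di, j + dj
                if ni <= x_len and nj <= y_len and current > score[ni][nj]:
                    score[ni][nj] = current
    return score[x_len][y_len]                 # S142
```

With `len` as a toy scoring function, identical strings score their full length, and "abcd" against "abxcd" scores 4 because the runs "ab" and "cd" are both counted while the unmatched "x" is skipped.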

【０１５５】(Seventh Embodiment)

【０１５６】Next, an example in which document retrieval using SIM₄, the string-weighted DP similarity with synonym dictionary information, is implemented in software will be described. The computer system used to run the software is the one shown in FIG. 5, described above. As in the fifth embodiment, the document retrieval program and related data can be recorded on a recording medium. The flow of the document retrieval program is shown in FIGS. 15 to 18 and FIGS. 8 to 10. That is, the score calculation shown in FIGS. 8 to 10 is shared with the string-weighted DP similarity SIM₃ and the similarity SIM_3a.

【０１５７】FIG. 15 shows the flow of the process for searching the document database based on a search text, selecting documents with high similarity, and outputting them. S211 to S218 in FIG. 15 are equivalent to S11 to S18 in FIG. 6, respectively.

【０１５８】In S210 of FIG. 15, information on synonyms is read from a recording medium such as a CD-ROM, and a synonym dictionary is created. FIG. 16 shows the flow of the process for creating this synonym dictionary. In the synonym dictionary, each pair consisting of a synonym key, which serves as the dictionary headword, and a set of synonyms is registered in a hash table (hereinafter, the synonym hash table). Looking up synonyms in the dictionary is carried out by retrieving all synonyms that correspond to a synonym key. On a recording medium such as a floppy disk or CD-ROM, however, the synonyms corresponding to a synonym key are not necessarily all recorded together in one place, so that the synonym dictionary can be edited and extended easily. The embodiment described here assumes that the recording medium holds pairs consisting of one synonym key and one synonym.

【０１５９】First, in S221, a pair consisting of a synonym key and a synonym is read from the recording medium. Next, in S222, it is determined whether the synonym key that was read is already registered. If it is not yet registered, the synonym key and the synonym are registered in the synonym hash table in S223. If the synonym key is already registered, then in S224 the entry in the synonym hash table is extended so that a delimiter such as "," (comma) or "\t" (tab) and the synonym to be registered are appended after the synonyms already stored for that key. A single synonym registered in this way, or several synonyms separated by delimiters, is hereinafter called a synonym string.

【０１６０】After S223 or S224, it is determined in S225 whether any synonyms remain to be registered. If any remain, the processes from S221 to S224 are repeated to register them. Needless to say, when there are several recording media (for example, when a CD-ROM holding basic synonyms is used together with a floppy disk holding synonyms for a specific field of use), the above processing is performed for all of the recording media. When all synonyms have been registered in this way, creation of the synonym dictionary ends.
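The registration loop of FIG. 16 can be sketched as below, assuming that the (synonym key, synonym) pairs have already been read from the recording medium into a list and that the synonym hash table is a dictionary whose values are delimiter-separated synonym strings. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_synonym_table(pairs: Iterable[Tuple[str, str]],
                        delimiter: str = ",") -> Dict[str, str]:
    table: Dict[str, str] = {}
    for key, synonym in pairs:                 # S221: read one pair (S225: repeat)
        if key not in table:                   # S222: key already registered?
            table[key] = synonym               # S223: first synonym for this key
        else:
            table[key] += delimiter + synonym  # S224: append after a delimiter
    return table


def synonyms_of(table: Dict[str, str], key: str, delimiter: str = ",") -> List[str]:
    """Split the synonym string of a key back into individual synonyms."""
    return table[key].split(delimiter) if key in table else []
```

Because pairs for the same key are merged as they arrive, the pairs may come in any order and from several media, which matches the stated aim of easy editing and extension.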

【０１６１】Next, in S211 of FIG. 15, in preparation for efficiently calculating the number of occurrences of a character string, all documents in the document database are combined to create the suffix file described above. Then, in S212, the search text is read into the character string X. In S213, one document is selected from the document database and read into the character string Y. Next, in S214, the similarity SIM₄(X,Y) between the character string X and the character string Y is calculated. The process performed in S214 is described later with reference to FIGS. 17 and 18. In S215, the similarity obtained and the document number are registered as a pair in the document management table. In S216, it is determined whether the similarity has been calculated for all documents in the document database. If not, a document for which the calculation has not yet been performed is selected in S213 and read into the character string Y, and the processes up to S215 are repeated. If the calculation has been performed for all documents, the registered table is sorted in descending order of similarity in S217. In S218, one or more documents are output, starting from the document with the highest similarity.
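The outer retrieval loop of FIG. 15 (S212 to S218) can be sketched as follows. The similarity function is passed in as a parameter; the `toy_similarity` used in the example simply counts shared characters and is a hypothetical stand-in, not SIM₄. The names `search` and `top_n` are also assumptions.

```python
from typing import Callable, List, Tuple


def search(query: str,
           documents: List[str],
           similarity: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[float, int]]:
    table: List[Tuple[float, int]] = []                   # document management table
    for doc_no, text in enumerate(documents):             # S213, S216: every document
        table.append((similarity(query, text), doc_no))   # S214, S215
    table.sort(key=lambda row: row[0], reverse=True)      # S217: highest similarity first
    return table[:top_n]                                  # S218: output the best matches


def toy_similarity(x: str, y: str) -> float:
    """Hypothetical stand-in: number of distinct characters shared by x and y."""
    return float(len(set(x) & set(y)))


ranking = search("apple", ["apple pie", "banana", "grape"], toy_similarity)
```

Here `ranking` lists document 0 first, then document 2, then document 1, each paired with its similarity value.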

【０１６２】The similarity is obtained by summing the synonym scores SynonymScore(T) given by equation (30). The intermediate values of this summation are also called scores below. As can be seen from the function MAX in equations (26) and (29), scores must be calculated both for the case where a synonym score is added and for substrings whose length is varied one character at a time, and the maximum of these must be selected. The score table described above with reference to FIG. 11 is therefore used. In this embodiment, once the score table is complete, the lower-right entry score[x_len][y_len] is the similarity SIM₄(X,Y).

【０１６３】FIGS. 17 and 18 show the flow of the process for obtaining the similarity between the character string X, into which the search text has been read, and the character string Y, into which a document from the document database has been read. First, in S231 of FIG. 17, all values of the score table score[i][j] for the character strings X and Y are initialized to 0. Next, in S232, i, which indicates the start position for extracting a substring from the character string X, is set to 0. In S233, m, which indicates the length of the substring extracted from the character string X, is set to 1.

【０１６４】In S234, it is determined whether the substring X[i,m] of the character string X exists among the synonym keys of the synonym hash table. If it does not, the process proceeds to S236. If it does, the synonym string A corresponding to X[i,m] is retrieved from the synonym hash table in S235.

【０１６５】Next, in S236, j, which indicates the start position for extracting a substring from the character string Y, is set to 0. In S237, the score score[i][j] currently under consideration in the score table is stored in the temporary variable current for convenience of processing.

【０１６６】In S238, as in S234, it is determined whether the substring X[i,m] of the character string X exists among the synonym keys of the synonym hash table. If it does not, there is no need to consider synonym scores, so the process proceeds to S251 in FIG. 18. If it does, S239 to S250 in FIG. 17 search for the synonym score that gives the maximum. This corresponds to equation (27).

【０１６７】In S239, one synonym a is taken from the synonym string A registered in the synonym hash table. S240 to S243 check whether it matches a substring Y[j,n] of the character string Y. In S240, n, which indicates the length of the substring extracted from the character string Y, is set to 1. Then, in S241, it is determined whether the substring Y[j,n] is equal to the synonym a. If it is not, S242 determines whether j+n < y_len, that is, whether the end of the substring has not yet reached the end of the character string Y. If this condition holds, n is incremented by 1 in S243 to lengthen the substring, and the determination in S241 is performed again. If the condition does not hold, the synonym a does not match any substring Y[j,n], so the process proceeds to S250.

【０１６８】If the substring Y[j,n] is equal to the synonym a in S241, the synonym score of the synonym a is obtained in S244. This is the process for obtaining SynonymScore of equation (30). As can be seen from the fact that equations (15) and (30) have the same form, the character string score obtained for the synonym a is its synonym score. The synonym score is therefore obtained by the process described above with reference to FIGS. 8 to 10.
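The search in S240 to S243 for a substring of Y that equals the synonym a can be sketched as follows. The function name `match_synonym` is hypothetical, and the synonym is assumed to be a non-empty string.

```python
from typing import Optional


def match_synonym(y: str, j: int, a: str) -> Optional[int]:
    """Return the length n with y[j:j+n] == a, or None if a does not occur at j."""
    n = 1                          # S240: start with a one-character substring
    while True:
        if y[j:j + n] == a:        # S241: does Y[j,n] equal the synonym?
            return n               # matched: the caller continues with S244
        if j + n < len(y):         # S242: room left before the end of Y?
            n += 1                 # S243: lengthen the substring and retry
        else:
            return None            # no match: the caller continues with S250
```

For instance, starting at position 4 of "the automobile", the synonym "automobile" is found with n = 10, whereas starting at position 4 of "the car" no length matches and the result is None.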

【０１６９】In S244 of FIG. 17, the synonym score of the synonym a is stored as tmp_score. In S245, it is determined whether tmp_score is 0. If it is 0, this synonym a is of no use for retrieval. To avoid the cost of computing its tmp_score again later, the synonym is therefore deleted from the synonym hash table in S246. If tmp_score is not 0, it is added in S247 to the score currently under consideration (current), and the result is stored in the temporary variable value. S248 and S249 are the comparison and the write that make equation (32) hold.

【０１７０】In S250, it is determined whether all synonyms in the synonym string A have been taken out. If not, the processes from S239 to S249 are repeated for the remaining synonyms. If all synonyms have been taken out, the process proceeds to S251 in FIG. 18.

【０１７１】S251 and S252, S253 and S254, and S255 and S256 are the comparisons and writes that make equations (33), (34), and (35) hold, respectively.

【０１７２】In S257, it is determined whether j, which indicates the start position for extracting a substring from the character string Y, is less than the length y_len of the character string Y. If this condition holds, j is incremented by 1 in S258 and the processes from S237 to S256 are repeated.

【０１７３】In S259, m, the number of characters in the substring extracted from the character string X, is incremented by 1. Then, in S260, it is determined whether m is less than MAX_JA and i+m is less than the length x_len of the character string X. If this condition holds, the processes from S234 to S259 are repeated. Here, MAX_JA denotes the maximum number of characters of a meaningful character string. Synonyms consisting of very long character strings are not listed in the synonym dictionary. The limit MAX_JA is therefore provided to avoid wasted lookups of the synonym dictionary with such long character strings. In this embodiment, which handles Japanese, MAX_JA is set to 20. Setting this value to the length of the longest synonym key or synonym registered in the synonym hash table is more preferable because it eliminates the waste.
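The effect of the MAX_JA limit can be illustrated with a short sketch. It assumes the synonym hash table is a dictionary mapping each synonym key to a delimiter-separated synonym string; the function names are hypothetical. The first function derives the tighter limit suggested above from the table contents, and the second lists the substring lengths m that the loop of S233, S259, and S260 would try at a given start position i.

```python
from typing import Dict, List


def longest_entry_length(table: Dict[str, str], delimiter: str = ",") -> int:
    """Length of the longest synonym key or synonym registered in the table."""
    lengths = [len(key) for key in table]
    lengths += [len(s) for row in table.values() for s in row.split(delimiter)]
    return max(lengths, default=0)


def candidate_lengths(x_len: int, i: int, max_ja: int = 20) -> List[int]:
    """Substring lengths m tried for start position i."""
    m = 1                                   # S233: start with one character
    lengths = [m]
    while True:
        m += 1                              # S259
        if m < max_ja and i + m < x_len:    # S260: both limits still respected?
            lengths.append(m)
        else:
            return lengths
```

With a search text of 100 characters and the default limit of 20, at most 19 lengths are tried per start position; deriving the limit from the table shortens this further when the registered entries are short.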

【０１７４】If the condition in S260 does not hold, the process proceeds to S261. In S261, it is determined whether i, which indicates the start position for extracting a substring from the character string X, is less than the length x_len of the character string X. If this condition holds, i is incremented by 1 in S262 and the processes from S233 to S260 are repeated. If it does not hold, the score table is complete, so in S263 score[x_len][y_len] is returned as the similarity between the character string X and the character string Y.

【０１７５】(Eighth Embodiment)

【０１７６】As described for the fourth embodiment, if a bilingual dictionary is used as the synonym dictionary of the seventh embodiment, the similarity between documents in different languages can be obtained. That is, an information retrieval apparatus that performs cross-language information retrieval in software can be implemented.

【０１７７】As can be seen from the above description of the embodiments, the character string similarity calculation method of the first invention is implemented in the first, second, fifth, and sixth embodiments.

【０１７８】The character string similarity calculation method of the second invention is implemented in the first, second, fifth, and sixth embodiments by using a Score function that includes the case shown in equation (16).

【０１７９】The character string similarity calculation method of the third invention is implemented in the first, second, fifth, and sixth embodiments by using the Score function shown in equation (15).

【０１８０】The character string similarity calculation method of the fourth invention is implemented in the first, second, fifth, and sixth embodiments by using the Score function shown in equation (38) according to the occurrence concentration shown in equation (36), or by using the Score function shown in equation (40).

【０１８１】The character string similarity calculation method of the fifth invention is implemented in the second and sixth embodiments by using equation (21).

【０１８２】The character string similarity calculation method of the sixth invention is implemented in the third, fourth, seventh, and eighth embodiments.

【０１８３】The character string similarity calculation method of the seventh invention is implemented in the third, fourth, seventh, and eighth embodiments by using the SynonymScore function shown in equation (30).

【０１８４】The character string similarity calculation method of the eighth invention is implemented in the fourth and eighth embodiments by using a bilingual dictionary as the synonym dictionary.

【０１８５】第９の発明である文字列類似度算出方法
は、第１実施例、及び、第５実施例が、式(11)を用いる
ことで実施されている。また、第３実施例、第４実施
例、第７実施例、及び、第８実施例が、式(27)を用いる
ことでも実施されている。The character string similarity calculation method according to the ninth aspect of the present invention is implemented in the first embodiment and the fifth embodiment by using equation (11). Further, the third, fourth, seventh, and eighth embodiments are also implemented by using equation (27).

[0186] The character string similarity calculation apparatus according to the tenth invention is implemented in the first, second, fifth, and sixth embodiments. In the fifth embodiment, the character string score calculation unit is implemented by S27 shown in FIG. 7. The matching character string similarity calculation unit is implemented by S28. The arbitrary character string similarity calculation unit is implemented by repeating the processing from S32 to S37 while sequentially incrementing the values of i and j. The selection unit is implemented by performing both S29 and S30, and S32 to S37. In the sixth embodiment, the character string score calculation unit is implemented by S127 shown in FIG. 14. The matching character string similarity calculation unit is implemented by S128. The arbitrary character string similarity calculation unit is implemented by repeating the processing from S132 to S137 while sequentially incrementing the values of i and j. The selection unit is implemented by performing both S129 and S130, and S132 to S137.

[0187] The character string similarity calculation apparatus according to the eleventh invention is implemented by the first, second, fifth, and sixth embodiments using the Score function shown in Expression (15).

[0188] The character string similarity calculation apparatus according to the twelfth invention is implemented by the first, second, fifth, and sixth embodiments using, depending on the appearance concentration shown in Expression (36), either the Score function shown in Expression (38) or the Score function shown in Expression (40).

[0189] The character string similarity calculation apparatus according to the thirteenth invention is implemented in the third, fourth, seventh, and eighth embodiments. In the seventh embodiment, the synonym score calculation unit is implemented by S244 shown in FIG. 17. The matching character string similarity calculation unit is implemented by S247. The arbitrary character string similarity calculation unit is implemented by repeating the processing from S251 to S256 shown in FIG. 18 while sequentially incrementing the values of i and j. The selection unit is implemented by performing both S248 and S249 shown in FIG. 17, and S251 to S256 shown in FIG. 18.

[0190] The character string similarity calculation apparatus according to the fourteenth invention is implemented by the third, fourth, seventh, and eighth embodiments using the SynonymScore function shown in Expression (30).

[0191] The character string similarity calculation apparatus according to the fifteenth invention is implemented in the first through eighth embodiments.

[0192] The character string similarity calculation program according to the sixteenth invention is implemented in the fifth and sixth embodiments. In the fifth embodiment, the comparison setting step is implemented by sequentially incrementing i and j at S41 and S39 shown in FIG. 7. The common partial character string identification step is implemented by S25, S26, and S31. The character string score setting step is implemented by S27. The character string score adding step is implemented by S28. The progress processing step is implemented by both S29 and S30, and S32 to S37. In the sixth embodiment, the comparison setting step is implemented by sequentially incrementing i and j at S141 and S139 shown in FIG. 14. The common partial character string identification step is implemented by S125, S126, and S131. The character string score setting step is implemented by S127. The character string score adding step is implemented by S128. The progress processing step is implemented by repeating the above steps while performing both S129 and S130, and S132 to S137.

[0193] The character string similarity calculation program according to the seventeenth invention is implemented by the fifth and sixth embodiments performing S56 shown in FIG. 8 in order to determine the character string score.

[0194] The character string similarity calculation program according to the eighteenth invention is implemented by the fifth and sixth embodiments performing S88 in accordance with the determination at S87 shown in FIG. 12 in order to determine the character string score.

[0195] The character string similarity calculation program according to the nineteenth invention is implemented by the fifth and sixth embodiments executing the processing shown in FIGS. 9 and 10.

[0196] The character string similarity calculation program according to the twentieth invention is implemented in the seventh and eighth embodiments. In each of these embodiments, the comparison setting step is implemented by sequentially incrementing i and j at S258 and S262 shown in FIG. 18. The synonym identification step is implemented by S234 and S241 shown in FIG. 17. The synonym score setting step is implemented by S244. The synonym score adding step is implemented by S247. The progress processing step is implemented by repeating the above steps while performing both S248 and S249, and S251 to S256 shown in FIG. 18.

[0197] The character string similarity calculation program according to the twenty-first invention is implemented by the seventh and eighth embodiments performing S56 shown in FIG. 8 in order to obtain the synonym score.

[0198] The character string similarity calculation program according to the twenty-second invention is implemented by the seventh and eighth embodiments executing the processing shown in FIGS. 9 and 10.

[0199] The character string similarity calculation program according to the twenty-third invention is implemented by the fifth through eighth embodiments using the score table shown in FIG. 11 and performing the comparisons and writes that make Expressions (32) to (35) hold.

[0200] The text search program according to the twenty-fourth invention is implemented in the fifth through eighth embodiments.

[0201] The technical scope of the present invention is not limited to these embodiments. It includes the range equivalent to the claims, and various modifications are possible without departing from the gist of the invention.

[0202]

[Effects of the Invention] According to the first invention, the similarity can be obtained by focusing on partial character strings that conform to the order in each of the two character strings and are common to both. That is, a similarity that takes the order of appearance of character strings into account is obtained.

[0203] According to the second invention, in addition, a longer common partial character string may be given a heavier weight, so the similarity also tends to be higher. Consequently, longer strings are selected as the common partial character strings, and the resulting similarity is a more appropriate value.

[0204] According to the third invention, in addition, a similarity reflecting the information amount can be obtained. This makes it easier to apply findings from information theory concerning information amount.
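As a rough illustration of a weight that reflects an information amount, the sketch below assumes that the weight of a partial character string is -log2(df/N), where df is the number of documents containing the string and N is the total number of documents. This formula, the function name `information_score`, and the sample `docs` list are assumptions made for illustration; the Score function actually used is the one defined by Expression (15).

```python
import math

def information_score(substring, docs):
    """Hypothetical weight: -log2 of the fraction of documents containing substring."""
    df = sum(1 for d in docs if substring in d)
    if df == 0:
        return 0.0  # assumption: a string absent from the database gets no weight
    return -math.log2(df / len(docs))

# Illustrative data only; not taken from the specification.
docs = ["string similarity", "document retrieval", "string search", "similarity search"]
rare = information_score("retrieval", docs)   # appears in 1 of 4 documents
common = information_score("string", docs)    # appears in 2 of 4 documents
```

Under this assumption a string that appears in fewer documents receives a larger weight, which is the property the paragraph above relies on.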

[0205] According to the fourth invention, in addition, a similarity reflecting the appearance concentration can be obtained. By using the appearance concentration, greater weight can be given to character strings that are more useful for retrieval, and retrieval accuracy can be improved.

[0206] According to the fifth invention, in addition, common partial character strings are not deliberately split on a trial basis, so the amount of computation is reduced.

[0207] According to the sixth invention, the similarity can be obtained by focusing on partial character strings that conform to the order in each of the two character strings and are elements of the synonym dictionary. Therefore, the similarity can be obtained appropriately even when synonyms are used.

[0208] According to the seventh invention, in addition, a similarity reflecting the information amount can be obtained. This makes it easier to apply findings from information theory concerning information amount.

[0209] According to the eighth invention, the similarity between character strings in different languages can be obtained.

[0210] According to the ninth invention, in addition, the partial character strings are obtained so that the similarity takes its maximum value. The resulting similarity is therefore reasonable in the sense that it is the maximum value.

[0211] According to the tenth invention, an apparatus can be realized that obtains the similarity by focusing on partial character strings that conform to the order in each of the two character strings and are common to both. That is, this apparatus obtains a similarity that takes the order of appearance of character strings into account.

[0212] According to the eleventh invention, in addition, an apparatus that obtains a similarity reflecting the information amount can be realized. This makes it easier to apply findings from information theory concerning information amount.

[0213] According to the twelfth invention, in addition, an apparatus that obtains a similarity reflecting the appearance concentration can be realized. By using the appearance concentration, greater weight can be given to character strings that are more useful for retrieval, and retrieval accuracy can be improved.

[0214] According to the thirteenth invention, an apparatus can be realized that obtains the similarity by focusing on partial character strings that conform to the order in each of the two character strings and are elements of the synonym dictionary. Therefore, this apparatus can obtain the similarity appropriately even when synonyms are used.

[0215] According to the fourteenth invention, in addition, an apparatus that obtains a similarity reflecting the information amount can be realized. This makes it easier to apply findings from information theory concerning information amount.

[0216] According to the fifteenth invention, in addition, an apparatus can be realized that retrieves, from a document database, documents having a high similarity to a search text.

[0217] According to the sixteenth invention, a program can be realized that obtains the similarity by focusing on partial character strings that conform to the order in each of the two character strings and are common to both. That is, this program obtains a similarity that takes the order of appearance of character strings into account.

[0218] According to the seventeenth invention, in addition, a program that obtains a similarity reflecting the information amount can be realized. This makes it easier to apply findings from information theory concerning information amount.

[0219] According to the eighteenth invention, in addition, a program that obtains a similarity reflecting the appearance concentration can be realized. By using the appearance concentration, greater weight can be given to character strings that are more useful for retrieval, and retrieval accuracy can be improved.

[0220] According to the nineteenth invention, in addition, a program that performs high-speed processing using a suffix file can be realized.
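To illustrate why a sorted suffix structure speeds up counting, the following sketch builds a naive in-memory list of sorted suffixes and counts the occurrences of a pattern with two binary searches. It is a stand-in for the idea only: the function names are hypothetical, the layout of the suffix file used in FIGS. 9 and 10 is not reproduced, and a practical suffix file would store positions rather than copies of each suffix.

```python
from bisect import bisect_left, bisect_right

def build_suffixes(text):
    """Return all suffixes of text in sorted order (naive stand-in for a suffix file)."""
    return sorted(text[i:] for i in range(len(text)))

def count_occurrences(suffixes, pattern):
    """Count suffixes that begin with pattern using two binary searches.

    Assumes the text does not itself contain U+10FFFF, which is used
    here as an upper sentinel for the range of matching suffixes.
    """
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\U0010ffff")
    return hi - lo

suffixes = build_suffixes("abracadabra")
occurrences = count_occurrences(suffixes, "abra")
```

Because all suffixes that start with the same pattern are adjacent after sorting, each count costs a logarithmic number of comparisons instead of a scan over the whole text.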

[0221] According to the twentieth invention, a program can be realized that obtains the similarity by focusing on partial character strings that conform to the order in each of the two character strings and are elements of the synonym dictionary. Therefore, this program can obtain the similarity appropriately even when synonyms are used.

[0222] According to the twenty-first invention, in addition, a program that obtains a similarity reflecting the information amount can be realized. This makes it easier to apply findings from information theory concerning information amount.

[0223] According to the twenty-second invention, in addition, a program that performs high-speed processing using a suffix file can be realized.

[0224] According to the twenty-third invention, in addition, a program that performs high-speed processing by a dynamic programming technique can be realized.
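As a minimal sketch of how such a dynamic programming computation might look, the following function fills a table from the ends of the two strings toward their beginnings. At each pair of positions it takes the largest of three candidates: skipping one character of the first string, skipping one character of the second string, or consuming a common partial character string and adding its weight. This is not the flow of FIG. 7 or the exact form of Expressions (32) to (35); the names `similarity` and `square_score`, the table layout, and the sample weight are assumptions made for illustration.

```python
def square_score(s):
    """Hypothetical weight: longer common strings weigh more than their parts."""
    return len(s) ** 2

def similarity(a, b, score=square_score):
    """Sum of weights of order-preserving common substrings, maximized by DP."""
    n, m = len(a), len(b)
    # table[i][j] holds the best similarity between a[i:] and b[j:].
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            # Shift the correspondence by dropping one character on either side.
            best = max(table[i + 1][j], table[i][j + 1])
            # Try every common substring starting at positions i and j.
            k = 1
            while i + k <= n and j + k <= m and a[i:i + k] == b[j:j + k]:
                best = max(best, score(a[i:i + k]) + table[i + k][j + k])
                k += 1
            table[i][j] = best
    return table[0][0]

example = similarity("abcde", "abxde")
```

With the sample weight, the strings "abcde" and "abxde" are scored through the two common strings "ab" and "de" rather than through four single characters, which reflects the preference for longer common partial character strings.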

[0225] According to the twenty-fourth invention, in addition, an apparatus can be realized that retrieves, from a document database, documents having a high similarity to a search text.

[0226] Next, the retrieval performance of the present invention is described. To compare the prior art with the closest embodiment of the present invention, FIG. 19 shows the retrieval performance of the character-weighted DP similarity SIM₂ and the character-string-weighted DP similarity SIM₃. So that the comparison uses a uniform weighting, the Score function that gives the weights for both similarities was the one obtained by the flow shown in FIG. 8. Needless to say, when the character-weighted DP similarity SIM₂ is obtained, the character string a supplied at S51 shown in FIG. 8 is a character string of length 1 (a single character).

[0227] As the search text is varied, the number of relevant documents in the document database (documents that should be correctly retrieved) also varies. This number is shown on the horizontal axis. The vertical axis shows precision and recall. Precision is the proportion of relevant documents among the retrieved documents. If only relevant documents are retrieved, with no irrelevant documents mixed in, precision is 1. Recall is the proportion of retrieved relevant documents among all relevant documents in the document database. If every relevant document is retrieved from the document database without omission, recall is 1.
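The two measures defined above can be written out as a short sketch. The function name `precision_recall` and the use of document identifiers in Python sets are assumptions for illustration; the sample identifiers are not data from FIG. 19.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant documents exist, two overlap.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved, so precision and recall differ even though the number of hits is the same.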

[0228] Compared with the precision 201 of the prior-art similarity SIM₂, the precision 202 of the similarity SIM₃ according to the present invention is greatly improved. Likewise, compared with the recall 203 of the prior-art similarity SIM₂, the recall 204 of SIM₃ according to the present invention is also greatly improved. For example, when the number of relevant documents is 400, both precision and recall are about four times higher. In this way, the present invention can dramatically improve retrieval performance.

[Brief description of the drawings]

FIG. 1 is a diagram showing an embodiment of a document search device according to the present invention.

FIG. 2 is a diagram showing another embodiment of the document search device according to the present invention.

FIG. 3 is a diagram showing another embodiment of the document search device according to the present invention.

FIG. 4 is a diagram showing another embodiment of the document search device according to the present invention.

FIG. 5 is a diagram showing an embodiment of a computer system that performs text search according to the present invention.

FIG. 6 is a flowchart of text search processing according to an embodiment of the present invention.

FIG. 7 is a flowchart of processing for obtaining a similarity according to an embodiment of the present invention.

FIG. 8 is a flowchart of processing for obtaining a character string score or a synonym score.

FIG. 9 is a flowchart of processing for obtaining the number of documents.

FIG. 10 is a flowchart of processing for obtaining the number of occurrences.

FIG. 11 is a diagram showing a score table.

FIG. 12 is a flowchart of processing for obtaining a character string score.

FIG. 13 is a flowchart of processing for obtaining the number of documents.

FIG. 14 is a flowchart of processing for obtaining a similarity according to another embodiment of the present invention.

FIG. 15 is a flowchart of text search processing according to another embodiment of the present invention.

FIG. 16 is a flowchart of processing for creating a synonym dictionary.

FIG. 17 is the first half of a flowchart of processing for obtaining a similarity according to another embodiment of the present invention.

FIG. 18 is the second half of the flowchart, continuing from FIG. 17.

FIG. 19 is a diagram showing the retrieval performance of the prior art and of the present invention.

[Explanation of symbols]

10: Document database
10a, 10b, 10c: Documents
11: Character string input unit
12: Search control unit
13: Search result output unit
14, 16, 18, 43, 46, 51, 52, 53, 55, 56, 57, 73, 81: Similarity calculation units
15, 17, 19: Recursive execution control units
20: Synonym dictionary
21, 24: Matching character string similarity calculation units
22, 25, 29: Arbitrary character string similarity calculation units
23, 26, 30, 33, 54, 58, 63, 84: Maximum value selection units
28: Synonym similarity calculation unit
31: Character string separation control unit
32: Character string separation similarity calculation unit
34: Matching character string determination unit
42, 45: Character string score calculation units
41: Character string separation unit
44, 47, 74: Addition units
61: Synonym separation control unit
62: Synonym separation similarity calculation unit
71: Synonym separation unit
72: Synonym score calculation unit
91: Bilingual dictionary
101: Display
102: Printer
103: Keyboard
104: Floppy disk device
105: CD-ROM device
106: Read-only memory (ROM)
107: Random access memory (RAM)
108: Magnetic disk device
109: Central processing unit (CPU)
110: Communication interface
111: Bus
112: Floppy disk
113: CD-ROM
114: Communication network
201: Precision with SIM₂
202: Precision with SIM₃
203: Recall with SIM₂
204: Recall with SIM₃

Claims

[Claims]

1. A character string similarity calculation method for calculating a similarity between two character strings, comprising: obtaining a plurality of partial character strings that conform to the order in each of the two character strings and are common to the two character strings; determining a weight for each of the obtained partial character strings; and calculating the similarity by summing the weights.

2. The character string similarity calculation method according to claim 1, wherein the weight of a partial character string may be heavier than the sum of the weights of the partial character strings obtained by dividing that partial character string into two or more.

3. The character string similarity calculation method according to claim 1, wherein one of the two character strings is selected from a document database, and the weight corresponds to the information amount of the partial character string in the document database.

4. The character string similarity calculation method according to claim 3, wherein the weight corresponds to the information amount of the partial character string in the document database and to the appearance concentration of the partial character string.

5. The character string similarity calculation method according to any one of claims 1 to 4, wherein the plurality of partial character strings are obtained, without dividing any partial character string, so that the similarity becomes highest.

6. A character string similarity calculation method for calculating a similarity between two character strings, comprising: obtaining a plurality of partial character strings that conform to the order in each of the two character strings and are included in elements of a synonym dictionary; determining weights for the elements of the synonym dictionary corresponding to the obtained partial character strings; and calculating the similarity by summing the weights.

7. The character string similarity calculation method according to claim 6, wherein one of the two character strings is selected from a document database, and the weight corresponds to the information amount of the element of the synonym dictionary in the document database.

8. The character string similarity calculation method according to claim 6 or 7, wherein the two character strings are expressed in different languages, and the elements of the synonym dictionary include synonyms in the different languages.

9. The character string similarity calculation method according to any one of claims 1 to 4 and 6 to 8, wherein the plurality of partial character strings are obtained, with division of partial character strings permitted, so that the similarity becomes highest.

10. A character string similarity calculation apparatus for calculating a similarity between two character strings, comprising: a character string score calculation unit that determines the weight of a partial character string common to the two character strings; a matching character string similarity calculation unit that obtains a similarity by adding the similarity of the remaining partial character strings to the weight; an arbitrary character string similarity calculation unit that obtains the highest similarity among the similarities of the character strings obtained by shortening one or both of the two character strings one character at a time; and a selection unit that selects the highest similarity among the obtained similarities.

11. The character string similarity calculation apparatus according to claim 10, wherein one of the two character strings is selected from a document database, and the weight corresponds to the information amount of the partial character string in the document database.

12. The character string similarity calculation apparatus according to claim 11, wherein the weight corresponds to the information amount of the partial character string in the document database and to the appearance concentration of the partial character string.

13. A character string similarity calculation apparatus for calculating a similarity between two character strings, comprising: a synonym score calculation unit that determines the weight of an element of a synonym dictionary corresponding to a partial character string included in that element; a matching character string similarity calculation unit that obtains a similarity by adding the similarity of the remaining partial character strings to the weight; an arbitrary character string similarity calculation unit that obtains the highest similarity among the similarities of the character strings obtained by shortening one or both of the two character strings one character at a time; and a selection unit that selects the highest similarity among the obtained similarities.

14. The character string similarity calculation apparatus according to claim 13, wherein one of the two character strings is selected from a document database, and the weight corresponds to the information amount of the element of the synonym dictionary in the document database.

15. A text search apparatus for selecting a document similar to a search text from a document database, wherein a similarity is obtained by the character string similarity calculation apparatus according to claim 10 with the search text and a document in the document database taken as the two character strings, and a document having a high calculated similarity is selected from the document database.

16. A character string similarity calculation program for calculating a similarity between two character strings, comprising: a comparison setting step of sequentially setting a part to be compared with the two character strings; and a part starting from the part to be compared. A character string, a common partial character string specifying step of specifying a partial character string common to the two character strings; a character string score setting step of determining a weight of the specified partial character string; A computer-readable recording medium that records a character string similarity calculation program for causing a computer to execute a character string score adding step of adding a character string score to a computer, and a progress processing step of performing these steps to increase similarity.

17. are those where one of the two strings is selected from the document database, according to claim 16, wherein the weights, characterized in that corresponding to the information amount of the partial strings in the document database A computer-readable recording medium on which a character string similarity calculation program is recorded.

18. The method according to claim 18, wherein the weight is stored in the document database.
Of information on substrings in substrings and appearances of substrings
The sentence according to claim 17, characterized in that it corresponds to medium.
Computer reading recorded character string similarity calculation program
A removable recording medium.

19. The character string similarity calculation program according to claim 17 , wherein the character string score setting step obtains the information amount by using a suffix file. A computer-readable recording medium on which a program is recorded.

20. A computer-readable recording medium on which is recorded a character string similarity calculation program for calculating a similarity between two character strings, the program causing a computer to execute: a comparison setting step of sequentially setting the positions to be compared in the two character strings; a synonym specifying step of specifying a partial character string that starts from the position to be compared and is contained in an element of a synonym dictionary; a synonym score setting step of determining a weight of the element of the synonym dictionary corresponding to the specified partial character string; a synonym score adding step of adding the weight to the similarity; and a progress processing step of advancing these steps so as to increase the similarity.

21. The computer-readable recording medium on which the character string similarity calculation program according to claim 20 is recorded, wherein one of the two character strings is selected from a document database, and the weight corresponds to the information amount, in the document database, of the element of the synonym dictionary.

22. The computer-readable recording medium on which the character string similarity calculation program according to claim 21 is recorded, wherein the synonym score setting step obtains the information amount by using a suffix file.

23. The computer-readable recording medium on which the character string similarity calculation program according to claim 16 is recorded, wherein the progress processing step is performed by a dynamic programming method.

24. A computer-readable recording medium on which is recorded a document search program for selecting a document similar to a search sentence from a document database, the program causing a computer to calculate a similarity by the character string similarity calculation program according to any one of claims 16 to 23, with the search sentence and a document in the document database taken as the two character strings, and to select a document having a high calculated similarity from the document database.
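
As an illustrative, non-limiting sketch, the steps recited in claims 16, 17, 23 and 24 might be arranged as follows in Python. All names (`information_amount`, `similarity`, `search`) are hypothetical and do not appear in the specification. The sketch assumes that the "information amount" of a partial character string is the negative base-2 logarithm of the fraction of documents containing it, and it counts occurrences by direct substring tests instead of the suffix file of claim 19. The synonym dictionary of claims 20 to 22 is not modeled.

```python
import math


def information_amount(substring, documents):
    """Assumed weight: -log2(fraction of documents containing substring)."""
    df = sum(1 for doc in documents if substring in doc)
    if df == 0:
        return 0.0  # a substring never seen in the database carries no score here
    return -math.log2(df / len(documents))


def similarity(a, b, weight):
    """Largest total weight of common substrings matched in order in a and b.

    table[i][j] holds the similarity of the suffixes a[i:] and b[j:].
    """
    n, m = len(a), len(b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Progress processing by dynamic programming, filled from the string ends.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            # Shift the correspondence: skip one character of either string.
            best = max(table[i + 1][j], table[i][j + 1])
            # Try every common substring that starts at a[i] and b[j].
            k = 0
            while i + k < n and j + k < m and a[i + k] == b[j + k]:
                k += 1
                # Add the substring's weight to the similarity of the remainders.
                best = max(best, weight(a[i:i + k]) + table[i + k][j + k])
            table[i][j] = best
    return table[0][0]


def search(query, documents, top=3):
    """Rank documents by similarity to the query, highest first."""
    def weight(s):
        return information_amount(s, documents)

    scored = [(similarity(query, doc, weight), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top]


if __name__ == "__main__":
    docs = ["abc", "abd", "xyz", "qrs"]
    for score, doc in search("abc", docs):
        print(doc, score)
```

In this sketch the maximum taken at each table cell plays the role of selecting the larger degree of similarity, and `search` returns the documents with the highest calculated similarity. The table has (len(a)+1) × (len(b)+1) cells, so the sketch is suited to short strings only; a practical implementation would cache weights and use an index such as the suffix file of claim 19.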