JP2002297660A - Method, device, program, and recording medium for character similarity calculation

Info

Publication number: JP2002297660A
Application number: JP2002012259A
Authority: JP
Inventors: Kyoji Umemura; 恭司梅村
Original assignee: INFORMATION TECHNOLOGY PROMOTI; INFORMATION-TECHNOLOGY PROMOTION AGENCY JAPAN; Sumitomo Electric Industries Ltd
Current assignee: INFORMATION TECHNOLOGY PROMOTI; INFORMATION-TECHNOLOGY PROMOTION AGENCY JAPAN; Sumitomo Electric Industries Ltd
Priority date: 2001-01-24
Filing date: 2002-01-22
Publication date: 2002-10-11
Anticipated expiration: 2022-01-22
Also published as: JP4065695B2

Abstract

PROBLEM TO BE SOLVED: To speed up document retrieval by selecting a partial character string used for similarity calculation. SOLUTION: An input character string X and a document Y in a document database are regarded as two character strings and their similarity is calculated. Partial character strings cut out of the input character string are sorted according to their appearance frequencies and recorded in a partial character string management table. Then matching information is gathered as to the respective partial character strings in the partial character string management table and recorded in a matching information management table. A list regarding the document Y is taken out of the table and the similarity to the input character string X is calculated. The document number and the similarity are recorded in a pair in a document management table. Those processes are repeated for all documents. Lastly, the document management table is rearranged in the decreasing order of the similarity and a document having high similarity is selected as a retrieval result from the database.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to determining the similarity between two character strings. It is particularly suitable for information retrieval, where it is used to determine the similarity between an input character string and the documents registered in a database.

[0002]

[Prior Art] Information retrieval, in which a desired document is extracted from a document database, is widely performed. In such retrieval, a document is treated as a set of character strings formed by combining words, each consisting of several characters. Retrieval is performed by comparing the search character string with the character strings in the target documents and selecting one or more documents with high similarity. Methods for obtaining this similarity fall broadly into two kinds: methods using morphological analysis, and methods that find matches between partial character strings of length n (hereinafter, n-grams).

[0003] A method using morphological analysis is disclosed, for example, in Gerard Salton and Christopher Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24, pp. 513-523, 1988. The basic procedure for obtaining the similarity between two character strings by this method is as follows. First, both character strings are decomposed into word sequences by morphological analysis using a dictionary and grammatical knowledge. Next, the two word sequences are compared to find the matching words. A weight is then set for each matching word, and the weights of all matching words are summed. The resulting sum is the similarity based on morphological analysis.

[0004] The method using morphological analysis has the inherent problem that retrieval fails when the accuracy of the morphological analysis itself is low. Raising that accuracy requires large word dictionaries and grammar rule sets, which makes information retrieval hard to use casually. Moreover, for documents containing buzzwords, coined words, or technical terms used only in limited fields, maintaining the word dictionary becomes a heavy burden.

[0005] A method using n-grams is disclosed, for example, in Yasushi Ogawa and Toru Matsuda, Overlapping statistical word indexing: A new indexing method for Japanese text, In Proceedings of SIGIR '97, Philadelphia, PA, USA, pp. 226-234, 1997. The basic procedure for obtaining the similarity between character strings by this method is as follows. First, the partial character strings of n characters that are contained in both character strings are found. Next, a weight is set for each of these common partial character strings. The weights of all matching parts are then summed. The resulting sum is the similarity based on n-grams.

[0006] In setting the weight of a common partial character string, a large value is given to a string that appears concentrated in particular documents and does not appear in others. Conversely, a string that appears in many documents is given only a small value. This reflects the fact that strings appearing in many documents do not characterize any one document and so cannot be used effectively in retrieval.
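The prior-art n-gram procedure of paragraphs [0005] and [0006] can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names and the logarithmic weight `log(N / df)` are assumptions chosen only because they give rare n-grams large weights and common n-grams small ones, as the text requires.

```python
import math

def ngrams(text, n=2):
    """Return the set of distinct n-character partial strings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def idf_weight(gram, documents):
    """Hypothetical weight: log(N / df). Large for rare n-grams,
    zero for an n-gram that appears in every document."""
    df = sum(1 for doc in documents if gram in doc)
    return math.log(len(documents) / df) if df else 0.0

def ngram_similarity(query, doc, documents, n=2):
    """Sum the weights of the n-grams common to query and doc."""
    common = ngrams(query, n) & ngrams(doc, n)
    return sum(idf_weight(g, documents) for g in common)

docs = ["abcd", "abxy", "zzzz"]
for d in docs:
    print(d, ngram_similarity("abcd", d, docs))
```

Here the bigram "ab" occurs in two of the three documents and so contributes less than "bc" or "cd", which occur in only one.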

[0007] Because the n-gram method does not require morphological analysis, it can handle unknown words such as new technical terms and is easy to use.

[0008] Among n-gram methods, when the similarity is calculated from character strings of length 2 (hereinafter, bigrams), there is a method that does not use every extracted bigram but restricts the collection of matching information, and hence the similarity calculation, to bigrams that contain no hiragana. This takes into account that bigrams containing hiragana are likely to appear in many documents in the document database and are very unlikely to characterize an individual document. It is generally held that including hiragana-containing bigrams in the comparison only increases the amount of computation without a significant gain in retrieval accuracy.
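The hiragana-based restriction described in paragraph [0008] might look like the sketch below. The function names are hypothetical, and the test for hiragana assumes the Unicode Hiragana block (U+3040 to U+309F); the source does not say how hiragana is detected.

```python
def contains_hiragana(s):
    """True if any character lies in the Unicode Hiragana block (U+3040..U+309F)."""
    return any("\u3040" <= ch <= "\u309f" for ch in s)

def bigrams_without_hiragana(text):
    """Cut out all bigrams in order and drop those containing hiragana."""
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    return [g for g in grams if not contains_hiragana(g)]

print(bigrams_without_hiragana("情報検索の方法"))
```

For the input above, the two bigrams that include the particle の are discarded, and the four kanji-only bigrams remain.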

[0009]

[Problems to Be Solved by the Invention] As described above, the n-gram method has many advantages over the method based on morphological analysis and is widely used in information retrieval. The problem with the n-gram method is that the amount of computation grows as the document database grows, so that it takes longer to obtain search results.

[0010] The partial character strings cut out from a character string are thought to include many that are not effective for the similarity calculation, that is, strings that are contained in many documents in the document database and therefore have small weights and little influence on the similarity. Checking every n-gram cut out from the character string for matches is therefore inefficient in terms of computation time.

[0011] The method that excludes hiragana-containing bigrams from the matching information does reduce computation time, but it uniformly discards even those character strings that would actually be effective for retrieval. As a result, retrieval accuracy is reduced.

[0012] The present invention solves the above problems of the n-gram method and provides a method that both shortens computation time and improves retrieval accuracy.

[0013]

[Means for Solving the Problems] The invention according to claim 1 is a method for calculating the similarity between two character strings, characterized in that, among the partial character strings cut out from a first character string, matching information with a second character string is collected for those partial character strings selected on the basis of their effect on the similarity calculation; the weights of the matched partial character strings are calculated from the matching information; and the similarity is calculated on the basis of those weights. Thus, when matching information between character strings is obtained, the method does not handle every extracted n-gram but selects the n-grams that are effective for the similarity calculation by estimating their effect on it.

[0014] With this configuration, computation time can be shortened while retrieval accuracy is kept almost equal to that of a method that performs no selection.

[0015] The invention according to claim 2 is a method for calculating the similarity between two character strings, characterized in that, among the partial character strings cut out from a first character string, matching information with a second character string is collected for those partial character strings selected on the basis of their effect on the similarity calculation; and the similarity is calculated on the basis of the weights of those partial character strings, among the ones included in the matching information, whose order of appearance is consistent in the first and second character strings. Thus, in calculating the similarity between character strings, the method further selects, from the partial character strings recorded in the matching information, those whose order of appearance is consistent in both strings, and uses their weights for the similarity calculation.

[0016] With this configuration, the partial character strings whose weights are calculated are limited, so the calculation can be made still faster while retrieval accuracy is maintained.

[0017] The invention according to claim 3 is the invention according to claim 1 or 2, characterized in that the partial character strings are selected taking into account the number of times each partial character string cut out from the first character string appears in the second character string.

[0018] With this configuration, the partial character strings that are effective for the similarity calculation can be selected efficiently, and computation time can be shortened further while retrieval accuracy is maintained.

[0019] The invention according to claim 4 is the invention according to any one of claims 1 to 3, characterized in that the matching information between the character strings includes the length of the partial character string, the position at which the partial character string appears in the first character string, the position at which it appears in the second character string, and a sequence number indicating which match it is within the second character string.

[0020] With this configuration, not only the method that calculates the similarity by summing the weights of the matched n-grams but also many other similarity calculation methods can be used together.

[0021] The invention according to claim 5 is a character string similarity calculation apparatus for calculating the similarity between two character strings, comprising: a partial character string selection unit that selects, from the partial character strings cut out from a first character string, partial character strings on the basis of their effect on the similarity calculation; a matching information collection unit that collects matching information with a second character string; and a similarity calculation unit that calculates the weights of the matched partial character strings on the basis of the matching information and calculates the similarity by summing the weights.

[0022] With this configuration, an apparatus can be realized that shortens computation time while keeping retrieval accuracy almost equal to that of a method that performs no selection.

[0023] The invention according to claim 6 is a character string similarity calculation apparatus for calculating the similarity between two character strings, comprising: a partial character string selection unit that selects partial character strings cut out from a first character string on the basis of their effect on the similarity calculation; a matching information collection unit that collects, for the selected partial character strings, matching information with a second character string; conforming partial character string selection means that selects, from the partial character strings included in the matching information, those whose order of appearance is consistent in both character strings; and a similarity calculation unit that sums the weights assigned to the conforming partial character strings.

[0024] With this configuration, an apparatus can be realized in which the partial character strings whose weights are calculated are limited and the calculation is made still faster while retrieval accuracy is maintained.

[0025] The invention according to claim 7 is a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to function as: selection means that selects partial character strings cut out from a first character string on the basis of the number of times they appear in a second character string; matching information collection means that collects, for the selected partial character strings, matching information with the second character string; and similarity calculation means that calculates the weights of the matched partial character strings on the basis of the matching information and calculates the similarity by summing the weights.

[0026] With this configuration, a computer can be made to function as means for shortening computation time while keeping retrieval accuracy almost equal to that of a method that performs no selection.

[0027] The invention according to claim 8 is a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to function as: selection means that selects partial character strings cut out from a first character string on the basis of their effect on the similarity calculation; matching information collection means that collects, for the selected partial character strings, as matching information with a second character string, the length of the partial character string, the position at which it appears in the first character string, the position at which it appears in the second character string, and a sequence number indicating which match it is within the second character string; and similarity calculation means that calculates the similarity on the basis of the weights assigned to those partial character strings, among the ones included in the matching information, whose order of appearance is consistent in both character strings.

[0028] With this configuration, a computer can be made to function as means for making the calculation still faster while maintaining retrieval accuracy by limiting the partial character strings whose weights are calculated.

[0029] The invention according to claim 9 is a computer-readable recording medium on which the program according to claim 7 or 8 is recorded.

[0030] With this configuration, the recorded functions can be realized wherever they are needed.

[0031]

[Embodiments of the Invention] The present invention relates to a method for calculating the similarity between character strings using n-grams, and to an apparatus, a program, and a recording medium that implement the method. It is assumed here that the similarity is calculated between an input character string and a plurality of documents registered in a database, but other applications are also possible. To find the matching parts between character strings, the invention does not find, for each document in the database, the n-grams common to the input character string and that document. Instead, it cuts n-grams out of the input character string and uses a suffix file to efficiently retrieve from the database the documents that contain each n-gram.
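A suffix-based lookup of the kind paragraph [0031] refers to can be sketched as below. The source does not describe the layout of its suffix file, so this sketch assumes a plain in-memory suffix array over the concatenated documents, with a separator character (`"\0"`, assumed absent from the documents). All names are hypothetical, and the naive `sorted` construction is for illustration only.

```python
from bisect import bisect_right

def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, gram):
    """Start positions of gram in text, found by two binary searches on sa."""
    n = len(gram)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose n-prefix >= gram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < gram:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose n-prefix > gram
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= gram:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

docs = ["abcab", "cab", "xyz"]
text = "\0".join(docs)                  # separator assumed not to occur in any document
starts, pos = [], 0
for d in docs:                          # offset of each document inside text
    starts.append(pos)
    pos += len(d) + 1
sa = build_suffix_array(text)

def doc_of(position):
    """Index of the document that contains the given offset of text."""
    return bisect_right(starts, position) - 1

hits = find_occurrences(text, sa, "ab")
print(hits, [doc_of(p) for p in hits])
```

Each query n-gram costs two binary searches regardless of how many documents the database holds, which is the point of looking up n-grams rather than scanning every document.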

[0032] The n-grams cut out from the input character string are thought to include many that have little influence on the similarity calculation. The number of partial character strings cut out from the input character string is, for example, m - 1 when the input character string has length m and the partial character strings are limited to bigrams (two characters). Therefore, if matching information is obtained for all extracted bigrams without any selection, the computation time grows as the input length m grows. In the present invention, n-grams are therefore selected from those extracted by estimating their effect on the similarity calculation, and only a constant number of n-grams are used for collecting matching information.

[0033] To improve retrieval accuracy, the selection must pick the partial character strings whose weights, as added to the similarity, are large. Because the weight is determined by the number of documents in the database that contain the partial character string (hereinafter, df), it would be natural to select on the basis of df. In one aspect of the present invention, however, it is recommended to select partial character strings on the basis of the number of times the n-gram occurs in the documents of the database (hereinafter, the occurrence frequency, or tf) instead of df.

[0034] tf can be computed directly, and the computation time hardly changes as its value grows. Computing df, by contrast, requires first computing tf and then recounting to remove duplicates within each document, so the computation time becomes large when tf is large. However, tf and df are strongly correlated, so using tf in place of df can be expected not to affect retrieval accuracy.
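The difference between tf and df, and a tf-based selection of a constant number of n-grams, can be illustrated as follows. This is a sketch under assumptions: the source does not state the exact selection rule, so ranking by ascending tf (rarest first) and dropping n-grams that never occur in the database are illustrative choices, and `select_ngrams` and its parameter `k` are hypothetical names. The counts are computed by direct scanning here, not with a suffix file.

```python
def occurrences(doc, gram):
    """Number of (possibly overlapping) occurrences of gram in doc."""
    return sum(1 for i in range(len(doc) - len(gram) + 1) if doc.startswith(gram, i))

def tf(gram, docs):
    """Total number of occurrences of gram across the whole database."""
    return sum(occurrences(d, gram) for d in docs)

def df(gram, docs):
    """Number of documents containing gram at least once."""
    return sum(1 for d in docs if gram in d)

def select_ngrams(query, docs, k, n=2):
    """Keep at most k distinct query n-grams, rarest (lowest tf) first.
    Assumption: n-grams with tf == 0 are dropped, since they match no document."""
    grams = sorted({query[i:i + n] for i in range(len(query) - n + 1)})
    present = [g for g in grams if tf(g, docs) > 0]
    return sorted(present, key=lambda g: tf(g, docs))[:k]

docs = ["abab", "abcd", "cdcd"]
print(tf("ab", docs), df("ab", docs))
print(select_ngrams("abcdx", docs, k=2))
```

In this toy database "ab" has tf 3 but df 2, because it occurs twice in the first document; obtaining df means collapsing such repeats per document, which is the extra step the text describes.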

[0035] Matching information is collected for each selected n-gram as follows. First, the weight of the n-gram is calculated. Next, the documents containing the n-gram are found from the whole document database, and the following are recorded as matching information: the position at which the n-gram appears in the document, the position at which it appears in the input character string, the length of the n-gram, a sequence number indicating which match it is within the document, and the weight of the n-gram.
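The record described in paragraph [0035] could be represented as in the sketch below. The class and field names are hypothetical, the table is a plain dictionary keyed by document number, and the sequence number is interpreted, as an assumption, as the 1-based ordinal of the occurrence of that n-gram within the document. Documents are scanned directly here for brevity; the source retrieves them through a suffix file.

```python
from dataclasses import dataclass

@dataclass
class Match:
    doc_id: int      # document number
    length: int      # length of the matched n-gram
    query_pos: int   # position of the n-gram in the input string
    doc_pos: int     # position of the n-gram in the document
    seq: int         # which occurrence within the document (1-based; an assumption)
    weight: float    # weight of the n-gram

def positions(text, gram):
    """All start positions of gram in text."""
    return [i for i in range(len(text) - len(gram) + 1) if text.startswith(gram, i)]

def collect_matches(query, docs, selected, weight_of):
    """Build a table: doc_id -> list of Match records for the selected n-grams."""
    table = {}
    for gram in selected:
        for doc_id, doc in enumerate(docs):
            for seq, doc_pos in enumerate(positions(doc, gram), start=1):
                for query_pos in positions(query, gram):
                    table.setdefault(doc_id, []).append(
                        Match(doc_id, len(gram), query_pos, doc_pos, seq, weight_of(gram)))
    return table

table = collect_matches("abz", ["xabab", "cd"], ["ab"], lambda g: 1.0)
for m in table[0]:
    print(m)
```

Keeping the positions in the record, and not only the weight, is what lets a later step apply an order-sensitive similarity instead of a plain sum.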

[0036] Ordinarily, the matching information obtained is not recorded or managed; the weights are simply added as matches are found in order to calculate the similarity. In the present invention, recording and managing this information makes it possible to apply not only the method that sums the weights of the matched n-grams but also many other similarity calculation methods, while retaining high speed.

[0037] The similarity between the input character string and a document in the database is calculated by adding the weights assigned to the matched n-grams. However, even when a matching n-gram appears two or more times in the same document, its weight is added only once. That is, the added weight does not double just because the matching n-gram appears twice. Instead, the weight of the n-gram stored in the matching information differs from document to document according to the number of times the n-gram appears in the document (hereinafter, dtf): the larger dtf is, the larger the weight. The weight is also defined for each n-gram for the case dtf = 0, and a weight of 0 or less is added for every document that does not contain the n-gram. This is based on the idea that not containing an n-gram indicates dissimilarity; the weight added in this case can be regarded as a numerical measure of the degree of dissimilarity. The similarity value is therefore a real number that may be negative, and the larger its value on the number line, the higher the similarity to the input character string.
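The scoring rule of paragraph [0037] can be sketched as follows. The source does not give the weight formula, so `weight` below is purely hypothetical: it is chosen only so that it is added once per n-gram, grows with dtf more slowly than linearly, and is 0 or less when dtf = 0. The constant 0.1 and the smoothed logarithm are likewise assumptions.

```python
import math

def weight(dtf, df, n_docs):
    """Hypothetical weight; the source does not state the formula."""
    idf = math.log((n_docs + 1) / (df + 1))   # >= 0 because df <= n_docs
    if dtf == 0:
        return -0.1 * idf                     # <= 0: absence counts as dissimilarity
    return idf * (1.0 + math.log(dtf))        # larger for larger dtf, but not doubled

def count(doc, gram):
    """Number of occurrences of gram in doc (dtf)."""
    return sum(1 for i in range(len(doc) - len(gram) + 1) if doc.startswith(gram, i))

def rank(query_grams, docs):
    """Score every document and return (similarity, doc_id) pairs, best first."""
    n_docs = len(docs)
    scores = []
    for doc_id, doc in enumerate(docs):
        total = 0.0
        for gram in query_grams:
            df = sum(1 for d in docs if gram in d)
            total += weight(count(doc, gram), df, n_docs)  # one term per n-gram
        scores.append((total, doc_id))
    return sorted(scores, reverse=True)

for score, doc_id in rank(["ab"], ["abab", "abxx", "zzzz"]):
    print(doc_id, round(score, 4))
```

The document containing "ab" twice scores higher than the one containing it once, but less than twice as high, and the document without "ab" receives a negative score.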

[0038] The method of calculating the similarity is not limited to the above method of adding the weights assigned to the matched n-grams; other similarity calculation methods can also be applied. One example is a method that selects, from the matched n-grams, only those whose order of appearance is consistent in both character strings and takes the sum of the weights assigned to those n-grams as the similarity.

[0039] Methods that calculate the similarity taking the order of appearance of character strings into account have been proposed before. However, because they calculate weights for matching partial character strings of every length, they require more computation than the method that sums the weights of matched n-grams, and the computation time grows as the character strings handled become longer.

[0040] In the present invention, weights are assigned to the selected n-grams, and a weight is added to the similarity only when one of those n-grams matches. Matches and order of appearance therefore need to be considered only for the selected partial character strings, so the calculation is faster than with the earlier methods, and an increase in the length of the character strings handled has little effect on computation time. In this method, the similarity is calculated using dynamic programming (hereinafter, DP) in order to find efficiently, among the combinations of conforming n-grams, the combination that gives the highest similarity. The method of obtaining the n-gram DP similarity using DP is described below.

[0041] Let α, β, γ, and δ be character strings of length 0 or more, let ξ, ζ, and η be character strings of length 1 or more, and let "" be the empty character string. A character string (for example, α) formed by concatenating several character strings (for example, ξ and γ) is written by juxtaposing the symbols of its component strings (for example, α = ξγ).

[0042] The n-gram DP similarity SimDP is obtained by recursively applying the following equations according to the matching pattern of parts of the argument character strings. First, when both character strings are empty,

    SimDP("", "") = 0    (1)

Otherwise,

    SimDP(α, β) = MAX( SimDPs(α, β), SimDPg(α, β) )    (2)

Here, let ξ be a character string recorded in the list for α and β in the matching information management table, with α = ξγ and β = ξδ. SimDPs(α, β) is then obtained by evaluating

    SimDPs(α, β) = MAX( Score(ξ) + SimDP(γ, δ) )    (3)

over all such ξ. When no such character string ξ exists,

    SimDPs(α, β) = 0.0    (4)

Score(ξ) is a function that returns the weight of ξ recorded in the matching information management table.

[0043] Further, let ζ and η be the longest character strings that satisfy α = ζγ and β = ηδ and that share no part with any character string recorded in the list for α and β in the matching information management table. SimDPg(α, β) is then given by

    SimDPg(α, β) = MAX( SimDP(α, δ), SimDP(γ, β), SimDP(γ, δ) )    (5)

This equation means that, of the similarities between the character strings that remain after removing ζ, η, or both from the two character strings ζγ and ηδ (namely α and δ, γ and β, and γ and δ), the highest is adopted.

[0044] By applying the above equations recursively, the partial character strings that conform to the order of both character strings are found from among those recorded in the matching information management table, and the similarity is maximized.
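The recursion of equations (1) to (5) can be approximated by the bottom-up sketch below. It is a simplification, not the recursion exactly as stated: the gap case skips one character at a time, where the text skips the whole unmatched prefixes ζ and η, and every position is visited, where the text restricts attention to positions recorded in the matching information management table. The function name `sim_dp` and the `scores` dictionary (standing in for Score and the table) are hypothetical, and all recorded n-grams are assumed to be non-empty.

```python
def sim_dp(a, b, scores):
    """Order-consistent n-gram similarity of a and b.
    scores maps each recorded n-gram to its weight (the role of Score)."""
    la, lb = len(a), len(b)
    # f[i][j] is the best similarity of the suffixes a[i:] and b[j:];
    # it is 0 when either suffix is empty, cf. eqs. (1) and (4).
    f = [[0.0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la - 1, -1, -1):
        for j in range(lb - 1, -1, -1):
            best = max(f[i + 1][j], f[i][j + 1])      # gap case, cf. eq. (5)
            for gram, w in scores.items():            # match case, cf. eq. (3)
                if a.startswith(gram, i) and b.startswith(gram, j):
                    best = max(best, w + f[i + len(gram)][j + len(gram)])
            f[i][j] = best                            # cf. eq. (2)
    return f[0][0]

scores = {"ab": 2.0, "cd": 1.0}
print(sim_dp("abXcd", "abYYcd", scores))  # same order in both strings
print(sim_dp("abXcd", "cdYab", scores))   # order differs between the strings
```

When "ab" and "cd" appear in the same order in both strings, both weights are counted; when the order is reversed in one string, only the heavier of the two can be counted, which is how the order of appearance enters the similarity.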

【００４５】As equation (3) shows, the n-gram DP similarity described above considers matches on n-grams (character strings of length n) rather than on single characters (character strings of length 1) and assigns weights to them, so the calculation takes the continuity of characters into account. In addition, matches are not considered for every partial character string; they are restricted to the n-grams recorded in the matching information management table. As equation (5) shows, the portions unrelated to the recorded partial character strings therefore need not be tested for a match and can be removed. Consequently, however long the character strings being compared are, only the portions related to the partial character strings recorded in the matching information management table need to be considered, and the similarity can be calculated faster than with the conventional method, which considers matches between the character strings for every partial character string.

【００４６】(First Embodiment) First, an embodiment of a method of calculating the similarity between character strings using n-grams is described. FIG. 1 shows an example of a document search apparatus that, based on selected partial character strings, searches for the document most similar to an input character string. The apparatus comprises a document database 10, a character string input unit 11, a partial character string selection unit 13, a matching information collection unit 14, a similarity calculation unit 15, a similarity calculation control unit 16, and a search result output unit 12.

【００４７】A plurality of documents 10a, 10b, ..., 10c to be searched are registered in the document database 10. To search, the user inputs a keyword, word, phrase, sentence, passage, or the like (hereinafter collectively called the search text). The character string input unit 11 passes the search text to the partial character string selection unit 13 as a character string X.

【００４８】The partial character string selection unit 13 cuts partial character strings out of the character string X supplied by the character string input unit 11, calculates the appearance frequency tf of each, takes a fixed number of them in ascending order of tf, and registers them in the partial character string management table T1. The partial character string selection unit 13 consists of a partial character string cutout control unit 31, a partial character string cutout unit 32, a character string appearance frequency calculation unit 33, and a partial character string registration unit 34. The partial character string cutout control unit 31 controls which partial character string the partial character string cutout unit 32 cuts out. The partial character string cutout unit 32 cuts a partial character string x out of the character string X. The character string appearance frequency calculation unit 33 calculates the appearance frequency tf of the partial character string x in the document database 10. The partial character string registration unit 34 selects a fixed number of the cut-out partial character strings in ascending order of tf and registers them in the partial character string management table T1.

【００４９】For each partial character string registered in the partial character string management table T1, the matching information collection unit 14 calculates the weight to be given to the character string and detects where the partial character string appears in the document database. As matching information, it records in the matching information management table T2 the appearance location in the document, the appearance location in the input character string, the length of the partial character string, a sequence number indicating which match within the document this is, and the weight of the character string. The matching information collection unit 14 consists of a matching information collection control unit 41, a character string appearance location search unit 42, a character string weight calculation unit 43, a matching information registration control unit 44, and a matching information registration unit 45. The matching information collection control unit 41 takes the partial character strings a registered in the partial character string management table T1 one at a time and passes each to the character string appearance location search unit 42 and the character string weight calculation unit 43. For every appearance of the given partial character string a in the document database, the character string appearance location search unit 42 obtains the appearance location, the length of a, and the sequence number indicating which match within the document it is. The character string weight calculation unit 43 calculates the weight to be given to the character string a. The matching information registration control unit 44 selects the appearance locations of a one at a time and passes each to the matching information registration unit 45 as a tuple consisting of the appearance location in the document, the appearance location in the input character string, the length of a, the sequence number within the document, and the weight of a. The matching information registration unit 45 registers each received tuple as matching information in the list for the corresponding document number in the matching information management table T2.

【００５０】The similarity calculation control unit 16 takes the list for one document Y out of the matching information management table T2 and passes it to the similarity calculation unit 15.

【００５１】The similarity calculation unit 15 calculates the similarity between X and Y from the given list of matching information. It consists of a character string weight addition control unit 51 and a character string weight addition unit 52. The character string weight addition control unit 51 selects one piece of matching information from the list and, if its sequence number is 1, passes the weight score of that character string to the character string weight addition unit 52. The character string weight addition unit 52 adds the given weight to the similarity Sim(X, Y) between X and Y.

【００５２】The search result output unit 12 selects and outputs the document with the highest similarity. It may additionally output the documents whose similarity is at or above a certain value, or the documents ranked from the top down to a certain position.

【００５３】(Second Embodiment) An embodiment in which the text search according to the present invention is implemented in software is described below.

【００５４】FIG. 2 shows an example of a computer system used to execute the text search. The computer system comprises a display 101, a printer 102, a keyboard 103, a floppy (R) disk device 104, a CD-ROM (Compact Disk-Read Only Memory) device 105, a read-only memory (hereinafter ROM) 106, a readable and writable random access memory (hereinafter RAM) 107, a magnetic disk device 108, a central processing unit (hereinafter CPU) 109, a communication interface 110, and a bus 111 connecting them. The floppy (R) disk device 104 reads and writes a floppy (R) disk 112, and the CD-ROM device 105 reads a CD-ROM 113. The computer system is connected to a communication network 114 through the communication interface 110.

【００５５】The text search program implementing the present invention is stored in the ROM 106. Alternatively, it may be stored on the floppy (R) disk 112, the CD-ROM 113, or the magnetic disk device 108, transferred to the RAM 107, and then executed by the CPU 109. The CPU 109 executes the text search program using the RAM 107 as a work area; if necessary, the magnetic disk device 108 may also be used as a work area. Execution of the text search program is instructed from the keyboard 103, and the results are output to the display 101 or the printer 102. Needless to say, execution may also be instructed from the floppy (R) disk 112, and the results may be written to the floppy (R) disk 112.

【００５６】The document database is stored on the floppy (R) disk 112, the CD-ROM 113, or the magnetic disk device 108. It may be transferred to the RAM 107 in advance for fast access, and it may be converted into an easily processed format during that transfer. Of course, the text search program, the document database, or the execution instruction may be input to this computer system via the network 114, and the execution results may be output from this computer system via the network 114.

【００５７】The invention is not limited to what is shown in the figures. Needless to say, it can be modified into various embodiments, for example by using various recording media, input means, and output means for input to and output from this computer system. These recording media, input means, and output means may of course be accessed directly by the computer system or accessed via a communication network.

【００５８】FIGS. 3 to 7 show the processing flow of a document search program based on a character string similarity that is calculated after selecting the n-grams to be used in the calculation.

【００５９】FIG. 3 shows the processing flow for searching the document database based on the search text and then selecting and outputting documents with high similarity.

【００６０】First, in step S11 (hereinafter abbreviated S11), in preparation for efficiently calculating the number of occurrences of a character string, all the documents in the document database are merged and a suffix file is created. How to create and use a suffix file is disclosed in M. Yamamoto and K. W. Church, "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus," in Proceedings of the 6th Workshop on Very Large Corpora, Ed. Eugene Charniak, Montreal, pp. 28-37, 1998.

【００６１】With a suffix file, the number of times a character string appears in the document database can be obtained quickly. The suffix file is implemented by sorting the partial character strings that can occur in all the documents in character-code order and assigning them serial numbers (suffixes). The number of times a character string appears in the document database is obtained by counting how many character strings in the suffix file match it.

【００６２】Specifically, the minimum value min and the maximum value max of the suffixes of the matching character strings are each found by binary search. If no character string matches, the number of occurrences in the document database is 0. Once min and max are found, the number of occurrences tf is given by tf = max - min + 1.
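The binary-search count described above can be illustrated with a small Python sketch. It assumes an in-memory suffix array built by naive sorting, which is only practical for short texts; the function names `build_suffix_array` and `term_frequency` are hypothetical, and the cited paper describes more efficient constructions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def term_frequency(text, suffix_array, pattern):
    """Count occurrences of pattern in text as tf = max - min + 1."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Binary search for min: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for max: last suffix whose prefix is <= pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    last = lo - 1

    if last < first:
        return 0  # no matching character string
    return last - first + 1
```

Because the suffixes are sorted, all suffixes that begin with the pattern occupy one contiguous range, so two binary searches are enough to count them.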

【００６３】Documents in the document database are distinguished from one another by document numbers, and each partial character string registered in the suffix file is tagged with its document number. This makes it possible to search efficiently for the documents that contain a given partial character string. In addition, df can be calculated by counting the number of duplicated document numbers and subtracting that count from tf.
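As a small worked example of the df calculation, the following sketch assumes the document numbers of all occurrences of one partial character string are already available as a list; the function name `document_frequency` is hypothetical.

```python
def document_frequency(doc_numbers):
    """Compute df from the document numbers of every occurrence of a string.

    tf is the number of occurrences; each occurrence whose document number
    repeats the previous one (after sorting) is a duplicate, and
    df = tf - duplicates.
    """
    tf = len(doc_numbers)
    ordered = sorted(doc_numbers)
    duplicates = sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev == cur)
    return tf - duplicates
```

For occurrences in documents 1, 1, 2, 5, 5, 5, tf is 6, three of the document numbers are duplicates, and df is 3.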

【００６４】Next, in S12, the search text is read into the character string X.

【００６５】In S13, the partial character strings cut out of the character string X are selected based on their appearance frequency tf in the document database and are recorded, each paired with its tf, in the partial character string management table. The processing in S13 is described later with reference to FIG. 4.

【００６６】In S14, matching information is collected for each partial character string recorded in the partial character string management table and is recorded in the matching information management table. The matching information management table holds one list of matching information per document number. The processing in S14 is described later with reference to FIG. 5.

【００６７】In S15, the list for one document Y is taken out of the matching information management table.

【００６８】Next, in S16, the similarity between X and Y is calculated from the extracted list. The processing in S16 is described later with reference to FIG. 6.

【００６９】In S17, the calculated similarity is paired with the document number and registered in the document management table.

【００７０】In S18, it is determined whether the similarity has been calculated for every list recorded in the matching information management table. If not, a list whose similarity has not yet been calculated is selected and extracted in S15, and the processing through S17 is repeated. If the similarity has been calculated for every list, the registered table is sorted in descending order of similarity in S19.

【００７１】In S20, documents with high similarity are output. Various modes are possible: outputting only one document, outputting a predetermined number of documents, outputting all documents whose similarity is at or above a predetermined value, and so on.
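The loop of S15 to S20 can be sketched as follows. This is an illustration only: the matching information management table is assumed to be a dictionary from document number to a list of records, the similarity calculation of S16 is passed in as a hypothetical callable `similarity`, and `top_k` is a hypothetical parameter standing in for one of the output modes of S20.

```python
def rank_documents(match_table, similarity, top_k=None):
    """Return (document number, similarity) pairs in descending order of similarity.

    match_table: dict mapping document number -> list of matching records.
    similarity:  callable that turns one list of records into a number (S16).
    top_k:       if given, keep only the top_k best documents (one mode of S20).
    """
    document_table = []
    for doc_number, records in match_table.items():    # S15, repeated via S18
        sim = similarity(records)                      # S16
        document_table.append((doc_number, sim))       # S17
    document_table.sort(key=lambda pair: pair[1], reverse=True)  # S19
    if top_k is not None:
        document_table = document_table[:top_k]        # S20
    return document_table
```

Passing the similarity calculation as a parameter keeps the ranking loop independent of how the weights are defined.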

【００７２】FIG. 4 shows the flow of the processing that cuts partial character strings out of the character string X holding the search text, selects on the basis of appearance frequency the partial character strings to be used for collecting matching information, and records them in the partial character string management table.

【００７３】First, in S31, the variable num_substring, which represents the number of partial character strings recorded in the partial character string management table, and the variable j, which represents the length of the partial character string to be cut out, are initialized. MinNgramLength is a parameter that determines the minimum length of the partial character strings to be cut out.

【００７４】Next, in S32, one partial character string of length j is cut out of the character string X, and its appearance frequency tf in the document database is calculated.

【００７５】In S33, it is determined whether the tf of the cut-out partial character string is 0. If tf = 0, the partial character string does not exist in the document database and is therefore unsuitable for collecting matching information, so S34 is skipped and the process proceeds to S35. If tf ≠ 0, the process proceeds to S34.

【００７６】In S34, the cut-out partial character string is recorded together with its tf in the partial character string management table, and 1 is added to num_substring.

【００７７】In S35, it is determined whether tf has been calculated for every partial character string of length j that can be cut out of X. If not, a partial character string of length j that has not yet been processed is selected in S32, its tf is calculated, and the processing through S34 is repeated. If every partial character string of length j has been processed, 1 is added to j in S36.

【００７８】In S37, it is determined whether the length j of the partial character string to be cut out of X is greater than MaxNgramLength, the parameter that determines the maximum length of the partial character strings to be cut out. If j is less than or equal to MaxNgramLength, the process returns to S32 and the processing through S36 is repeated for partial character strings of length j. If j is greater than MaxNgramLength, tf has been calculated for every partial character string whose length is at least MinNgramLength and at most MaxNgramLength, so the process proceeds to S38, where the partial character strings recorded in the partial character string management table are sorted in ascending order of tf.

【００７９】In S39, it is determined whether num_substring, the number of partial character strings registered in the partial character string management table, is greater than SubStringLimit, the parameter that determines the upper limit on the number of partial character strings used for collecting matching information. If num_substring is greater than SubStringLimit, the process proceeds to S40, where SubStringLimit partial character strings are taken in ascending order of tf and recorded anew in the partial character string management table. If num_substring is less than or equal to SubStringLimit, S40 is skipped and the process proceeds to S41.

【００８０】S41 returns the recorded partial character string management table.
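The selection flow of S31 to S41 can be sketched as follows. The sketch assumes the document database is a single in-memory string and counts occurrences by direct scanning instead of a suffix file. The function names and the snake_case parameters `min_ngram_length`, `max_ngram_length`, and `substring_limit` are hypothetical renderings of MinNgramLength, MaxNgramLength, and SubStringLimit. The text does not say whether a partial character string that occurs twice in X is recorded once or twice; this sketch records every cut-out as is.

```python
def count_occurrences(corpus, pattern):
    """Count (possibly overlapping) occurrences of pattern in corpus."""
    return sum(
        1
        for i in range(len(corpus) - len(pattern) + 1)
        if corpus.startswith(pattern, i)
    )


def select_substrings(x, corpus, min_ngram_length, max_ngram_length, substring_limit):
    """Return up to substring_limit (substring, tf) pairs with the smallest tf."""
    table = []                                                    # S31
    for j in range(min_ngram_length, max_ngram_length + 1):       # S36, S37
        for start in range(len(x) - j + 1):                       # S32, S35
            sub = x[start:start + j]
            tf = count_occurrences(corpus, sub)
            if tf != 0:                                           # S33
                table.append((sub, tf))                           # S34
    table.sort(key=lambda entry: entry[1])                        # S38 (stable sort)
    if len(table) > substring_limit:                              # S39
        table = table[:substring_limit]                           # S40
    return table                                                  # S41
```

Keeping the rarest partial character strings favors those that discriminate best between documents, while the limit bounds the amount of matching information that has to be collected afterwards.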

【００８１】FIG. 5 shows the flow of the processing that collects matching information between each partial character string recorded in the partial character string management table and each document in the document database, and records that information in the matching information management table.

【００８２】First, in S51, the variable p0fit_sum is initialized to 0. The variable p0fit_sum is used to reduce the computation required when the similarity is calculated by adding the weights of matched n-grams; it is a similarity offset that applies to all the documents in the document database.

【００８３】In S52, one partial character string is selected from the partial character string management table and read into a.

【００８４】In S53, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated, and p0fit is added to p0fit_sum. p0fit, p1fit, p2fit, p3fit, and p4fit are the weights of a in a document when a does not appear in the document, appears once, appears twice, appears three times, and appears four or more times, respectively. The method of calculating p0fit, p1fit, p2fit, p3fit, and p4fit is described later with reference to FIG. 7.

【００８５】In S54, all the locations where a appears in the document database are found and sorted in order of location.

【００８６】In S55, for each appearance location of a, the document number of the document containing a is obtained. Because the appearances of a are sorted by location, the resulting document numbers are also in ascending order.

【００８７】In S56, one appearance location of a is selected, in order of location.

【００８８】In S57, it is determined whether the selected appearance location of a is the first one within the document that contains it. That is, if the document of the selected appearance location differs from the document of the preceding appearance location, it is the first appearance location; if the documents are the same, it is a second or later appearance location. For a first appearance location, the process proceeds to S58, where the number of appearances dtf of a in that document is calculated and the weight of a in the document is determined. In addition, sequence_num is set to 1. sequence_num is the sequence number indicating which appearance of a within the document the selected appearance location is.

【００８９】In S59, the sequence number within the document sequence_num, the appearance location of a in the input character string X (hereinafter startX), the appearance location of a in the document (hereinafter startdoc), the length of a (hereinafter termlength), and the weight of a (hereinafter score) are recorded as a tuple in the matching information management table, and 1 is added to sequence_num. However, the value recorded in score is not the weight of a itself but the weight of a minus p0fit. When the similarity is calculated by adding the weights of matched n-grams, each selected partial character string would otherwise require adding its p0fit to the similarity of every document that does not contain it. Instead, the weight of each matched n-gram minus p0fit is added, and the weight offset p0fit_sum is added to every similarity at the end. This device reduces the amount of computation.
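The equivalence behind this offset can be checked with a short sketch. It assumes, for illustration only, that each selected partial character string is described by a pair (weight in this document, or None if it does not appear; its p0fit). The names `naive_similarity` and `offset_similarity` are hypothetical.

```python
def naive_similarity(entries):
    """Add the weight of each matched substring and p0fit for each unmatched one."""
    return sum(p0fit if weight is None else weight for weight, p0fit in entries)


def offset_similarity(entries):
    """Add (weight - p0fit) for matched substrings only, then add p0fit_sum once."""
    matched = sum(weight - p0fit for weight, p0fit in entries if weight is not None)
    p0fit_sum = sum(p0fit for _, p0fit in entries)
    return matched + p0fit_sum
```

Both functions return the same value, but the second only has to visit the documents in which a partial character string actually appears; documents that contain none of them simply receive p0fit_sum.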

【００９０】In S60, sequence_num is compared with tf to determine whether matching information has been recorded for every appearance location of a. If some matching information has not yet been recorded, the next appearance location of a is selected in S56 and the processing through S59 is repeated. If matching information has been recorded for every appearance location of a, the process proceeds to S61.

【００９１】In S61, it is determined whether matching information has been collected for every partial character string in the partial character string management table. If there is a partial character string for which matching information has not been collected, a partial character string not yet selected is read into a in S52, and the processing through S60 is repeated. If matching information has been collected for every partial character string, the resulting matching information management table is returned in S62.
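The collection flow of FIG. 5 might be sketched as below. Several simplifications are assumptions rather than details from the text: documents are held as a list of strings and scanned directly instead of through the suffix file, `weights_for` is a hypothetical callable returning the tuple (p0fit, p1fit, p2fit, p3fit, p4fit) for a partial character string, startX is taken as the first position of the string in X, and each record is a dictionary whose keys follow the names used above.

```python
def collect_matches(x, documents, substrings, weights_for):
    """Build the matching information table and the offset p0fit_sum.

    Returns (table, p0fit_sum), where table maps a document number to the
    list of matching records for that document.
    """
    table = {}
    p0fit_sum = 0.0                                         # S51
    for a in substrings:                                    # S52, S61
        fits = weights_for(a)                               # S53
        p0fit_sum += fits[0]
        start_x = x.find(a)
        for doc_number, doc in enumerate(documents):        # S54, S55
            positions = [
                i for i in range(len(doc) - len(a) + 1) if doc.startswith(a, i)
            ]
            if not positions:
                continue
            dtf = len(positions)                            # S58
            weight = fits[min(dtf, 4)]                      # p1fit .. p4fit
            for sequence_num, start_doc in enumerate(positions, start=1):
                table.setdefault(doc_number, []).append({   # S59
                    "sequence_num": sequence_num,
                    "startX": start_x,
                    "startdoc": start_doc,
                    "termlength": len(a),
                    "score": weight - fits[0],              # weight minus p0fit
                })
    return table, p0fit_sum                                 # S62
```

Documents in which no selected partial character string appears never enter the table, which is what makes the offset p0fit_sum worthwhile.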

【００９２】FIG. 6 shows the processing flow that obtains the similarity between the input text X and a document Y by adding the weights of the matched character strings, using the list taken out of the matching information management table.

【００９３】First, in S71, the similarity between X and Y (hereinafter sim) is initialized to 0.

【００９４】In S72, one piece of matching information is selected from the list for Y recorded in the matching information management table and read into I.

【００９５】In S73, it is determined whether the sequence_num of the I that was read is 1. This prevents the score of the same partial character string from being added to sim more than once. If sequence_num is not 1, S74 is skipped and the process proceeds to S75. If sequence_num = 1, the score of I is added to sim in S74.

【００９６】In S75, it is determined whether every piece of matching information recorded in the list for Y has been examined. If so, the weight offset for the whole document set, p0fit_sum, is added to sim in S76. If some matching information has not yet been examined, one such piece is selected and read into I in S72, and the processing through S74 is repeated.

【００９７】S77 returns the resulting sim as the similarity between X and Y.
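The flow of S71 to S77 is small enough to write out directly. The sketch below assumes each piece of matching information is a dictionary with the keys `sequence_num` and `score`, which is an illustrative representation rather than one given in the text.

```python
def similarity(records, p0fit_sum):
    """Similarity between X and one document Y from Y's list of matching records."""
    sim = 0.0                                   # S71
    for record in records:                      # S72, S75
        if record["sequence_num"] == 1:         # S73
            sim += record["score"]              # S74
    return sim + p0fit_sum                      # S76, S77
```

Only the first occurrence of each partial character string in the document contributes, because its score already reflects how often the string appears there.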

【００９８】FIG. 7 shows the processing flow for calculating p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5.

【００９９】First, in S81, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to 0.

【０１００】In S82, the df of the partial character string a is calculated, and in S83, idf is calculated from df and the total number N of documents in the document database. This idf is a value grounded in the notion of information content from information theory, and using it as the weight of a partial character string is a well-known method.
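The text does not state the formula used for idf in S83. As an assumption, the sketch below uses the common definition idf = log2(N / df), which equals the information content, in bits, of the event that a randomly chosen document contains the string; the patent's figure may use a different form.

```python
import math


def inverse_document_frequency(n_documents, df):
    """idf under the common definition log2(N / df); df must be positive."""
    if df <= 0:
        raise ValueError("df must be positive")
    return math.log2(n_documents / df)
```

Under this definition, a partial character string that appears in every document has idf 0, and one that appears in a quarter of the documents has idf 2.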

【０１０１】In S84, the threshold tf_threshold, which is used to determine whether the partial character string a is effective for retrieval, is calculated.

【０１０２】In S85, whether partial character string a is effective for the search is determined from the values of tf and df. If tf / df > tf_threshold, the partial character string is judged to be effective for the search, and the process proceeds to S86. Otherwise, it is judged not to be effective, S86 and S87 are skipped, and p0fit, p1fit, p2fit, p3fit, and p4fit are returned in S88. In that case, the values of p0fit, p1fit, p2fit, p3fit, and p4fit are all returned as 0.

【０１０３】In S86, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.

【０１０４】The functions MAX and MIN in S87 return the maximum and the minimum, respectively, of the numerical values given as arguments. These functions limit the values of p0fit, p1fit, p2fit, p3fit, and p4fit to the range from LB to UB inclusive. LB and UB are both parameters that bound the distribution of p0fit, p1fit, p2fit, p3fit, and p4fit. S88 returns p0fit, p1fit, p2fit, p3fit, and p4fit.
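The flow of S81 to S88 can be sketched as below, under several stated assumptions. The coefficients of S84 and S86 appear only in FIG. 7, so the threshold is passed in as `tf_threshold` and the five weight formulas are passed in as hypothetical callables `raw_fits`. The idf formula (negative base-2 logarithm of df/N) is the common textbook definition and is assumed here, and composing the limit as MIN(MAX(value, LB), UB) is one usual way to apply MAX and MIN.

```python
import math


def idf_of(df, n_docs):
    """Assumed idf definition: -log2(df / N)."""
    return -math.log2(df / n_docs)


def clamp(value, lb, ub):
    """S87: limit a value to the range [lb, ub] with MAX and MIN."""
    return min(max(value, lb), ub)


def fit_weights(tf, df, idf_value, tf_threshold, raw_fits, lb, ub):
    """Return [p0fit, ..., p4fit] for one partial character string."""
    fits = [0.0] * 5                                  # S81: initialize to 0
    if df > 0 and tf / df > tf_threshold:             # S85: effective for search?
        fits = [clamp(f(tf, df, idf_value), lb, ub)   # S86 and S87
                for f in raw_fits]
    return fits                                       # S88
```

A call with LB=0 and UB=idf would pass `lb=0.0, ub=idf_value`. A partial character string that fails the test in S85 keeps the all-zero weights from S81.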

【０１０５】As the above description shows, the weights p0fit, p1fit, p2fit, p3fit, and p4fit are obtained as functions of tf, df, and idf. The coefficients used in S84 and S86 were determined from observations of document data, based on the theory disclosed in Christopher D. Manning and Hinrich Schutze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, pp. 529-574, 1999. These coefficients are not limited to the numerical values shown and may be set to values appropriate to the purpose.

【０１０６】As the processing flow for calculating p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5, the flow shown in FIG. 15 can be applied instead of the flow shown in FIG. 7. The calculation processing flow of FIG. 15 is described below.

【０１０７】First, in S181, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to 0.

【０１０８】In S182, the df of partial character string a and the number of documents in the document database in which partial character string a appears two or more times (hereinafter, df2) are calculated. In S183, idf is calculated from df and the total number N of documents in the document database.

【０１０９】In S184, the threshold df2_threshold, which is used to determine whether partial character string a is effective for the search, is set to 0.22.

【０１１０】In S185, whether partial character string a is effective for the search is determined from the values of df and df2. If df2 / df > df2_threshold, the partial character string is judged to be effective for the search, and the process proceeds to S186. Otherwise, it is judged not to be effective, S186 and S187 are skipped, and p0fit, p1fit, p2fit, p3fit, and p4fit are returned in S188. In that case, the values of p0fit, p1fit, p2fit, p3fit, and p4fit are all returned as 0.

【０１１１】In S186, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.

【０１１２】S187, like S87 in FIG. 7, limits the values of p0fit, p1fit, p2fit, p3fit, and p4fit to the range from LB to UB inclusive. LB and UB are both parameters that bound the distribution of p0fit, p1fit, p2fit, p3fit, and p4fit. S188 returns p0fit, p1fit, p2fit, p3fit, and p4fit.

【０１１３】As the above description shows, the calculation processing flow of FIG. 15 uses df2, in place of the tf used in the flow of FIG. 7, as the criterion for determining whether partial character string a is effective for the search. The ratio df2/df represents the occurrence concentration of a partial character string, that is, the degree to which the partial character string appears concentrated in particular documents only. Selecting partial character strings with this information improves search accuracy.
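The selection test of S182 to S185 can be illustrated with a short sketch. The function name and the input format (a list holding the number of occurrences of the partial character string in each document) are assumptions made for this example; the default threshold 0.22 is the value given for S184.

```python
def is_effective_by_df2(occurrences_per_doc, df2_threshold=0.22):
    """Judge a partial character string by its occurrence concentration df2/df."""
    # df: number of documents containing the string at least once
    df = sum(1 for count in occurrences_per_doc if count >= 1)
    # df2: number of documents containing the string two or more times
    df2 = sum(1 for count in occurrences_per_doc if count >= 2)
    # S185: effective only if df2 / df exceeds the threshold
    return df > 0 and df2 / df > df2_threshold
```

A string that tends to repeat within the documents where it occurs has a high df2/df and is kept, while a string that appears only once per document almost everywhere is dropped.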

【０１１４】The threshold df2_threshold in S183 and the coefficients used in S186 are not limited to the numerical values shown and may be set to values appropriate to the purpose.

【０１１５】FIG. 8 shows the structure of the matching information management table. The table consists of a list of matching information for each document number. In FIG. 8, matching information 1 and 5 are recorded as a list for document number 0002; matching information 2, 3, and 6 for document number 0100; and matching information 4 and 7 for document number 0111. Each piece of matching information stores the sequence number of the partial character string within the document (sequence_num), the position at which the partial character string appears in input character string X (startX), the position at which it appears in the document (startdoc), the length of the partial character string (termlength), and the weight assigned to the partial character string (score).

【０１１６】When matching information 8 is newly obtained for document number 0002, the head-of-list pointer, which until then pointed to matching information 5, is made to point to matching information 8, and a pointer from matching information 8 to matching information 5 is set, as shown in FIG. 8. Matching information 8 is thus recorded at the head of the list for document number 0002.
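The head insertion described for the matching information management table can be sketched as a per-document singly linked list. The class and function names are hypothetical, and an integer label stands in for the full record (sequence_num, startX, startdoc, termlength, score).

```python
class MatchNode:
    """A list cell holding one piece of matching information."""

    def __init__(self, label, next_node=None):
        self.label = label      # stands in for the full matching record
        self.next = next_node   # pointer to the previously recorded entry


def record_match(table, doc_number, label):
    """Record new matching information at the head of the document's list."""
    old_head = table.get(doc_number)                 # e.g. matching information 5
    table[doc_number] = MatchNode(label, old_head)   # new head points to old head


def labels(table, doc_number):
    """Walk the list from its head and collect the labels."""
    out = []
    node = table.get(doc_number)
    while node is not None:
        out.append(node.label)
        node = node.next
    return out
```

Recording 1, then 5, then 8 for document number 0002 yields a list read from the head as 8, 5, 1. Insertion at the head takes constant time regardless of list length.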

【０１１７】(Third Embodiment) Next, FIG. 9 shows an embodiment of a document search device that searches for the document most similar to an input character string based on the n-gram DP similarity. This document search device comprises a document database 10, a character string input unit 11, a partial character string selection unit 13 (internal details not shown), a matching information collection unit 14 (internal details not shown), a similarity calculation unit 17, a similarity calculation control unit 18, a recursive execution control unit 19, and a search result output unit 12.

【０１１８】The document database 10, the character string input unit 11, the partial character string selection unit 13, the matching information collection unit 14, and the search result output unit 12 have the same functions and configurations as the parts with the same reference numerals in the first embodiment, so their description is omitted.

【０１１９】The similarity calculation control unit 18 takes the list for one document Y out of the matching information management table T2 and passes it, together with the character strings X and Y, to the similarity calculation unit 17.

【０１２０】The similarity calculation unit 17 calculates the similarity between X and Y from the given list of matching information, based on equation (1) or equation (2). In the course of this calculation, similarities must be obtained in the same way for parts of the character strings. This is done by having the recursive execution control unit 19 apply the similarity calculation unit 17 repeatedly. The similarity calculation unit 17 consists of a matching character string similarity calculation unit 61, an arbitrary character string similarity calculation unit 62, and a maximum value selection unit 63. The matching character string similarity calculation unit 61 calculates SimDPs(α, β) of equation (3). The arbitrary character string similarity calculation unit 62 calculates SimDPg(α, β) of equation (5). The maximum value selection unit 63 applies the function MAX to these to calculate SimDP(α, β) of equation (2). When both character strings α and β received by the similarity calculation unit 17 are empty, the recursive execution control unit 19 sets SimDP(α, β) = 0.0. In that case, the matching character string similarity calculation unit 61, the arbitrary character string similarity calculation unit 62, and the maximum value selection unit 63 are not operated. Needless to say, the value SimDP(α, β) = 0.0 implements equation (1).

【０１２１】The matching character string similarity calculation unit 61 is implemented by a character string separation control unit 71, a character string separation unit 72, a similarity calculation unit 73, an addition unit 74, and a maximum value selection unit 75, and calculates SimDPs(α, β) of equation (3). SimDPs(α, β) is calculated only when the matching leading character string of α and β is a character string recorded in the matching information management table T2.

【０１２２】First, for the character strings α (= ξγ) and β (= ξδ) received by the matching character string similarity calculation unit 61, if there is no matching character string ξ, that is, if the matching character string ξ is the empty string, the character string separation control unit 71 sets SimDPs(α, β) = 0.0 in accordance with equation (1).

【０１２３】Next, if there is a matching character string ξ in the character strings α (= ξγ) and β (= ξδ) received by the matching character string similarity calculation unit 61, the character string separation control unit 71 operates the character string separation unit 72, the similarity calculation unit 73, and the addition unit 74 for every such ξ to calculate Score(ξ) + SimDP(γ, δ), which appears in equation (3). The maximum value selection unit 75 then selects the largest value. This yields SimDPs(α, β) of equation (3).

【０１２４】The character string separation unit 72 separates character string α into ξ and γ and character string β into ξ and δ, then refers to the matching information management table T2 and passes the weight Score(ξ) of ξ to the addition unit 74 and γ and δ to the similarity calculation unit 73. The similarity calculation unit 73 calculates SimDP(γ, δ) of equation (3). In practice, the similarity calculation unit 73 is realized by the recursive execution control unit 17 applying the similarity calculation unit 16 to γ and δ. The addition unit 74 performs the addition in equation (3).

【０１２５】The arbitrary character string similarity calculation unit 62 is implemented by similarity calculation units 81 to 83 and a maximum value selection unit 84, and calculates SimDPg(α, β) of equation (2). The arbitrary character string similarity calculation unit 62 is executed when the leading character strings differ, or when the leading character strings match but the matching string is not registered in the matching information management table. Corresponding to each case regarding the presence or absence of the leading characters ζ and η of the received character strings α (= ζγ) and β (= ηδ), the similarity calculation units 81, 82, and 83 obtain SimDP(α, δ), SimDP(γ, β), and SimDP(γ, δ) of equation (5), respectively. In practice, the similarity calculation units 81 to 83 are realized by the recursive execution control unit 19 applying the similarity calculation unit 17 to α and δ, to γ and β, and to γ and δ, respectively. The maximum value selection unit 84 performs the function MAX of equation (5).
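The recursion carried out by units 61, 62, and 63 can be sketched in software as follows. This is an illustrative reading of the description of equations (1), (2), (3), and (5), not the embodiment itself: the function name is hypothetical, the registered matching strings and their weights are passed as a plain dictionary `score` instead of the matching information management table T2, and memoization with `lru_cache` is added only to keep the recursion tractable.

```python
from functools import lru_cache


def sim_dp(x, y, score):
    """Recursive n-gram DP similarity of strings x and y (sketch)."""

    @lru_cache(maxsize=None)
    def rec(i, j):
        alpha, beta = x[i:], y[j:]
        if not alpha and not beta:
            return 0.0                               # equation (1)
        # SimDPs: try every registered string xi that both suffixes start with
        best_s = 0.0
        for xi, weight in score.items():
            if xi and alpha.startswith(xi) and beta.startswith(xi):
                best_s = max(best_s, weight + rec(i + len(xi), j + len(xi)))
        # SimDPg: drop the leading character of beta, of alpha, or of both
        candidates = []
        if beta:
            candidates.append(rec(i, j + 1))         # SimDP(alpha, delta)
        if alpha:
            candidates.append(rec(i + 1, j))         # SimDP(gamma, beta)
        if alpha and beta:
            candidates.append(rec(i + 1, j + 1))     # SimDP(gamma, delta)
        best_g = max(candidates)
        return max(best_s, best_g)                   # equation (2)

    return rec(0, 0)
```

With weights 1.0 for "ab" and 1.5 for "cd", the pair "abcd" and "abxcd" scores 2.5 because both bigrams occur in the same order, while "abcd" and "cdab" score 1.5 because only one of the crossing matches can be used.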

【０１２６】(Fourth Embodiment) FIGS. 10 and 11 show the processing flow for obtaining, in software, the n-gram DP similarity between character string X and character string Y. This processing can be used as the internal processing of S16 in FIG. 3 in place of the processing described with FIG. 6. It is executed on the computer system shown in the second embodiment.

【０１２７】First, in S91, for each of X and Y, among the first and last characters of the partial character strings recorded in the list, the position of the earliest first character (minX, minY) and the position of the latest last character (maxX, maxY) are obtained. An array X_index of length maxX-minX+1 and an array Y_index of length maxY-minY+1 are prepared, and all their elements are initialized to -1. These arrays correspond to the characters of X from minX to maxX and the characters of Y from minY to maxY, respectively.

【０１２８】In S92 to S94, for each character of X from minX to maxX and each character of Y from minY to maxY, 0 is assigned to the element of X_index or Y_index corresponding to any character that is the first or last character of a partial character string recorded in the list.

【０１２９】In S95 to S99, the indices i for which X_index[i]=0 are given the serial numbers 0, 1, 2, ..., X_index_num-1 in order from the front, and each number is assigned to X_index[i]. X_index_num is therefore the number of indices i for which X_index[i] ≠ -1.

【０１３０】In S100 to S104, the same processing is performed on Y_index. The indices j for which Y_index[j]=0 are given the serial numbers 0, 1, 2, ..., Y_index_num-1 in order from the front, and each number is assigned to Y_index[j]. Y_index_num is therefore the number of indices j for which Y_index[j] ≠ -1.
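The construction of X_index (and likewise Y_index) in S91 to S104 can be sketched as below. The function name and the input format, a list of (first position, last position) pairs for the recorded partial character strings, are assumptions made for illustration.

```python
def build_index(boundaries):
    """Build the compressed index array for one string.

    boundaries: list of (first_pos, last_pos) pairs, one per recorded
    partial character string. Returns (index, index_num, min_pos).
    """
    min_pos = min(first for first, _ in boundaries)   # minX or minY
    max_pos = max(last for _, last in boundaries)     # maxX or maxY
    index = [-1] * (max_pos - min_pos + 1)            # S91: all elements -1
    for first, last in boundaries:                    # S92 to S94: mark with 0
        index[first - min_pos] = 0
        index[last - min_pos] = 0
    index_num = 0
    for i, value in enumerate(index):                 # S95 to S99: serial numbers
        if value == 0:
            index[i] = index_num
            index_num += 1
    return index, index_num, min_pos
```

Characters that are neither the first nor the last character of a recorded partial character string keep the value -1, so the later score table only needs one row or column per marked character.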

【０１３１】In S105, the matching information for X and Y recorded in the list is first sorted by the order in which the partial character strings appear in X. Entries with the same order of appearance in X are then sorted by the order in which the partial character strings appear in Y, and the number of pieces of matching information is read into m.

【０１３２】Next, in preparation for obtaining the similarity efficiently by DP, a score table scoretable with (X_index_num+2) rows and (Y_index_num+2) columns is created, and all its elements are initialized to 0. In this table, the vertical direction corresponds to the characters of X that are the first or last character of a partial character string recorded in the list, and the horizontal direction corresponds to the characters of Y that are the first or last character of a partial character string recorded in the list.

【０１３３】In S106, k and i are initialized to k=1 and i=0.

【０１３４】In S107, j is initialized to j=0. The variable k indicates that the k-th piece of matching information from the head is currently under consideration. The variables i and j indicate which character is currently under consideration, in X and Y respectively, among the first and last characters of the partial character strings recorded in the list. In S108, scoretable[i][j] is assigned to currentscore as the current score.

【０１３５】Here, for convenience of explanation, the fields of the k-th piece of matching information from the head of the list are denoted startX(k), startdoc(k), termlength(k), and score(k).

【０１３６】In S109, it is determined whether the positions indicated by i and j coincide with the position where the partial character string of the k-th piece of matching information appears. If they coincide, the process proceeds to S110; if not, it proceeds to S114.

【０１３７】In S110, the row number and column number in the score table corresponding to the last character of the matched partial character string are obtained and assigned to target_i and target_j, respectively.

【０１３８】In S111, the score currently held in scoretable[target_i][target_j] is compared with the score obtained by adding the weight score(k) of the matching partial character string to currentscore. If currentscore+score(k) is larger than scoretable[target_i][target_j], currentscore+score(k) is assigned to scoretable[target_i][target_j] in S112. Otherwise, S112 is skipped and the process proceeds to S113.

【０１３９】In S113, 1 is added to k so that the next piece of matching information in the list is considered, and the process returns to S109. In S109, if the next piece of matching information also appears at the positions i and j, the processing up to S113 is repeated; otherwise, the process proceeds to S114.

【０１４０】In S114 to S119, the current score currentscore is compared with the scores to the right, below, and to the lower right in the score table, and wherever currentscore is larger, currentscore is assigned to that cell.

【０１４１】In S120, it is determined whether j has reached the right edge of the score table. If it has not, 1 is added to j in S121, the process returns to S108, and the processing up to S119 is repeated. If it has, the process proceeds to S122.

【０１４２】In S122, it is determined whether i has reached the bottom edge of the score table. If it has not, 1 is added to i in S123, the process returns to S107, and the processing up to S120 is repeated. If it has, the process proceeds to S124, and scoretable[X_index_num+1][Y_index_num+1] is returned as the similarity between X and Y.
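The table-filling procedure of S91 to S124 can be sketched compactly as follows. Several details are assumptions rather than statements of the patent: matches are given as tuples (startX, startdoc, termlength, score); dictionaries replace the X_index and Y_index arrays; the target cell is taken to be the compressed index of the last character plus one, so that overlapping matches cannot be chained; and matches are grouped by starting cell instead of being sorted and scanned with the counter k.

```python
def dp_similarity(matches):
    """n-gram DP similarity from a list of (start_x, start_doc, length, score)."""
    if not matches:
        return 0.0
    # Compressed indices of first and last characters (cf. X_index, Y_index)
    xs = sorted({p for sx, _, n, _ in matches for p in (sx, sx + n - 1)})
    ys = sorted({p for _, sy, n, _ in matches for p in (sy, sy + n - 1)})
    x_index = {p: i for i, p in enumerate(xs)}
    y_index = {p: j for j, p in enumerate(ys)}
    rows, cols = len(xs) + 2, len(ys) + 2
    table = [[0.0] * cols for _ in range(rows)]      # scoretable, all zeros
    # Group matches by the cell where they start (assumed target: last + 1)
    by_start = {}
    for sx, sy, n, score in matches:
        target = (x_index[sx + n - 1] + 1, y_index[sy + n - 1] + 1, score)
        by_start.setdefault((x_index[sx], y_index[sy]), []).append(target)
    for i in range(rows - 1):
        for j in range(cols - 1):
            current = table[i][j]                    # S108
            for ti, tj, score in by_start.get((i, j), []):   # S109 to S113
                table[ti][tj] = max(table[ti][tj], current + score)
            # S114 to S119: propagate right, down, and to the lower right
            table[i][j + 1] = max(table[i][j + 1], current)
            table[i + 1][j] = max(table[i + 1][j], current)
            table[i + 1][j + 1] = max(table[i + 1][j + 1], current)
    return table[rows - 1][cols - 1]                 # S124
```

Because the table has one row or column per marked character rather than per character of X and Y, its size depends on the number of collected matches and not on the document length.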

【０１４３】[0143]

[Effects of the Invention] According to the invention of claim 1, among the partial character strings cut out from an input character string, those that are effective for calculating the similarity with documents in the document database can be selected, and the similarity can be obtained from them. That is, the similarity can be obtained using only character strings that are highly effective for the search.

【０１４４】According to the invention of claim 2, the similarity can be obtained by focusing on partial character strings that are common to the two character strings and consistent with their order in each of them. That is, a similarity that takes the order of appearance of character strings into account is obtained.

【０１４５】According to the invention of claim 3, character strings can be selected in descending order of their effect on the similarity calculation, and the similarity can be obtained from them. The similarity determined in this way is a more appropriate value.

【０１４６】According to the invention of claim 4, not only the method of adding the weights of matched partial character strings but also other similarity calculation methods can be combined. This further improves search accuracy.

【０１４７】According to the invention of claim 5, a device can be realized that selects, among the partial character strings cut out from an input character string, those that are effective for calculating the similarity with documents in the document database, and obtains the similarity using only character strings that are highly effective for the search.

【０１４８】According to the invention of claim 6, a device can be realized that focuses on partial character strings that are common to the two character strings and consistent with their order in each of them, and obtains a similarity that takes the order of appearance of character strings into account.

【０１４９】According to the invention of claim 7, a program can be provided that selects, among the partial character strings cut out from an input character string, those that are effective for calculating the similarity with documents in the document database, and obtains the similarity using only character strings that are highly effective for the search.

【０１５０】According to the invention of claim 8, a program can be provided that focuses on partial character strings that are common to the two character strings and consistent with their order in each of them, and obtains a similarity that takes the order of appearance of character strings into account.

【０１５１】According to the invention of claim 9, the effects described above can be realized on various computers.

【０１５２】Next, the search performance of the first and second embodiments is described. An evaluation experiment was conducted to confirm the effect of selecting the partial character strings cut out from the input character string, and the effect of using tf instead of df as the selection criterion for partial character strings.

【０１５３】FIG. 12 shows the relationship between the number of selected partial character strings and search accuracy, and FIG. 13 shows the relationship between the number of selected partial character strings and search time, when tf and df are used as the selection criteria. Search accuracy was measured by a value called the 11-pt average precision. The 11-pt average precision is an evaluation index that assigns 11 points (numbered 1 to 11) at intervals of 0.1 over the recall range (0 to 1) and averages the values at those points. For details, see G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, pp. 174-181, McGraw-Hill Book Co., New York, 1983.
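For readers unfamiliar with the measure, the following sketch computes the commonly used 11-point interpolated average precision for a single query. It follows the standard textbook definition, in which the precision at each recall level is the highest precision observed at that recall or beyond; whether the evaluation reported here used exactly this interpolation is an assumption. The function name and input format are hypothetical.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """11-point interpolated average precision for one ranked result list.

    ranked_relevance: booleans, True where the document at that rank is relevant.
    total_relevant: number of relevant documents for the query.
    """
    points = []   # (recall, precision) observed at each relevant hit
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    total = 0.0
    for level in range(11):                # recall levels 0.0, 0.1, ..., 1.0
        recall_level = level / 10
        # Interpolated precision: best precision at this recall or higher
        total += max((p for r, p in points if r >= recall_level), default=0.0)
    return total / 11
```

The per-query values would then be averaged over all queries of the test collection to give one figure per setting.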

【０１５４】The length of the partial character strings cut out from the input character string was set to 2, that is, only bigrams were used. The parameters LB and UB used in S87 of FIG. 7 were set to LB=0 and UB=idf. The document data used was the NTCIR1 test collection, a representative test data set for similar-information retrieval, which contains about 300,000 documents and 53 search texts.

【０１５５】In both FIG. 12 and FIG. 13, the horizontal axis shows the number of selected partial character strings, that is, the value of SubStringLimit. The vertical axis of FIG. 12 shows search accuracy, and the vertical axis of FIG. 13 shows the time required to search for the 53 search texts.

【０１５６】FIG. 12 shows that, for both tf and df, increasing SubStringLimit improves search accuracy while its value is small, but once the value becomes reasonably large, further increases change search accuracy very little. Hence, if SubStringLimit is set to a suitable value, the loss of search accuracy can be kept small even when partial character strings are selected. Comparing tf and df, selection based on df gives higher search accuracy, but once SubStringLimit becomes reasonably large, selection based on tf achieves nearly the same search accuracy as df.

【０１５７】FIG. 13 shows that, for both tf and df, search time becomes shorter as SubStringLimit decreases, which confirms the speed-up obtained by selecting partial character strings. Comparing tf and df, search time is shorter with tf, and the difference grows as SubStringLimit becomes smaller. For example, if SubStringLimit is set to 22, a value at which both tf and df show high search accuracy, the search time is 264.6 seconds for tf and 367.1 seconds for df, so tf is about 1.4 times faster. In this way, the present invention can shorten the search time.

【０１５８】Next, the search performance obtained when the methods of the third and fourth embodiments are applied to the similarity calculation in S16 of FIG. 3 is described. Search accuracy was compared between the addition-based method of FIG. 6 and the DP-based method of FIGS. 10 and 11, and the results are shown in FIG. 14. The 11-pt average precision was used as the measure of search accuracy. The length of the partial character strings cut out from the input character string was set to 2, that is, only bigrams were used, and the parameters LB and UB used in S87 of FIG. 7 were set to LB=0 and UB=idf. The document data used was the NTCIR1 test collection.

【０１５９】In FIG. 14, the horizontal axis indicates the number of selected partial character strings, that is, the value of SubStringLimit, and the vertical axis indicates the search accuracy. As FIG. 14 shows, when SubStringLimit is 15 or less there is no significant difference in search accuracy, but when it exceeds 15 the DP method achieves higher search accuracy than the addition method. In addition, when the search time was measured, it was confirmed that the DP method requires about twice the search time of the addition method, and that it is several tens of times faster than the existing DP method.

【０１６０】As described above, according to the present invention, applying DP to the collected matching information improves search accuracy over the addition method and allows the similarity to be calculated faster than with the existing DP method.
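The contrast between the two similarity calculations can be illustrated with a small sketch. It does not reproduce the n-gram DP of FIGS. 10 and 11. It assumes a simplified stand-in: the addition method sums the weight of every recorded match, while the DP variant keeps only the heaviest chain of matches whose positions increase in both character strings. The `Match` record, the function names, and the constant weight of 1.0 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    pos_x: int     # start position of the substring in the input string X
    pos_y: int     # start position of the same substring in the document Y
    length: int    # substring length
    weight: float  # weight of the substring (assumed constant here)


def collect_matches(x, y, n=2, weight=lambda gram: 1.0):
    """Record every occurrence in y of every n-gram cut out of x."""
    matches = []
    for i in range(len(x) - n + 1):
        gram = x[i:i + n]
        j = y.find(gram)
        while j != -1:
            matches.append(Match(i, j, n, weight(gram)))
            j = y.find(gram, j + 1)
    return matches


def similarity_by_addition(matches):
    """Assumed addition method: sum the weights of all recorded matches."""
    return sum(m.weight for m in matches)


def similarity_by_dp(matches):
    """Heaviest chain of matches whose positions increase in both strings."""
    ordered = sorted(matches, key=lambda m: (m.pos_x, m.pos_y))
    best = [m.weight for m in ordered]  # best[k]: heaviest chain ending at k
    for k, m in enumerate(ordered):
        for p in range(k):
            prev = ordered[p]
            if prev.pos_x < m.pos_x and prev.pos_y < m.pos_y:
                best[k] = max(best[k], best[p] + m.weight)
    return max(best, default=0.0)


same_order = collect_matches("abcd", "abcd")
swapped = collect_matches("abcd", "cdab")
print(similarity_by_addition(same_order), similarity_by_dp(same_order))
print(similarity_by_addition(swapped), similarity_by_dp(swapped))
```

Because the DP runs over the collected matches rather than over every character pair of the two strings, its cost in this sketch depends on the number of matches, which SubStringLimit bounds indirectly.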

[Brief description of the drawings]

FIG. 1 is a diagram showing an embodiment of a document search device according to the present invention.

FIG. 2 is a diagram showing an embodiment of a computer system that performs text search according to the present invention.

FIG. 3 is a flowchart of the text search process in the embodiment of the present invention.

FIG. 4 is a flowchart of the process for selecting partial character strings cut out from a character string.

FIG. 5 is a flowchart of the process for collecting and recording matching information.

FIG. 6 is a flowchart of the process for calculating the similarity by addition from the matching information.

FIG. 7 is a flowchart of the process for calculating the weight of a partial character string.

FIG. 8 is a configuration diagram of the matching information management table.

FIG. 9 is a diagram showing another embodiment of the document search device according to the present invention.

FIG. 10 is a flowchart of the first half of the process for obtaining the n-gram DP similarity from the matching information.

FIG. 11 is a flowchart of the second half of the process for obtaining the n-gram DP similarity from the matching information.

FIG. 12 is a diagram showing the relationship between the number of selected partial character strings and the search accuracy when tf and df are used as selection criteria.

FIG. 13 is a diagram showing the relationship between the number of selected partial character strings and the search time when tf and df are used as selection criteria.

FIG. 14 is a diagram comparing the search accuracy of the similarity calculation method using DP according to the present invention with that of the similarity calculation method by addition.

FIG. 15 is a flowchart of another process for calculating the weight of a partial character string.

[Explanation of symbols]

10: Document database
10a, 10b, 10c: Documents
11: Character string input unit
12: Search result output unit
13: Partial character string selection unit
14: Matching information collection unit
15, 17, 73, 81, 82, 83: Similarity calculation units
16, 18: Similarity calculation control units
19: Recursive execution control unit
31: Partial character string cutout control unit
32: Partial character string cutout unit
33: Character string appearance frequency calculation unit
34: Partial character string registration unit
41: Matching information collection control unit
42: Character string appearance location search unit
43: Character string weight calculation unit
44: Matching information registration control unit
45: Matching information registration unit
51: Character string weight addition control unit
52: Character string weight addition unit
61: Matching character string similarity calculation unit
62: Arbitrary character string similarity calculation unit
63, 75, 84: Maximum value selection units
71: Character string separation control unit
72: Character string separation unit
74: Addition unit
101: Display
102: Printer
103: Keyboard
104: Floppy (R) disk device
105: CD-ROM device
106: Read-only memory (ROM)
107: Random access memory (RAM)
108: Magnetic disk device
109: Central processing unit (CPU)
110: Communication interface
111: Bus
112: Floppy (R) disk
113: CD-ROM
114: Communication network

Continued from the front page: F-term (reference) 5B075 ND03 NK06 NR05 PP12 PP24 PR06 QM01 QM08 QS01 UU06

Claims

[Claims]

1. A character string similarity calculation method for calculating the similarity between two character strings, wherein, for partial character strings selected, from among the partial character strings cut out from the first character string, based on their effect on the similarity calculation, information on matches with the second character string is collected, the weights of the matched partial character strings are calculated from the matching information, and the similarity is calculated based on the weights.

2. A character string similarity calculation method for calculating the similarity between two character strings, wherein, for partial character strings selected, from among the partial character strings cut out from the first character string, based on their effect on the similarity calculation, information on matches with the second character string is collected, and the similarity is calculated based on the weights of those partial character strings, among the partial character strings included in the matching information, whose order of appearance in the first and second character strings matches.

3. The character string similarity calculation method according to claim 1, wherein the selection of the partial character strings takes into account the number of times a partial character string cut out from the first character string appears in the second character string.

4. The character string similarity calculation method according to claim 1, wherein the matching information includes the length of the partial character string, the position at which the partial character string appears in the first character string, the position at which it appears in the second character string, and a sequence number indicating which match it is in the second character string.

5. A character string similarity calculation device for calculating the similarity between two character strings, comprising: a selection unit that selects partial character strings, from among the partial character strings cut out from the first character string, based on their effect on the similarity calculation; a matching information collection unit that collects matching information with the second character string; and a similarity calculation unit that calculates the weights of the matched partial character strings based on the matching information and calculates the similarity by summing the weights.

6. A character string similarity calculation device for calculating the similarity between two character strings, comprising: a selection unit that selects partial character strings cut out from the first character string based on their effect on the similarity calculation; a matching information collection unit that collects, for the selected partial character strings, matching information with the second character string; a matching partial character string selection unit that selects, from among the partial character strings included in the matching information, the partial character strings whose order of appearance in each character string matches; and a similarity calculation unit that sums the weights assigned to the matching partial character strings.

7. A character string similarity calculation program for calculating the similarity between two character strings, the program functioning as: selection means for selecting partial character strings cut out from the first character string based on the number of times they appear in the second character string; matching information collection means for collecting matching information of the selected partial character strings with the second character string; and similarity calculation means for calculating the weights of the matched partial character strings based on the matching information and calculating the similarity by summing the weights.

8. A character string similarity calculation program for calculating the similarity between two character strings, the program functioning as: selection means for selecting partial character strings cut out from the first character string based on their effect on the similarity calculation; matching information collection means for collecting, for the selected partial character strings, as matching information with the second character string, the length of the partial character string, the position at which it appears in the first character string, the position at which it appears in the second character string, and a sequence number indicating which match it is in the second character string; and similarity calculation means for calculating the similarity based on the weights given to those partial character strings, among the partial character strings included in the matching information, whose order of appearance in each character string matches.

9. A computer-readable recording medium on which the program according to claim 7 is recorded.