JP4065695B2 - Character string similarity calculation device, character string similarity calculation program, computer-readable recording medium recording the same, and character string similarity calculation method

Info

Publication number: JP4065695B2
Application number: JP2002012259A
Authority: JP
Inventors: 恭司梅村
Original assignee: Sumitomo Electric Industries Ltd
Current assignee: Sumitomo Electric Industries Ltd
Priority date: 2001-01-24
Filing date: 2002-01-22
Publication date: 2008-03-26
Anticipated expiration: 2022-01-22
Also published as: JP2002297660A

Description

【０００１】
【発明の属する技術分野】
本発明は、二つの文字列の類似度判定に関するものであり、特に情報検索において、入力された文字列とデータベースに登録された文書との類似度判定に用いると好適である。
【０００２】
【従来の技術】
文書のデータベースから、所望の文書を取り出す情報検索がよく行なわれている。このような情報検索において、文書は、複数の文字からなる単語を組み合わせた文字列の集合として扱う。そして、検索文字列と検索対象文書中の文字列同士を比較し、類似度の高いものを一つあるいは複数選び出すことで情報検索を行っている。この文字列同士の類似度は、大きく分けて、形態素解析を用いる方法と、長さｎの部分文字列（以下、ｎグラムと称する）の一致を求める方法の２通りがある。
【０００３】
形態素解析を用いる方法は、例えば、Gerard Salton and Christopher Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Proceeding and Management, 24, pp.513-523, 1988.に開示されている。この方法で二つの文字列同士の類似度を求める基本的手順は以下のようになる。まず、両方の文字列を、辞書と文法知識を用いた形態素解析により単語の列に分解する。次に、両方の単語列を比較して、一致する単語を求める。そして、一致する単語に対して重みを設定する。その上で、この重みを、すべての一致する単語に関して加算する。この加算の結果得られた総和が、形態素解析による類似度となる。
【０００４】
形態素解析を用いる方法は、形態素解析自体の精度が低いと情報検索が不調に終わるという本質的な問題を有している。形態素解析の精度を上げるには、単語辞書や文法規則などが大規模にならざるを得ず、簡単に情報検索を利用することが難しくなる。さらに、流行語、造語、限られた分野でのみ用いられる専門用語が出現する文書では、単語辞書の整備の手間が大きな負担となる。
【０００５】
次に、ｎグラムによる方法は、例えば、Yasushi Ogawa and Toru Matsuda, Overlapping statistical word indexing: A new indexing method for Japanese text, In proceeding of SIGIR'97, Philadelphia PA, USA, pp.226-234, 1997.に開示されている。この方法で文字列同士の類似度を求める基本的手順は以下のようになる。まず、両方の文字列に共通して含まれるｎ文字の部分文字列を求める。次に、この共通する部分文字列に対して、重みを設定する。そして、この重みを、すべての一致する部分に関して加算する。この加算の結果得られた総和が、ｎグラムによる類似度となる。
【０００６】
共通する部分文字列の重みの設定に関しては、特定の文書に集中的に出現して、他の文書には出現しない文字列には大きな値が与えられる。逆に多くの文書に出現する文字列には小さな値しか与えられない。これは多くの文書に出現する文字列は文書を特徴づける要素になっておらず、検索に際し、有効に利用できないことを反映したものである。
【０００７】
ｎグラムによる方法は、形態素解析を要しないため、新しい技術用語などの未知語にも対応することができ、簡単に利用できる。
【０００８】
ｎグラムによる方法の中でも、特に、長さ２の文字列（以下、bigramと称する）による類似度の算出においては、切り出されるすべてのbigramを対象にするのではなく、ひらがなを含まないbigramに限定して、文字列同士に関する一致情報を求め、類似度を算出する方法がある。これは、ひらがなを含むbigramは、文書データベース内の多くの文書で出現する可能性が高く、各文書を特徴づける文字列となる確率が極めて小さいことを考慮したものである。ひらがなを含むbigramを文字列の比較の対象に含めると計算量が大きくなるばかりで、検索の精度は大きな向上が期待できないと認識されている。
【０００９】
【発明が解決しようとする課題】
上記のように、ｎグラムによる方法は形態素解析による方法と比較して有利な面が多く、情報検索の分野で利用される場面が多い。ｎグラム法における問題は、文書データベースが大きくなるに従って計算量が増し、検索結果を得るまでの時間がかかることである。
【００１０】
ところで、文字列から切り出される部分文字列の中には、必ずしも類似度の算出において有効でない部分文字列、すなわち文書データベース内の多くの文書に含まれるため重みが小さく、類似度に与える影響の小さい部分文字列が多く含まれていると考えられている。そのため、文字列から切り出されるすべてのｎグラムについて一致するかどうかを調べる方法は、計算時間の点からみて効率がよくない。
【００１１】
また、切り出されたbigramに対し、ひらがなを含む文字列を一致情報として扱わないという方法は、計算時間の点については効率化されているが、本来は検索に有効な文字列までも一律に切り捨てられてしまう。その結果、検索精度が低下してしまうという問題があった。
【００１２】
本発明は上記したｎグラム法における問題を解決し、計算時間短縮と検索精度向上を両立する方法を提供するものである。
【００１３】
【課題を解決するための手段】
請求項１に記載の発明は、二つの文字列の類似度を算出する方法において、第１の文字列から切り出した部分文字列のうち、類似度算出に対する効果に基づいて選別した部分文字列について、第２の文字列との一致情報を収集し、前記一致情報から一致した部分文字列の重みを算出し、前記重みに基づいて類似度を算出することを特徴とする。このように文字列同士の一致情報を求める際に、切り出されるすべてのｎグラムを扱うのではなく、類似度の算出に対する効果を推定することで、類似度の算出に有効なｎグラムを選別している。
【００１４】
かかるように構成されているので、選別しない方法による検索精度とほぼ同等の検索精度を保ちつつ、計算時間の短縮を図ることが可能となる。
【００１５】
請求項２に記載の発明は、二つの文字列の類似度を算出する方法において、第１の文字列から切り出した部分文字列のうち、類似度算出に対する効果に基づいて選別した部分文字列について、第２の文字列との一致情報を収集し、前記一致情報に含まれる部分文字列の中から、第１および第２の文字列に出現する順序が適合する部分文字列の重みに基づいて類似度を算出することを特徴とする。このように、文字列同士の類似度の算出において、一致情報に記録された部分文字列の中から、さらにそれぞれの文字列に出現する順序が適合する部分文字列を選び、その重みを類似度の算出に用いている。
【００１６】
かかるように構成されているので、重みを算出する部分文字列が限定され、検索精度を保ったまま、一層の計算の高速化が達成できる。
【００１７】
請求項３に記載の発明は、請求項１または２に記載の発明において、部分文字列の選別を、第１の文字列から切り出した部分文字列が第２の文字列に出現する回数を加味して行うことを特徴とする。
【００１８】
かかるように構成されているので、類似度の算出に効果のある部分文字列が効率的に選別でき、検索精度を保ちつつ、一層の計算時間の短縮を図ることが可能となる。
【００１９】
請求項４に記載の発明は、請求項１から３のいずれかに記載の発明において、文字列同士の一致情報として、部分文字列の長さ、第１の文字列における部分文字列の出現場所、第２の文字列における部分文字列の出現場所、第２の文字列内で何番目の一致かを表すシーケンス番号、を含むことを特徴とする。
【００２０】
かかるように構成されているので、一致したｎグラムの重みを加算して類似度を算出する方法だけでなく、多くの類似度算出方法を併用することが可能となる。
【００２１】
請求項５に記載の発明は、二つの文字列の類似度を算出する文字列類似度算出装置において、第１の文字列から切り出した部分文字列のうち、類似度算出に対する効果に基づいて選別する部分文字列選別部と、第２の文字列との一致情報を収集する一致情報収集部と、前記一致情報に基づき一致した部分文字列の重みを算出し、前記重みを総和することで類似度を算出する類似度算出部と、を有することを特徴とする文字列類似度算出装置である。
【００２２】
かかるように構成されているので、選別しない方法による検索精度とほぼ同等の検索精度を保ちつつ、計算時間の短縮を図る装置を実現できる。
【００２３】
請求項６に記載の発明は、二つの文字列の類似度を算出する文字列類似度算出装置において、第１の文字列から切り出した部分文字列について、類似度算出に対する効果に基づいて選別する部分文字列選別部と、前記選別部分文字列について、第２の文字列との一致情報を収集する一致情報収集部と、前記一致情報に含まれる部分文字列の中から、それぞれの文字列に出現する順序が適合する部分文字列を選択する適合部分文字列選択手段と、前記適合部分文字列に付けられた重みを総和する類似度算出部と、を有することを特徴とする文字列類似度算出装置である。
【００２４】
かかるように構成されているので、重みを算出する部分文字列が限定され、検索精度を保ったまま、一層の計算の高速化が達成できる装置を実現できる。
【００２５】
請求項７に記載の発明は、二つの文字列の類似度を算出する文字列類似度算出プログラムであって、コンピュータを、第１の文字列から切り出した部分文字列が第２の文字列に出現する回数に基づいて選別する選別手段と、前記選別部分文字列について、第２の文字列との一致情報を収集する一致情報収集手段と、前記一致情報に基づき一致した部分文字列の重みを算出し、前記重みを総和することで類似度を算出する類似度算出手段、として機能させることを特徴とする文字列類似度算出プログラムである。
【００２６】
かかるように構成されているので、コンピュータを、選別しない方法による検索精度とほぼ同等の検索精度を保ちつつ、計算時間の短縮を図る手段として機能させることができる。
【００２７】
請求項８に記載の発明は、二つの文字列の類似度を算出する文字列類似度算出プログラムであって、コンピュータを、第１の文字列から切り出した部分文字列について、類似度算出に対する効果に基づいて選別する選別手段と、前記選別部分文字列について、第２の文字列との一致情報として、部分文字列の長さ、第１の文字列における部分文字列の出現場所、第２の文字列における部分文字列の出現場所、第２の文字列内で何番目の一致かを表すシーケンス番号、を収集する一致情報収集手段と、前記一致情報に含まれる部分文字列の中から、それぞれの文字列に出現する順序が適合する部分文字列に付けられた重みに基づいて類似度を算出する類似度算出手段、として機能させることを特徴とする文字列類似度算出プログラムである。
【００２８】
かかるように構成されているので、コンピュータを、重みを算出する部分文字列を限定し検索精度を保ったまま、一層の計算の高速化を図る手段として機能させることができる。
【００２９】
請求項９に記載の発明は、請求項７または８に記載のプログラムを記録したコンピュータ読取可能な記録媒体である。
【００３０】
かかるように構成されているので、記録された機能を必要な場所で実現することができる。
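[0031]
[Embodiments of the invention]
The present invention relates to a method of calculating the similarity between character strings using n-grams, and to a device, program, and recording medium that implement it. The description assumes that similarity is calculated between an input character string and multiple documents registered in a database, but other applications are possible. To find the matching portions, the invention does not compute, for every document in the database, the n-grams shared by the input string and that document. Instead, it cuts n-grams out of the input string and uses a suffix file to retrieve efficiently from the database the documents containing each n-gram.
[0032]
The n-grams cut out of the input string are thought to include many that have little effect on the similarity. If the input string has length m and the substrings are restricted to bigrams (two characters), m-1 substrings are cut out. If matching information is collected for every extracted bigram without selection, computation time therefore grows with the input length m. The present invention instead selects n-grams by estimating their effect on the similarity calculation, and uses only a constant number of n-grams for collecting matching information.
[0033]
To improve retrieval accuracy, the substrings that contribute large weights to the similarity need to be selected. Because the weight is determined by the number of documents in the database that contain the substring (hereinafter df), selecting on df is the natural choice. In one aspect of the invention, however, it is recommended to select substrings by the number of times the n-gram occurs in the documents of the database (hereinafter the occurrence frequency, or tf) instead of df.
[0034]
tf can be computed directly, and its computation time hardly changes as the value grows. Computing df, by contrast, requires first computing tf and then accounting for repeated occurrences within each document, so the computation time becomes large when tf is large. Since tf and df are strongly correlated, using tf in place of df can be expected not to affect retrieval accuracy.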
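The difference between tf and df can be illustrated with a minimal sketch. The helper names (`ngrams`, `count_occurrences`, `tf_and_df`) and the toy database are assumptions introduced for illustration only, and the counts are computed naively rather than with the suffix file that the description uses.

```python
def ngrams(text, n=2):
    """Cut every substring of length n out of text (m - n + 1 of them)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_occurrences(term, text):
    """Count occurrences of term in text, including overlapping ones."""
    return sum(1 for i in range(len(text) - len(term) + 1)
               if text.startswith(term, i))


def tf_and_df(term, docs):
    """Return (tf, df): total occurrences, and number of documents containing term."""
    counts = [count_occurrences(term, doc) for doc in docs]
    tf = sum(counts)
    df = sum(1 for c in counts if c > 0)
    return tf, df


docs = ["abab", "xab", "zzz"]      # hypothetical toy database
print(ngrams("abcd"))              # an input of length m = 4 yields m - 1 = 3 bigrams
print(tf_and_df("ab", docs))       # tf counts every occurrence, df counts documents
```

Because tf counts every occurrence while df counts each document once, tf is never smaller than df, which is consistent with the correlation the description relies on.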
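[0035]
Matching information is collected for each selected n-gram as follows.
First, the weight of the n-gram is calculated. Next, the documents containing the n-gram are found in the whole document database, and the following are recorded as matching information: the position of the n-gram in the document, the position of the n-gram in the input string, the length of the n-gram, a sequence number indicating which occurrence within the document the match is, and the weight of the n-gram.
[0036]
Ordinarily the matching information is not recorded or managed; the weights are added as soon as matches are found and the similarity is calculated. The present invention records and manages this information. It can therefore be applied not only to the method that sums the weights of matching n-grams but also to many other similarity calculation methods, while remaining fast.
[0037]
The similarity between the input string and a document in the database is calculated by summing the weights attached to the matching n-grams. Even when a matching n-gram occurs two or more times in the same document, its weight is added only once; two occurrences do not double the added weight. Instead, the weight stored as matching information differs from document to document according to the number of times the n-gram occurs in the document (hereinafter dtf): the larger dtf is, the larger the weight. The weight is also defined for dtf = 0, and a weight of 0 or less is added for every document that does not contain the n-gram. This reflects the view that the absence of an n-gram indicates dissimilarity, so the added weight can be regarded as quantifying the degree of dissimilarity. The similarity is therefore a real number that may be negative, and the larger its value, the more similar the document is to the input string.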
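[0038]
The method of calculating similarity is not limited to summing the weights of matching n-grams; other methods can also be applied. One example selects, from the matching n-grams, only those whose order of appearance is consistent in both character strings, and takes the sum of their weights as the similarity.
[0039]
Methods that take the order of appearance into account have been proposed before. However, they calculate weights for matching substrings of every length, so they require more computation than summing the weights of matching n-grams, and the computation time grows with the length of the strings handled.
[0040]
In the present invention, weights are given to the selected n-grams, and a weight is added to the similarity only when one of those n-grams matches. Matches and order of appearance therefore need to be considered only for the selected substrings, so the calculation is faster than with earlier methods, and the length of the strings has little effect on computation time. To find efficiently the combination of order-consistent n-grams that gives the highest similarity, the method uses dynamic programming (hereinafter DP). The following describes how the n-gram DP similarity is obtained with DP.
[0041]
Let α, β, γ, δ be character strings of length 0 or more, let ξ, ζ, η be character strings of length 1 or more, and let "" be the empty string. The concatenation of several strings (for example, ξ and γ) is written by juxtaposing their symbols (for example, α = ξγ).
[0042]
The n-gram DP similarity SimDP is obtained by recursively applying the following equations according to the matching pattern of the argument strings. When both strings are empty,
SimDP("", "") = 0 (1)
Otherwise,
SimDP(α, β) = MAX( SimDPs(α, β), SimDPg(α, β) ) (2)
Here, let ξ be a string recorded in the list for α and β in the matching information management table, with α = ξγ and β = ξδ. SimDPs(α, β) is then obtained by evaluating
SimDPs(α, β) = MAX( Score(ξ) + SimDP(γ, δ) ) (3)
over all such ξ. When no such string ξ exists,
SimDPs(α, β) = 0.0 (4)
Score(ξ) is a function that returns the weight of ξ recorded in the matching information management table.
[0043]
Let ζ and η be the longest strings that satisfy α = ζγ and β = ηδ and share no part with any string recorded in the list for α and β in the matching information management table. SimDPg(α, β) is then obtained by
SimDPg(α, β) = MAX( SimDP(α, δ), SimDP(γ, β), SimDP(γ, δ) ) (5)
This equation takes the highest of the similarities between the strings that remain (α and δ, γ and β, γ and δ) after removing ζ, η, or both from the two strings ζγ and ηδ.
[0044]
Applying these equations recursively finds, among the substrings recorded in the matching information management table, those that are consistent with the order in both strings and that maximize the similarity.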
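The recursion of equations (1) to (5) can be sketched as a memoized function. This is a simplified illustration, not the flow of the embodiments: the name `sim_dp` and the `weights` dictionary (standing in for Score and the matching information management table) are assumptions, ζ and η are skipped one character at a time instead of as maximal unmatched strings, and the similarity is taken as 0 as soon as either remaining string is empty.

```python
from functools import lru_cache


def sim_dp(x, y, weights):
    """n-gram DP similarity of x and y; weights maps each recorded n-gram to its score."""
    terms = sorted(weights)

    @lru_cache(maxsize=None)
    def rec(i, j):
        # Equation (1), extended by assumption to "either remainder is empty".
        if i >= len(x) or j >= len(y):
            return 0.0
        best = 0.0  # equation (4): SimDPs is 0.0 when no recorded prefix matches
        # Equation (3): xi is a recorded n-gram that starts both remainders.
        for t in terms:
            if x.startswith(t, i) and y.startswith(t, j):
                best = max(best, weights[t] + rec(i + len(t), j + len(t)))
        # Equation (5), simplified: drop one leading character of y, of x, or of both.
        best = max(best, rec(i, j + 1), rec(i + 1, j), rec(i + 1, j + 1))
        return best  # equation (2)

    return rec(0, 0)


weights = {"ab": 1.0, "de": 2.0}            # hypothetical recorded n-grams and scores
print(sim_dp("abcde", "abxde", weights))    # both n-grams appear in the same order
print(sim_dp("deab", "abde", weights))      # the n-grams cross, so only one can count
```

The second call shows the effect of the ordering constraint: both n-grams occur in both strings, but only the heavier one contributes because their order differs. For long inputs an iterative table, as in the fourth embodiment, avoids deep recursion.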
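[0045]
As equation (3) shows, the n-gram DP similarity considers matches of n-grams (strings of length n) rather than of single characters (strings of length 1) and assigns weights to them, so the similarity reflects the continuity of characters. Matches are also restricted to the n-grams recorded in the matching information management table rather than considered for every substring. As equation (5) shows, portions unrelated to the recorded substrings need not be tested for matches and can simply be removed. However long the strings being compared are, only the portions related to the recorded substrings need to be considered, so the similarity can be calculated faster than with conventional methods that consider matches for every substring.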
【００４６】
（第１実施例）
まず、ｎグラムを用いて文字列同士の類似度を算出する方法の実施例を示す。図１は選別された部分文字列に基づき、入力された文字列と最も類似度の高い文書を検索する文書検索装置の例である。この文書検索装置は、文書データベース１０、文字列入力部１１、部分文字列選別部１３、一致情報収集部１４、類似度算出部１５、類似度算出制御部１６、及び、検索結果出力部１２から構成されている。
【００４７】
文書データベース１０には、検索対象となる複数の文書１０ａ、１０ｂ、…、１０ｃが登録されている。検索のためには、キーワード、語、語句、文、文章などを入力する（以下、代表して検索文章と呼ぶ）。文字列入力部１１は、検索文章を文字列Ｘとして部分文字列選別部１３に与える。
【００４８】
部分文字列選別部１３は、文字列入力部１１から与えられた文字列Ｘから部分文字列を切り出し、出現頻度ｔｆを算出した後、算出したｔｆの小さいものから定数個を取り出し、部分文字列管理テーブルＴ１に登録する。部分文字列選別部１３は、部分文字列切り出し制御部３１、部分文字列切り出し部３２、文字列出現頻度算出部３３、部分文字列登録部３４から成る。部分文字列切り出し制御部３１は、部分文字列切り出し部３２がどの部分文字列を切り出すかを制御する。部分文字列切り出し部３２は文字列Ｘより部分文字列ｘを切り出す。文字列出現頻度算出部３３は部分文字列ｘの文書データベース１０内における出現頻度ｔｆを算出する。部分文字列登録部３４は切り出された部分文字列をｔｆの値の小さい順に定数個選び、部分文字列管理テーブルＴ１に登録する。
【００４９】
一致情報収集部１４は部分文字列管理テーブルＴ１に登録された各部分文字列に対し、文字列に与える重みを算出し、文書データベース内における各部分文字列の出現場所を検出し、一致情報として、文書内における出現場所、入力文字列における出現場所、部分文字列の文字列の長さ、文書内において何番目の一致かを表すシーケンス番号、文字列の重みを一致情報管理テーブルＴ２に記録する。一致情報収集部１４は、一致情報収集制御部４１、文字列出現場所検索部４２、文字列重み算出部４３、一致情報登録制御部４４、一致情報登録部４５から成る。一致情報収集制御部４１は部分文字列管理テーブルＴ１に登録された部分文字列ａを１つずつ取り出し、文字列出現場所検索部４２と文字列重み算出部４３に与える。文字列出現場所検索部４２は与えられた部分文字列ａの文書データベース内における出現場所の全てについて、出現場所、部分文字列ａの長さ、文書内において何番目の一致かを表すシーケンス番号を求める。文字列重み算出部４３は与えられた文字列ａに与える重みを計算する。一致情報登録制御部４４は部分文字列ａの出現場所を１つずつ選び、文書内における出現場所、入力文字列における出現場所、部分文字列ａの長さ、文書内におけるシーケンス番号、部分文字列ａの重みと組にして、一致情報登録部４５に与える。一致情報登録部４５は受け取った各組を、一致情報管理テーブルＴ２の該当する文書番号のリストに一致情報として登録する。
【００５０】
類似度算出制御部１６は一致情報管理テーブルＴ２から、ある１つの文書Ｙに関するリストを取り出し、類似度算出部１５に与える。
【００５１】
類似度算出部１５は、与えられた一致情報のリストより、ＸとＹの類似度を算出する。類似度算出部１５は、文字列重み加算制御部５１、文字列重み加算部５２から成る。文字列重み加算制御部５１は一致情報のリストより１つの一致情報を選び、その一致情報の持つシーケンス番号が１であれば、その文字列の重みscoreを文字列重み加算部５２に与える。文字列重み加算部５２は与えられた重みをＸとＹの類似度Sim(X,Y)に加算する。
【００５２】
検索結果出力部１２は類似度が最も高い文書を選択し出力する。この時、類似度が一定値以上の文書や、上位から一定の順位までの文書を合わせて出力しても良い。
【００５３】
（第２実施例）
本発明による文章検索をソフトウエアにより実施する実施例を以下に説明する。
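[0054]
FIG. 2 shows an example of a computer system used to execute the text search. The computer system comprises a display 101, a printer 102, a keyboard 103, a floppy (R) disk device 104, a CD-ROM (Compact Disk Read Only Memory) device 105, a read-only memory (hereinafter ROM) 106, a readable and writable random access memory (hereinafter RAM) 107, a magnetic disk device 108, a central processing unit (hereinafter CPU) 109, a communication interface 110, and a bus 111 connecting them. The floppy (R) disk device 104 reads and writes a floppy (R) disk 112, and the CD-ROM device 105 reads a CD-ROM 113. The computer system is connected to a communication network 114 through the communication interface 110.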
【００５５】
本発明を実施する文章検索プログラムは、ＲＯＭ１０６に記憶しておく。あるいは、フロッピー（Ｒ）ディスク１１２、ＣＤ−ＲＯＭ１１３、又は、磁気ディスク装置１０８に文章検索プログラムを記憶しておき、ＲＡＭ１０７に転送した後、ＣＰＵ１０９が実行するのでも良い。ＣＰＵ１０９は、ＲＡＭ１０７を作業領域に使って文章検索プログラムを実行する。必要に応じて、磁気ディスク装置１０８を作業領域に使っても良い。文章検索プログラムの実行の指示はキーボード１０３から行い、実行結果は、ディスプレイ１０１、又は、プリンタ１０２に出力する。文章検索プログラムの実行を、フロッピー（Ｒ）ディスク１１２から指示することや、実行結果をフロッピー（Ｒ）ディスク１１２に書き込んでも良いのは言うまでもない。
【００５６】
文書データベースは、フロッピー（Ｒ）ディスク１１２、ＣＤ−ＲＯＭ１１３、又は、磁気ディスク１０８に蓄えておく。高速なアクセスのためにＲＡＭ１０７に転送しておくのでも良い。ＲＡＭ１０７に転送する際に、容易に処理できる形式に変換するのも良い。また、文章検索プログラム、文書データベース、又は、実行の指示を、ネットワーク１１４経由で本計算機システムに入力したり、実行の結果をネットワーク１１４経由で本計算機システムから出力したりしても良いことは、もちろんである。
【００５７】
また、図に示されたものに限らず、各種の記録媒体、入力手段、出力手段を用いて、本計算機システムへの入力と出力を行うなど各種の実施態様への変形が可能なことは言うまでもない。これらの、記録媒体、入力手段、出力手段は本計算機システムが直接アクセスするものの他、通信ネットワークを経由してアクセスするものであっても良いのはもちろんである。
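[0058]
FIGS. 3 to 7 show the processing flow of a document search program that uses a character string similarity calculated from selected n-grams.
[0059]
FIG. 3 shows the processing flow for searching the document database based on a search text and selecting and outputting documents with high similarity.
[0060]
First, in step S11 (hereinafter abbreviated S11), all documents in the document database are combined to create a suffix file, in preparation for efficiently counting the occurrences of a character string. The construction and use of suffix files are disclosed in M. Yamamoto and K. W. Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, In Proceedings of the 6th Workshop on Very Large Corpora, Ed. Eugene Charniak, Montreal, pp. 28-37, 1998.
[0061]
With a suffix file, the number of times a character string occurs in the document database can be obtained quickly. The suffix file is implemented by sorting all the substrings that can arise in the documents in character-code order and assigning them serial numbers (suffixes). The number of occurrences of a string in the document database is obtained by counting the entries in the suffix file that match the string.
[0062]
Specifically, the minimum value min and maximum value max of the suffixes of the matching entries are each found by binary search. If no entry matches, the number of occurrences in the document database is 0. Once min and max are found, the number of occurrences is tf = max - min + 1.
[0063]
The documents in the document database are distinguished from one another by document numbers, and each substring registered in the suffix file is tagged with its document number. This makes it possible to retrieve efficiently the documents containing a given substring. df can be calculated by counting the duplicated document numbers and subtracting that count from tf.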
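The calculation of tf by binary search, and of df from the document numbers, can be sketched as follows. This is a minimal illustration under stated assumptions: the names `build_suffix_index` and `suffix_tf_df` are hypothetical, the index stores whole suffix strings (a practical suffix file would store positions), and the upper bound assumes the text never contains the code point U+10FFFF.

```python
from bisect import bisect_left


def build_suffix_index(docs):
    """Sort every suffix of every document, each tagged with its document number."""
    entries = sorted((doc[i:], doc_no)
                     for doc_no, doc in enumerate(docs)
                     for i in range(len(doc)))
    suffixes = [s for s, _ in entries]
    doc_nos = [d for _, d in entries]
    return suffixes, doc_nos


def suffix_tf_df(term, suffixes, doc_nos):
    """Return (tf, df) of term using two binary searches over the sorted suffixes."""
    lo = bisect_left(suffixes, term)                   # index of the first match (min)
    hi = bisect_left(suffixes, term + "\U0010FFFF")    # one past the last match
    if lo == hi:
        return 0, 0                                    # no suffix starts with term
    min_, max_ = lo, hi - 1
    tf = max_ - min_ + 1
    matched = doc_nos[lo:hi]
    duplicates = len(matched) - len(set(matched))      # repeated document numbers
    return tf, tf - duplicates


suffixes, doc_nos = build_suffix_index(["abab", "xab", "zzz"])   # toy database
print(suffix_tf_df("ab", suffixes, doc_nos))
```

Only the two binary searches are needed for tf, whereas df additionally has to examine the document numbers of all matching entries, which matches the remark that df becomes expensive when tf is large.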
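[0064]
Next, in S12, the search text is read into character string X.
[0065]
In S13, the substrings cut out of X are selected based on their occurrence frequency tf in the document database and recorded, paired with tf, in the substring management table. The processing in S13 is described later with reference to FIG. 4.
[0066]
In S14, matching information is collected for each substring recorded in the substring management table and recorded in the matching information management table, as a list of matching information for each document number. The processing in S14 is described later with reference to FIG. 5.
[0067]
In S15, the list for one document Y is taken from the matching information management table.
[0068]
Next, in S16, the similarity between X and Y is calculated from that list. The processing in S16 is described later with reference to FIG. 6.
[0069]
In S17, the similarity and the document number are registered as a pair in the document management table.
[0070]
In S18, it is determined whether the similarity has been calculated for every list recorded in the matching information management table. If not, a list whose similarity has not yet been calculated is taken in S15 and the processing up to S17 is repeated. If so, the registered table is sorted in descending order of similarity in S19.
[0071]
In S20, the documents with high similarity are output. Various forms are possible: outputting a single document, a predetermined number of documents, or all documents whose similarity is at or above a predetermined value.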
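[0072]
FIG. 4 shows the flow of cutting substrings out of the string X holding the search text, selecting by occurrence frequency the substrings to be used for collecting matching information, and recording them in the substring management table.
[0073]
First, in S31, the variable num_substring, which holds the number of substrings recorded in the substring management table, and the variable j, which holds the length of the substrings to cut out, are initialized. MinNgramLength is a parameter that sets the minimum length of the substrings to cut out.
[0074]
Next, in S32, one substring of length j is cut out of X, and its occurrence frequency tf in the document database is calculated.
[0075]
In S33, it is determined whether tf of the substring is 0. If tf = 0, the substring does not exist in the document database and is unsuitable for collecting matching information, so S34 is skipped and processing proceeds to S35. If tf ≠ 0, processing proceeds to S34.
[0076]
In S34, the substring is recorded with its tf in the substring management table, and 1 is added to num_substring.
[0077]
In S35, it is determined whether tf has been calculated for every substring of length j cut out of X. If not, a substring of length j not yet processed is selected in S32, its tf is calculated, and the processing up to S34 is repeated. If so, 1 is added to j in S36.
[0078]
In S37, it is determined whether j exceeds MaxNgramLength, the parameter that sets the maximum length of the substrings to cut out. If j is MaxNgramLength or less, processing returns to S32 and the processing up to S36 is repeated for substrings of length j. If j is greater, tf has been calculated for all substrings of length MinNgramLength to MaxNgramLength, so processing proceeds to S38, where the substrings recorded in the substring management table are sorted in ascending order of tf.
[0079]
In S39, it is determined whether num_substring exceeds SubStringLimit, the parameter that sets the upper limit on the number of substrings used for collecting matching information. If it does, processing proceeds to S40, where the SubStringLimit substrings with the smallest tf are taken and recorded anew in the substring management table. If num_substring is SubStringLimit or less, S40 is skipped and processing proceeds to S41.
[0080]
In S41, the resulting substring management table is returned.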
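The selection flow of FIG. 4 can be summarized in a short sketch. The function name `select_substrings` and the callable `tf_of` are assumptions for illustration; `tf_of` stands in for the suffix-file lookup, and the parameters mirror MinNgramLength, MaxNgramLength, and SubStringLimit.

```python
def select_substrings(x, tf_of, min_ngram_length=2, max_ngram_length=2,
                      substring_limit=22):
    """Return up to substring_limit (substring, tf) pairs with the smallest nonzero tf."""
    table = []                                                   # substring management table
    for j in range(min_ngram_length, max_ngram_length + 1):      # S31, S36, S37
        for i in range(len(x) - j + 1):                          # S32, S35
            substring = x[i:i + j]
            tf = tf_of(substring)
            if tf != 0:                                          # S33: skip absent substrings
                table.append((substring, tf))                    # S34
    table.sort(key=lambda entry: entry[1])                       # S38: ascending tf
    return table[:substring_limit]                               # S39 to S41


# Hypothetical occurrence frequencies standing in for the document database.
toy_tf = {"ab": 5, "bc": 1, "ca": 0}
print(select_substrings("abcab", lambda s: toy_tf.get(s, 0), substring_limit=2))
```

As in the flow described above, a substring that occurs twice in X is recorded twice; removing such duplicates would be a design choice outside what the description states.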
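[0081]
FIG. 5 shows the flow of collecting matching information between each substring recorded in the substring management table and each document in the document database, and recording it in the matching information management table.
[0082]
First, in S51, the variable p0fit_sum is initialized to 0. p0fit_sum is used to speed up the calculation when the similarity is obtained by summing the weights of matching n-grams; it is a similarity offset that applies to all documents in the document database.
[0083]
In S52, one substring is selected from the substring management table and read into a.
[0084]
In S53, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated, and p0fit is added to p0fit_sum. These are the weights of a in a document in which a does not occur, occurs once, occurs twice, occurs three times, and occurs four or more times, respectively. Their calculation is described later with reference to FIG. 7.
[0085]
In S54, all positions at which a occurs in the document database are found and sorted by position.
[0086]
In S55, the document number of the document containing a is obtained for each occurrence position. Since the occurrences are sorted by position, the document numbers obtained are also in ascending order.
[0087]
In S56, one occurrence position of a is selected, in position order.
[0088]
In S57, it is determined whether the selected position is the first occurrence of a in the document that contains it. If the document of the selected position differs from that of the preceding position, it is the first occurrence; if it is the same, it is a second or later occurrence. For a first occurrence, processing proceeds to S58, where the number of occurrences dtf of a in that document is calculated, the weight of a in the document is determined, and sequence_num is set to 1. sequence_num is the sequence number indicating which occurrence of a in the document the selected position is.
[0089]
In S59, the sequence number sequence_num, the position of a in the input string X (hereinafter startX), the position of a in the document (hereinafter startdoc), the length of a (hereinafter termlength), and the weight of a (hereinafter score) are recorded as a set in the matching information management table, and 1 is added to sequence_num. The value recorded as score is not the weight of a itself but the weight of a minus p0fit. When the similarity is calculated by summing the weights of matching n-grams, this avoids adding the p0fit of each selected substring to the similarity of every document that does not contain it: the weight minus p0fit is added for matching n-grams, and the offset p0fit_sum is added to every similarity at the end, which reduces the amount of computation.
[0090]
In S60, sequence_num is compared with tf to determine whether matching information has been recorded for every occurrence position of a. If some has not, the next occurrence position of a is selected in S56 and the processing up to S59 is repeated. If all has been recorded, processing proceeds to S61.
[0091]
In S61, it is determined whether matching information has been collected for every substring in the substring management table. If not, a substring not yet selected is read into a in S52 and the processing up to S60 is repeated. If so, the resulting matching information management table is returned in S62.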
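The collection flow of FIG. 5 can be sketched as below. This is an illustration under assumptions, not the embodiment itself: the names `find_all` and `collect_matches` are hypothetical, `fit_weights` is a placeholder callable returning (p0fit, p1fit, p2fit, p3fit, p4fit) for a substring, documents are scanned directly instead of through the suffix file, and startX is taken as the first position of the substring in X.

```python
from collections import defaultdict


def find_all(term, text):
    """Return every start position of term in text, in ascending order."""
    return [i for i in range(len(text) - len(term) + 1) if text.startswith(term, i)]


def collect_matches(x, substrings, docs, fit_weights):
    """Build the matching information table (document number -> records) and p0fit_sum."""
    table = defaultdict(list)
    p0fit_sum = 0.0                                      # S51
    for a in substrings:                                 # S52, S61
        fits = fit_weights(a)                            # S53
        p0fit_sum += fits[0]
        start_x = x.find(a)
        for doc_no, doc in enumerate(docs):              # S54, S55: ascending positions
            positions = find_all(a, doc)
            if not positions:
                continue
            dtf = len(positions)                         # S58: occurrences in this document
            score = fits[min(dtf, 4)] - fits[0]          # S59: weight minus p0fit
            for sequence_num, start_doc in enumerate(positions, start=1):
                table[doc_no].append({
                    "sequence_num": sequence_num,
                    "startX": start_x,
                    "startdoc": start_doc,
                    "termlength": len(a),
                    "score": score,
                })
    return table, p0fit_sum


# Hypothetical weights: p0fit = -1.0, then 1.0, 2.0, 3.0, 4.0 for one to four or more hits.
table, offset = collect_matches("xabx", ["ab"], ["abab", "xyz"],
                                lambda a: (-1.0, 1.0, 2.0, 3.0, 4.0))
print(dict(table), offset)
```

Recording one entry per occurrence, each with its sequence number, is what lets the same table serve both the additive method and the order-sensitive DP method.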
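[0092]
FIG. 6 shows the flow of obtaining the similarity between the input text X and a document Y by summing the weights of the matching strings, using the list taken from the matching information management table.
[0093]
First, in S71, the similarity between X and Y (hereinafter sim) is initialized to 0.
[0094]
In S72, one entry is selected from the list for Y recorded in the matching information management table and read into I.
[0095]
In S73, it is determined whether sequence_num of I is 1. This prevents the score of the same substring from being added to sim more than once. If sequence_num is not 1, S74 is skipped and processing proceeds to S75. If sequence_num = 1, the score of I is added to sim in S74.
[0096]
In S75, it is determined whether every entry in the list of matching information for Y has been examined. If so, the offset p0fit_sum is added to sim in S76. If not, an unexamined entry is selected and read into I in S72, and the processing up to S74 is repeated.
[0097]
In S77, sim is returned as the similarity between X and Y.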
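The additive flow of FIG. 6 reduces to a few lines. In this sketch the function name `additive_similarity` and the dictionary layout of a record are assumptions; only the fields sequence_num and score are used.

```python
def additive_similarity(records, p0fit_sum):
    """Sum the score of the first occurrence of each substring, then add the offset."""
    sim = 0.0                                  # S71
    for record in records:                     # S72, S75
        if record["sequence_num"] == 1:        # S73: count each substring once
            sim += record["score"]             # S74
    return sim + p0fit_sum                     # S76, S77


# Hypothetical list for one document: one substring seen twice, another seen once.
records = [
    {"sequence_num": 1, "score": 3.0},
    {"sequence_num": 2, "score": 3.0},
    {"sequence_num": 1, "score": 0.5},
]
print(additive_similarity(records, p0fit_sum=-1.5))
```

Because each stored score is the weight minus p0fit, adding p0fit_sum at the end gives the same total as adding the full weight for substrings that occur in the document and p0fit for those that do not, without ever visiting documents that lack a substring.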
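[0098]
FIG. 7 shows the flow of calculating p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5.
[0099]
First, in S81, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to 0.
[0100]
In S82, df of substring a is calculated, and in S83, idf is calculated from df and the total number N of documents in the document database. idf is grounded in the notion of information content from information theory, and using it as the weight of a substring is a well-known approach.
[0101]
In S84, the threshold tf_threshold, used to decide whether a is an effective substring for retrieval, is calculated.
[0102]
In S85, whether a is effective for retrieval is decided from tf and df. If tf/df > tf_threshold, a is judged effective and processing proceeds to S86. Otherwise, a is judged ineffective, S86 and S87 are skipped, and p0fit, p1fit, p2fit, p3fit, and p4fit are returned in S88; that is, all five values are returned as 0.
[0103]
In S86, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.
[0104]
In S87, the functions MAX and MIN return the maximum and minimum of their arguments, respectively, and are used to limit p0fit, p1fit, p2fit, p3fit, and p4fit to the range from LB to UB inclusive. LB and UB are parameters that bound the distribution of these values. In S88, p0fit, p1fit, p2fit, p3fit, and p4fit are returned.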
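The control flow of FIG. 7 (the validity test in S85 and the clamping in S87) can be sketched as follows. The formulas of S83, S84, and S86 appear only in the figure, so this sketch makes explicit assumptions: idf is taken in the common form -log2(df/N), tf_threshold is passed in rather than computed, and `raw_fit_fn` is a hypothetical placeholder for the S86 formulas. LB = 0 and UB = idf follow the settings reported for the evaluation.

```python
import math


def clamp_fits(raw_fits, lb, ub):
    """S87: limit each weight to the range [lb, ub] with MAX and MIN."""
    return [max(lb, min(ub, p)) for p in raw_fits]


def fit_weights(tf, df, n_docs, tf_threshold, raw_fit_fn, lb=0.0):
    """Return [p0fit, ..., p4fit] for a substring with the given tf and df (df >= 1)."""
    fits = [0.0] * 5                                    # S81
    idf = -math.log2(df / n_docs)                       # S83 (assumed form of idf)
    if tf / df > tf_threshold:                          # S85: effective for retrieval?
        fits = clamp_fits(raw_fit_fn(tf, df, idf), lb, ub=idf)   # S86, S87
    return fits                                         # S88


def example_raw(tf, df, idf):
    """Placeholder weights for illustration only; not the coefficients of FIG. 7."""
    return [-1.0, 0.5 * idf, idf, 2 * idf, 3 * idf]


print(fit_weights(tf=6, df=2, n_docs=8, tf_threshold=1.5, raw_fit_fn=example_raw))
print(fit_weights(tf=6, df=2, n_docs=8, tf_threshold=5.0, raw_fit_fn=example_raw))
```

The second call shows the case where the substring fails the test in S85 and all five weights stay at 0.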
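[0105]
As the above description shows, the weights p0fit, p1fit, p2fit, p3fit, and p4fit are obtained as functions of tf, df, and idf. The coefficients used in S84 and S86 were determined from observations of document data, based on the theory disclosed in Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, pp. 529-574, 1999. These coefficients are not limited to the values shown and may be set to values appropriate to the purpose.
[0106]
The flow shown in FIG. 15 can be used in place of the flow of FIG. 7 for calculating p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5. The flow of FIG. 15 is as follows.
[0107]
First, in S181, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to 0.
[0108]
In S182, df of substring a and the number of documents in the document database in which a occurs two or more times (hereinafter df2) are calculated. In S183, idf is calculated from df and the total number N of documents in the document database.
[0109]
In S184, the threshold df2_threshold, used to decide whether a is an effective substring for retrieval, is set to 0.22.
[0110]
In S185, whether a is effective for retrieval is decided from df and df2. If df2/df > df2_threshold, a is judged effective and processing proceeds to S186. Otherwise, a is judged ineffective, S186 and S187 are skipped, and p0fit, p1fit, p2fit, p3fit, and p4fit are returned in S188; that is, all five values are returned as 0.
[0111]
In S186, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.
[0112]
S187, like S87 of FIG. 7, limits p0fit, p1fit, p2fit, p3fit, and p4fit to the range from LB to UB inclusive. LB and UB are parameters that bound the distribution of these values. In S188, p0fit, p1fit, p2fit, p3fit, and p4fit are returned.
[0113]
As the above description shows, the flow of FIG. 15 uses df2 in place of the tf used in the flow of FIG. 7 as the criterion for deciding whether a is effective for retrieval. df2/df represents the occurrence concentration of a substring, that is, the degree to which it occurs concentrated in particular documents. Selecting substrings with this information is intended to improve retrieval accuracy.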
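The df2/df test of S185 can be illustrated with a small sketch. The function names are assumptions, the counts are computed by scanning the documents directly, and the default threshold 0.22 is the value given for S184.

```python
def count_occurrences(term, text):
    """Count occurrences of term in text, including overlapping ones."""
    return sum(1 for i in range(len(text) - len(term) + 1)
               if text.startswith(term, i))


def is_effective(term, docs, df2_threshold=0.22):
    """S185: judge a substring effective when df2 / df exceeds the threshold."""
    counts = [count_occurrences(term, doc) for doc in docs]
    df = sum(1 for c in counts if c >= 1)     # documents containing the term
    df2 = sum(1 for c in counts if c >= 2)    # documents containing it twice or more
    return df > 0 and df2 / df > df2_threshold


# Toy databases: one repeated hit among four documents, then among five.
print(is_effective("ab", ["abab", "ab", "ab", "ab"]))        # df2/df = 0.25
print(is_effective("ab", ["abab", "ab", "ab", "ab", "ab"]))  # df2/df = 0.20
```

A substring that tends to repeat inside the documents where it appears has a high df2/df, which is the concentration the description treats as a sign of usefulness for retrieval.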
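[0114]
The threshold df2_threshold in S184 and the coefficients used in S186 are not limited to the values shown and may be set to values appropriate to the purpose.
[0115]
FIG. 8 shows the structure of the matching information management table. The table consists of a list of matching information for each document number. In FIG. 8, matching information 1 and 5 are recorded as a list for document number 0002, matching information 2, 3, and 6 for document number 0100, and matching information 4 and 7 for document number 0111. Each item of matching information stores the sequence number of the substring within the document (sequence_num), the position of the substring in the input string X (startX), the position of the substring in the document (startdoc), the length of the substring (termlength), and the weight attached to the substring (score).
[0116]
When new matching information 8 is obtained for document number 0002, as shown in FIG. 8, the head pointer of the list, which until then pointed to matching information 5, is made to point to matching information 8, and a pointer is set from matching information 8 to matching information 5, so that matching information 8 is recorded at the head of the list for document number 0002.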
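[0117]
(Third embodiment)
FIG. 9 shows an embodiment of a document search device that retrieves the document most similar to an input string based on the n-gram DP similarity. The device comprises a document database 10, a character string input unit 11, a substring selection unit 13 (internals not shown), a matching information collection unit 14 (internals not shown), a similarity calculation unit 17, a similarity calculation control unit 18, a recursive execution control unit 19, and a search result output unit 12.
[0118]
The document database 10, the character string input unit 11, the substring selection unit 13, the matching information collection unit 14, and the search result output unit 12 have the same functions and configurations as the parts with the same reference numerals in the first embodiment, and their description is omitted.
[0119]
The similarity calculation control unit 18 takes the list for one document Y from the matching information management table T2 and passes it, together with the strings X and Y, to the similarity calculation unit 17.
[0120]
The similarity calculation unit 17 calculates the similarity between X and Y from the given list of matching information, based on equation (1) or equation (2). In the course of this calculation, similarities must be obtained in the same way for parts of the strings. This is done by having the recursive execution control unit 19 invoke the similarity calculation unit 17 repeatedly. The similarity calculation unit 17 consists of a matching string similarity calculation unit 61, an arbitrary string similarity calculation unit 62, and a maximum value selection unit 63. The matching string similarity calculation unit 61 calculates SimDPs(α, β) of equation (3). The arbitrary string similarity calculation unit 62 calculates SimDPg(α, β) of equation (5). The maximum value selection unit 63 applies the function MAX to these to obtain SimDP(α, β) of equation (2). When both strings α and β received by the similarity calculation unit 17 are empty, the recursive execution control unit 19 sets SimDP(α, β) = 0.0, and the units 61, 62, and 63 are not operated. This value SimDP(α, β) = 0.0 implements equation (1).
[0121]
The matching string similarity calculation unit 61 is implemented by a string separation control unit 71, a string separation unit 72, a similarity calculation unit 73, an addition unit 74, and a maximum value selection unit 75, and calculates SimDPs(α, β) of equation (3). SimDPs(α, β) is calculated only when the matching leading string of α and β is a string recorded in the matching information management table T2.
[0122]
First, when the strings α (= ξγ) and β (= ξδ) received by the matching string similarity calculation unit 61 have no matching string ξ, that is, when the matching string ξ is empty, the string separation control unit 71 sets SimDPs(α, β) = 0.0 in accordance with equation (4).
[0123]
Next, when the strings α (= ξγ) and β (= ξδ) received by the matching string similarity calculation unit 61 have matching strings ξ, the string separation control unit 71 operates the string separation unit 72, the similarity calculation unit 73, and the addition unit 74 for every ξ to calculate Score(ξ) + SimDP(γ, δ) of equation (3). The maximum value selection unit 75 then selects the largest value. This yields SimDPs(α, β) of equation (3).
[0124]
The string separation unit 72 separates α into ξ and γ and β into ξ and δ, then refers to the matching information management table T2, passes the weight Score(ξ) to the addition unit 74, and passes γ and δ to the similarity calculation unit 73. The similarity calculation unit 73 calculates SimDP(γ, δ) of equation (3). In practice this is done by having the recursive execution control unit 19 apply the similarity calculation unit 17 to γ and δ. The addition unit 74 performs the addition in equation (3).
[0125]
The arbitrary string similarity calculation unit 62 is implemented by similarity calculation units 81 to 83 and a maximum value selection unit 84, and calculates SimDPg(α, β) of equation (5). It is executed when the leading strings differ, or when they match but the matching string is not registered in the matching information management table. Corresponding to each case of the presence or absence of the leading characters ζ and η of the received strings α (= ζγ) and β (= ηδ), the similarity calculation units 81, 82, and 83 obtain SimDP(α, δ), SimDP(γ, β), and SimDP(γ, δ) of equation (5), respectively. In practice this is done by having the recursive execution control unit 19 apply the similarity calculation unit 17 to α and δ, γ and β, and γ and δ. The maximum value selection unit 84 performs the function MAX of equation (5).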
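[0126]
(Fourth embodiment)
FIGS. 10 and 11 show the processing flow for obtaining the n-gram DP similarity between strings X and Y in software. This processing can be used as the internal processing of S16 in FIG. 3 in place of the processing described with FIG. 6. It is executed on the computer system shown in the second embodiment.
[0127]
First, in S91, among the first and last characters of the substrings recorded in the list, the positions of the earliest first character (minX, minY) and of the latest last character (maxX, maxY) are found for X and Y respectively. An array X_index of length maxX-minX+1 and an array Y_index of length maxY-minY+1 are prepared, and all elements are initialized to -1. These arrays correspond to the characters of X from minX to maxX and the characters of Y from minY to maxY, respectively.
[0128]
In S92 to S94, 0 is assigned to the elements of X_index and Y_index that correspond to characters, among those of X from minX to maxX and of Y from minY to maxY, that are the first or last character of a substring recorded in the list.
[0129]
In S95 to S99, the indices i with X_index[i] = 0 are numbered serially from the front as 0, 1, 2, ..., X_index_num-1, and each number is assigned to X_index[i]. X_index_num is therefore the number of indices i with X_index[i] ≠ -1.
[0130]
In S100 to S104, the same processing is applied to Y_index. The indices j with Y_index[j] = 0 are numbered serially from the front as 0, 1, 2, ..., Y_index_num-1, and each number is assigned to Y_index[j]. Y_index_num is therefore the number of indices j with Y_index[j] ≠ -1.
[0131]
In S105, the matching information for X and Y recorded in the list is sorted first by the order in which the substrings appear in X, and entries with the same order in X are then sorted by the order in which the substrings appear in Y. The number of matching-information entries is read into m.
[0132]
Next, in preparation for obtaining the similarity efficiently with DP, a score table scoretable with (X_index_num+2) rows and (Y_index_num+2) columns is created, and all elements are initialized to 0. The rows of the table correspond to the characters of X that are the first or last character of a substring recorded in the list, and the columns to such characters of Y.
[0133]
In S106, k and i are initialized to k = 1 and i = 0.
[0134]
In S107, j is initialized to j = 0. The variable k indicates that the k-th matching-information entry from the head is currently under consideration, and i and j indicate which of the first or last characters of the recorded substrings in X and Y, respectively, are under consideration. In S108, scoretable[i][j] is assigned to currentscore as the current score.
[0135]
For convenience, the fields of the k-th entry from the head of the list are written startX(k), startdoc(k), termlength(k), and score(k).
[0136]
In S109, it is determined whether the positions indicated by i and j coincide with the position at which the substring of the k-th entry occurs. If they do, processing proceeds to S110; if not, to S114.
[0137]
In S110, the row and column numbers in the score table that correspond to the last character of the matching substring are obtained and assigned to target_i and target_j.
[0138]
In S111, the score currently held in scoretable[target_i][target_j] is compared with currentscore + score(k), the score obtained by adding the weight of the matching substring to currentscore. If currentscore + score(k) is larger, it is assigned to scoretable[target_i][target_j] in S112. Otherwise S112 is skipped and processing proceeds to S113.
[0139]
In S113, 1 is added to k to move to the next entry in the list, and processing returns to S109. In S109, if the next entry also occurs at i and j, the processing up to S113 is repeated; otherwise processing proceeds to S114.
[0140]
In S114 to S119, currentscore is compared with the scores to the right, below, and lower right in the score table, and is assigned to each of them where currentscore is larger.
[0141]
In S120, it is determined whether j has reached the right edge of the score table. If not, 1 is added to j in S121, processing returns to S108, and the processing up to S119 is repeated. If it has, processing proceeds to S122.
[0142]
In S122, it is determined whether i has reached the bottom edge of the score table. If not, 1 is added to i in S123, processing returns to S107, and the processing up to S120 is repeated. If it has, processing proceeds to S124, and scoretable[X_index_num+1][Y_index_num+1] is returned as the similarity between X and Y.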
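The table-based calculation can be sketched in a simplified form. This is not a transcription of FIGS. 10 and 11: the function name `ngram_dp_similarity` and the tuple layout (startX, startdoc, termlength, score) are assumptions, the table is indexed by the start position and the position just past the end of each match rather than by first and last characters, and no padding rows or columns are used. Scores are assumed to be non-negative, as they are when the weight minus p0fit is stored.

```python
def ngram_dp_similarity(matches):
    """Best total score of matches that keep the same order in X and in Y.

    matches: list of (start_x, start_y, length, score) tuples.
    """
    if not matches:
        return 0.0
    # Keep only positions where some match starts or ends (compressed coordinates).
    xs = sorted({p for sx, _, n, _ in matches for p in (sx, sx + n)})
    ys = sorted({p for _, sy, n, _ in matches for p in (sy, sy + n)})
    row = {p: k for k, p in enumerate(xs)}
    col = {p: k for k, p in enumerate(ys)}
    starts = {}
    for sx, sy, n, score in matches:
        starts.setdefault((row[sx], col[sy]), []).append((row[sx + n], col[sy + n], score))

    rows, cols = len(xs), len(ys)
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            current = table[i][j]
            # A match starting here carries the current score to the cell where it ends.
            for ti, tj, score in starts.get((i, j), ()):
                table[ti][tj] = max(table[ti][tj], current + score)
            # Skipping ahead: propagate to the right, below, and lower right.
            for di, dj in ((0, 1), (1, 0), (1, 1)):
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols:
                    table[ni][nj] = max(table[ni][nj], current)
    return table[rows - 1][cols - 1]


# X = "abcde", Y = "abxde": "ab" at (0, 0) and "de" at (3, 3) keep their order.
print(ngram_dp_similarity([(0, 0, 2, 1.0), (3, 3, 2, 2.0)]))
# X = "deab", Y = "abde": the two matches cross, so only the heavier one counts.
print(ngram_dp_similarity([(0, 2, 2, 2.0), (2, 0, 2, 1.0)]))
```

The table size depends only on the number of distinct match boundaries, not on the lengths of X and Y, which reflects why restricting the DP to recorded substrings keeps the calculation fast for long documents.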
【０１４３】
【発明の効果】
請求項１に記載の発明によれば、入力された文字列から切り出した部分文字列のうち、文書データベース内の文書との類似度算出に関して効果のある文字列を選別して、類似度を求めることができる。すなわち検索に効果の高い文字列に限定して類似度を求めることができる。
【０１４４】
請求項２に記載の発明によれば、二つの文字列それぞれにおける順序に適合し、かつ、共通する部分文字列に着目して類似度を求めることができる。すなわち文字列の出現順序を考慮した類似度が求められる。
【０１４５】
請求項３に記載の発明によれば、類似度算出に関する効果の大きな順に文字列を選別して、類似度を求めることができる。これにより決まった類似度はより適切な値となる。
【０１４６】
請求項４に記載の発明によれば、一致した部分文字列の重みを加算する方法だけでなく、その他の類似度算出方法を組み合わせることができる。これにより検索精度が更に向上する。
【０１４７】
請求項５に記載の発明によれば、入力された文字列から切り出した部分文字列のうち、文書データベース内の文書との類似度算出に関して効果のある文字列を選別して、検索に効果の高い文字列に限定して類似度を求める装置を実現できる。
【０１４８】
請求項６に記載の発明によれば、二つの文字列それぞれにおける順序に適合し、かつ、共通する部分文字列に着目して文字列の出現順序を考慮した類似度を求める装置を実現できる。
【０１４９】
請求項７の発明によれば、入力された文字列から切り出した部分文字列のうち、文書データベース内の文書との類似度算出に関して効果のある文字列を選別して、検索に効果の高い文字列に限定して類似度を求めるプログラムを提供できる。
【０１５０】
請求項８に記載の発明によれば、二つの文字列それぞれにおける順序に適合し、かつ、共通する部分文字列に着目して文字列の出現順序を考慮した類似度を求めるプログラムを提供できる。
【０１５１】
請求項９に記載の発明によれば、以上述べたような効果を種々のコンピュータ上で実現できる。
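[0152]
Next, the retrieval performance of the first and second embodiments is described. An evaluation experiment was conducted to confirm the effect of selecting the substrings cut out of the input string and the effect of using tf instead of df as the selection criterion.
[0153]
FIG. 12 shows the relationship between the number of selected substrings and retrieval accuracy when tf and df are used as the selection criterion, and FIG. 13 shows the relationship between the number of selected substrings and retrieval time. Retrieval accuracy is measured by the 11-point average precision, an evaluation index obtained by taking eleven points at recall levels from 0 to 1 in steps of 0.1 and averaging the values at those points. Details are disclosed in G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, pp. 174-181, McGraw-Hill Book Co., New York, 1983.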
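The 11-point average precision can be computed as in the sketch below. It follows the commonly used interpolated definition (at each recall level, the highest precision observed at that recall or higher), which is an assumption here, since the description gives only the outline of the measure. The function name and the input format are likewise illustrative.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance: booleans in rank order, True where the document is relevant.
    total_relevant: number of relevant documents for the query.
    """
    points = []          # (recall, precision) measured at each relevant document
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [k / 10 for k in range(11)]
    interpolated = [max((p for r, p in points if r >= level), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(levels)


# Hypothetical ranking: relevant documents at ranks 1 and 3, two relevant in total.
print(eleven_point_average_precision([True, False, True, False], total_relevant=2))
```

In this example the precision is 1.0 for the six recall levels up to 0.5 and 2/3 for the five levels above it, so the average is (6 + 5 × 2/3) / 11.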
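[0154]
The substrings cut out of the input string were of length 2, that is, bigrams only. The parameters LB and UB used in S87 of FIG. 7 were LB = 0 and UB = idf. The document data was the NTCIR1 test collection, a representative test data set for similarity-based information retrieval, containing about 300,000 documents and 53 search texts.
[0155]
In both FIG. 12 and FIG. 13 the horizontal axis is the number of selected substrings, that is, the value of SubStringLimit. The vertical axis of FIG. 12 is retrieval accuracy, and that of FIG. 13 is the time taken to run the 53 search texts.
[0156]
FIG. 12 shows that, for both tf and df, accuracy improves as SubStringLimit increases while it is small, but changes little once it exceeds a certain size. Thus, if SubStringLimit is set to a suitable value, the loss of accuracy from selecting substrings can be kept small. Comparing tf and df, selection by df gives higher accuracy, but once SubStringLimit is reasonably large, selection by tf gives almost the same accuracy as selection by df.
[0157]
FIG. 13 shows that, for both tf and df, retrieval time decreases as SubStringLimit decreases, confirming the speedup from selecting substrings. Comparing tf and df, retrieval with tf takes less time, and the difference grows as SubStringLimit gets smaller. If SubStringLimit is set to 22, a value at which both tf and df show high accuracy, the retrieval time is 264.6 seconds for tf and 367.1 seconds for df, so tf is about 1.4 times faster. In this way the invention shortens retrieval time.
[0158]
Next, the retrieval performance obtained when the methods of the third and fourth embodiments are applied to the similarity calculation in S16 of FIG. 3 is described. The accuracy obtained with the additive method of FIG. 6 was compared with that obtained with the DP method of FIGS. 10 and 11, and the results are shown in FIG. 14. Retrieval accuracy was measured by the 11-point average precision. The substrings cut out of the input string were bigrams only, and the parameters LB and UB used in S87 of FIG. 7 were LB = 0 and UB = idf. The document data was the NTCIR1 test collection.
[0159]
In FIG. 14 the horizontal axis is the number of selected substrings, that is, the value of SubStringLimit, and the vertical axis is retrieval accuracy. FIG. 14 shows that there is no large difference in accuracy when SubStringLimit is 15 or less, but above 15 the DP method is more accurate than the additive method. Measurements of retrieval time confirmed that the DP method takes about twice as long as the additive method and is several tens of times faster than existing DP methods.
[0160]
Thus, by applying DP to the collected matching information, the invention achieves higher retrieval accuracy than the additive method and calculates similarity faster than existing DP methods.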
【図面の簡単な説明】
【図１】本発明による文書検索装置の実施例を示す図である。
【図２】本発明により文章検索を行う計算機システムの実施例を示す図である。
【図３】本発明の実施例の文章検索を行う処理のフローチャートである。
【図４】文字列から切り出される部分文字列の選別を行う処理のフローチャートである。
【図５】一致情報を収集し、記録する処理のフローチャートである。
【図６】一致情報から加算による類似度を求める処理のフローチャートである。
【図７】部分文字列の重みを求める処理のフローチャートである。
【図８】一致情報管理テーブルの構成図である。
【図９】本発明による文書検索装置の別の実施例を示す図である。
【図１０】一致情報からｎグラムＤＰ類似度を求める処理の前半部分のフローチャートである。
【図１１】一致情報からｎグラムＤＰ類似度を求める処理の後半部分のフローチャートである。
【図１２】選別の基準にｔｆ、ｄｆを用いた時の、選択した部分文字列の数と検索精度の関係を示す図である。
【図１３】選別の基準にｔｆ、ｄｆを用いた時の、選別した部分文字列の数と検索時間の関係を示す図である。
【図１４】本発明におけるＤＰを用いた類似度算出法と加算による類似度算出法の、検索精度の対比を示す図である。
【図１５】部分文字列の重みを求める別の処理のフローチャートである。
【符号の説明】
１０：文書データベース
１０ａ、１０ｂ、１０ｃ：文書
１１：文字列入力部
１２：検索結果出力部
１３：部分文字列選別部
１４：一致情報収集部
１５、１７、７３、８１、８２、８３：類似度算出部
１６、１８：類似度算出制御部
１９：再帰実行制御部
３１：部分文字列切り出し制御部
３２：部分文字列切り出し部
３３：文字列出現頻度算出部
３４：部分文字列登録部
４１：一致情報収集制御部
４２：文字列出現場所検索部
４３：文字列重み算出部
４４：一致情報登録制御部
４５：一致情報登録部
５１：文字列重み加算制御部
５２：文字列重み加算部
６１：一致文字列類似度算出部
６２：任意文字列類似度算出部
６３、７５、８４：最大値選択部
７１：文字列分離制御部
７２：文字列分離部
７４：加算部
１０１：ディスプレイ
１０２：プリンタ
１０３：キーボード
１０４：フロッピー（Ｒ）ディスク装置
１０５：ＣＤ−ＲＯＭ装置
１０６：読み出し専用メモリ（ＲＯＭ）
１０７：ランダムアクセスメモリ（ＲＡＭ）
１０８：磁気ディスク装置
１０９：中央処理装置（ＣＰＵ）
１１０：通信インターフェイス
１１１：バス
１１２：フロッピー（Ｒ）ディスク
１１３：ＣＤ−ＲＯＭ
１１４：通信ネットワーク
[0001]
[Technical field of the invention]
The present invention relates to determining the similarity between two character strings, and is particularly suited to determining, in information retrieval, the similarity between an input character string and documents registered in a database.
[0002]
[Prior art]
Information retrieval, in which a desired document is retrieved from a document database, is widely practiced. In such retrieval, a document is handled as a set of character strings made up of words, each consisting of several characters. The search string is compared with the character strings in the target documents, and the one or more documents with the highest similarity are selected. Methods of calculating the similarity between character strings fall broadly into two kinds: those that use morphological analysis and those that find matches between substrings of length n (hereinafter n-grams).
[0003]
A method using morphological analysis is disclosed in, for example, Gerard Salton and Christopher Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24, pp. 513-523, 1988. The basic procedure for obtaining the similarity between two character strings by this method is as follows. First, both character strings are decomposed into word sequences by morphological analysis using a dictionary and grammatical knowledge. Next, the two word sequences are compared to find the matching words. A weight is then set for each matching word, and the weights of all matching words are added. The resulting sum is the similarity by morphological analysis.
[0004]
The method using morphological analysis has the fundamental problem that retrieval fails when the accuracy of the morphological analysis itself is low. Raising that accuracy requires large word dictionaries and grammar rules, which makes information retrieval hard to use casually. Moreover, for documents containing buzzwords, coined words, or technical terms used only in limited fields, maintaining the word dictionary is a heavy burden.
[0005]
The n-gram method is disclosed in, for example, Yasushi Ogawa and Toru Matsuda, Overlapping statistical word indexing: A new indexing method for Japanese text, In Proceedings of SIGIR '97, Philadelphia PA, USA, pp. 226-234, 1997. The basic procedure for obtaining the similarity between character strings by this method is as follows. First, the substrings of n characters that are contained in both character strings are found. Next, a weight is set for each common substring. The weights of all matching portions are then added. The resulting sum is the similarity by n-grams.
[0006]
Regarding the setting of the weight of the common partial character string, a large value is given to a character string that appears intensively in a specific document and does not appear in other documents. Conversely, only small values are given to character strings that appear in many documents. This reflects that a character string appearing in many documents is not an element that characterizes the document and cannot be effectively used for searching.
[0007]
Since the n-gram method does not require morphological analysis, it can handle unknown words such as new technical terms and can be used easily.
[0008]
Among n-gram methods, for similarity calculation with strings of length 2 (hereinafter bigrams) in particular, there is a method that does not use every extracted bigram but obtains matching information between the character strings, and calculates the similarity, only from bigrams that contain no hiragana. This reflects the fact that bigrams containing hiragana are likely to appear in many documents in the document database and are very unlikely to characterize any one document. Including such bigrams in the comparison is considered only to increase the amount of computation without much improving retrieval accuracy.
[0009]
[Problems to be solved by the invention]
As described above, the n-gram method has many advantages over the morphological analysis method, and is often used in the field of information retrieval. The problem with the n-gram method is that the amount of calculation increases as the document database grows, and it takes time to obtain search results.
[0010]
The partial character strings cut out from a character string are thought to include many that are not necessarily effective in calculating the similarity, that is, partial character strings that are contained in many documents in the document database and therefore have a small weight and little influence on the similarity. For this reason, checking whether every n-gram cut out from the character string matches is not efficient in terms of calculation time.
[0011]
On the other hand, the method that excludes bigrams containing hiragana from the matching information is efficient in terms of calculation time, but it uniformly discards even character strings that would be useful for the search. As a result, the search accuracy is lowered.
[0012]
The present invention solves the problems in the above-described n-gram method, and provides a method that achieves both reduction in calculation time and improvement in search accuracy.
[0013]
[Means for Solving the Problems]
The invention according to claim 1 is a method of calculating the similarity between two character strings, in which matching information with the second character string is collected for partial character strings that are selected, from among the partial character strings cut out from the first character string, based on their effect on the similarity calculation; the weights of the matched partial character strings are calculated from the matching information; and the similarity is calculated based on those weights. In other words, when obtaining matching information between character strings, not all of the extracted n-grams are handled; instead, the n-grams effective for calculating the similarity are selected by estimating their effect on the similarity calculation.
[0014]
With this configuration, the calculation time can be reduced while maintaining a search accuracy substantially equal to that of a method that performs no selection.
[0015]
The invention according to claim 2 is a method of calculating the similarity between two character strings, in which matching information with the second character string is collected for partial character strings that are selected, from among the partial character strings cut out from the first character string, based on their effect on the similarity calculation, and the similarity is calculated based on the weights of those partial character strings, among the partial character strings included in the matching information, whose order of appearance matches in the first and second character strings. In other words, in calculating the similarity between character strings, the partial character strings whose order of appearance matches in both character strings are further selected from the partial character strings recorded in the matching information, and their weights are used to calculate the similarity.
[0016]
Since it is configured as described above, the partial character string for calculating the weight is limited, and the calculation speed can be further increased while the search accuracy is maintained.
[0017]
The invention according to claim 3 is the invention according to claim 1 or 2, characterized in that the partial character strings are selected in consideration of the number of times each partial character string cut out from the first character string appears in the second character string.
[0018]
Since it is configured as described above, it is possible to efficiently select partial character strings that are effective in calculating similarity, and it is possible to further reduce calculation time while maintaining search accuracy.
[0019]
The invention according to claim 4 is the invention according to any one of claims 1 to 3, characterized in that the matching information between the character strings includes the length of the partial character string, the appearance location of the partial character string in the first character string, the appearance location of the partial character string in the second character string, and a sequence number indicating the ordinal position of the match in the second character string.
[0020]
With this configuration, not only the method of calculating the similarity by adding the weights of the matched n-grams but also many other similarity calculation methods can be used.
[0021]
The invention according to claim 5 is a character string similarity calculation device for calculating the similarity between two character strings, comprising: a partial character string selection unit that selects partial character strings cut out from the first character string based on their effect on the similarity calculation; a matching information collection unit that collects matching information with the second character string for the selected partial character strings; and a similarity calculation unit that calculates the weights of the matched partial character strings based on the matching information and calculates the similarity by summing the weights.
[0022]
With this configuration, it is possible to realize an apparatus that shortens the calculation time while maintaining a search accuracy substantially equal to that of a method that performs no selection.
[0023]
The invention according to claim 6 is a character string similarity calculation device for calculating the similarity between two character strings, comprising: a partial character string selection unit that selects partial character strings cut out from the first character string based on their effect on the similarity calculation; a matching information collection unit that collects matching information with the second character string for the selected partial character strings; a matching partial character string selection unit that selects, from the partial character strings included in the matching information, partial character strings whose order of appearance matches; and a similarity calculation unit that sums the weights assigned to those matching partial character strings.
[0024]
Since it is configured as described above, it is possible to realize a device in which partial character strings for calculating weights are limited and the calculation speed can be further increased while the search accuracy is maintained.
[0025]
The invention according to claim 7 is a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to function as: selection means for selecting partial character strings cut out from the first character string based on the number of times they appear in the second character string; matching information collection means for collecting matching information with the second character string for the selected partial character strings; and similarity calculation means for calculating the weights of the matched partial character strings based on the matching information and calculating the similarity by summing the weights.
[0026]
With such a configuration, the computer can be made to function as means for shortening the calculation time while maintaining a search accuracy substantially equal to that of a method that performs no selection.
[0027]
The invention according to claim 8 is a character string similarity calculation program for calculating the similarity between two character strings, the program causing a computer to function as: selection means for selecting partial character strings cut out from the first character string based on their effect on the similarity calculation; matching information collection means for collecting, as matching information with the second character string for the selected partial character strings, the length of the partial character string, the appearance location of the partial character string in the first character string, the appearance location of the partial character string in the second character string, and a sequence number indicating the ordinal position of the match in the second character string; and similarity calculation means for calculating the similarity based on the weights assigned to those partial character strings, among the partial character strings included in the matching information, whose order of appearance matches in both character strings.
[0028]
With this configuration, the computer can function as a means for further speeding up the calculation while limiting the partial character string for calculating the weight and maintaining the search accuracy.
[0029]
The invention according to claim 9 is a computer-readable recording medium in which the program according to claim 7 or 8 is recorded.
[0030]
Since it is configured as described above, the recorded function can be realized at a necessary place.
[0031]
[Detailed Description of the Invention]
The present invention relates to a method for calculating the similarity between character strings using n-grams, and to an apparatus, program, and recording medium for realizing the method. The description assumes that the similarity between an input character string and a plurality of documents registered in a database is calculated, but other applications are possible. To determine the matching parts between character strings, the n-grams common to the input character string and a document are not obtained document by document for every document in the database; instead, documents containing a given n-gram are retrieved efficiently from the database by using a suffix file.
[0032]
The n-grams cut out from the input character string are considered to contain many n-grams that have little influence on the similarity calculation. The number of partial character strings cut out from the input character string is, for example, m-1 when the length of the input character string is m and the partial character string is limited to bigram which is two characters. Therefore, if the matching information is obtained for all the extracted bigrams without selecting the partial character strings, the calculation time increases as the input character string length m increases. Therefore, in the present invention, n-grams are selected from the extracted n-grams by estimating the effect on the calculation of the degree of similarity, and only a fixed number of n-grams are applied to the collection of matching information.
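The count of extracted partial character strings mentioned above can be checked with a small sketch. The function name `cut_ngrams` is a hypothetical helper introduced here for illustration only; it is not a name used in the description.

```python
def cut_ngrams(text, n):
    """Return every length-n partial character string of text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# An input character string of length m yields m - 1 bigrams (n = 2).
query = "similarity"
bigrams = cut_ngrams(query, 2)
print(len(query), len(bigrams))
```

Because the number of extracted n-grams grows linearly with the input length for each n, capping the number passed on to the matching step bounds the cost of that step regardless of how long the input is.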
[0033]
Regarding the selection of partial character strings, partial character strings whose weight added to the similarity is large must be selected in order to improve the search accuracy. Since the weight is determined by the number of documents in the database that contain the partial character string (hereinafter referred to as df), it would be natural to select based on the value of df. However, in one aspect of the present invention, it is recommended that the partial character strings be selected based not on df but on the number of times the n-gram appears in the documents in the database (hereinafter referred to as the appearance frequency, or tf).
[0034]
tf can be calculated directly, and the calculation time hardly changes even if its value increases. To calculate df, on the other hand, tf must first be calculated and the duplicates within each document must then be counted, so the calculation time becomes long when the value of tf is large. Meanwhile, tf and df are strongly correlated, so using tf instead of df can be expected not to affect the accuracy of the search.
[0035]
The collection of matching information is performed for each selected n-gram by the following method.
First, the weight of the n-gram is calculated. Next, the documents containing the n-gram are found in the entire document database, and the appearance location of the n-gram in the document, the appearance location of the n-gram in the input character string, the length of the n-gram, the sequence number indicating the ordinal position of the match in the document, and the weight of the n-gram are recorded as matching information.
[0036]
Ordinarily, the matching information obtained is not recorded and managed; the weights are simply added as the matches are found in order to calculate the similarity. In the present invention, this information is recorded and managed, so that, while high speed is maintained, the approach can be applied not only to the method of calculating the similarity by adding the weights of matched n-grams but also to many other similarity calculation methods.
[0037]
The similarity between the input character string and a document in the database is calculated by adding the weights assigned to the matched n-grams. However, even when a matching n-gram appears more than once in the same document, its weight is added only once. That is, the added weight is not doubled merely because a matching n-gram appears twice. Instead, the weight of the n-gram stored as matching information varies from document to document according to the number of times the n-gram appears in that document (hereinafter referred to as dtf): the larger the value of dtf, the larger the weight. This weight is also defined for each n-gram when dtf = 0, and a weight of 0 or less is added for every document that does not contain the n-gram. This is based on the idea that not containing the n-gram indicates dissimilarity, and the added weight can be regarded as quantifying the degree of dissimilarity. Accordingly, the value indicating the similarity is a real number that may be negative, and the larger the value, the greater the similarity to the input character string.
[0038]
The method of calculating the similarity is not limited to the above-described method of adding the weights attached to the matched n-grams, and other similarity calculation methods can be applied. One example is a method in which only n-grams that match the order of appearance in each character string are selected from the matched n-grams, and the sum of the weights assigned to these n-grams is used as the similarity.
[0039]
Methods for calculating the similarity in consideration of the appearance order of character strings have been proposed. However, because they calculate weights for all matching partial character strings, the amount of calculation is greater than with the method of adding the weights of matched n-grams, and the calculation time grows as the character strings handled become longer.
[0040]
In the present invention, weights are given to the selected n-grams, and a weight is added to the similarity only when the n-gram matches. Matching and appearance order therefore need to be considered only for the selected partial character strings, so the calculation is faster than with the conventional methods, and the effect on the calculation time is small even if the character strings handled become longer. In this method, the similarity is calculated using dynamic programming (hereinafter referred to as DP) in order to efficiently find, among the combinations of matching n-grams, the combination with the highest similarity. A method of obtaining the n-gram DP similarity using DP is described below.
[0041]
α, β, γ, and δ are character strings of length 0 or more, ξ, ζ, and η are character strings of length 1 or more, and "" is the empty character string. A character string (for example, α) formed by concatenating a plurality of character strings (for example, ξ and γ) is expressed by writing the symbols of its component character strings in succession (for example, α = ξγ).
[0042]
The n-gram DP similarity SimDP is obtained by recursively applying the following formulas according to the matching pattern of the argument character strings. First, if both are empty character strings,
SimDP("", "") = 0 (1)
Otherwise,
SimDP(α, β) = MAX(SimDPs(α, β), SimDPg(α, β)) (2)
Here, SimDPs(α, β) is obtained by evaluating, for every character string ξ that is recorded in the list relating to α and β in the matching information management table and satisfies α = ξγ and β = ξδ,
SimDPs(α, β) = MAX(Score(ξ) + SimDP(γ, δ)) (3)
and taking the maximum over all such ξ. When there is no such character string ξ,
SimDPs(α, β) = 0.0 (4)
Score(ξ) is a function that returns the weight of ξ recorded in the matching information management table.
[0043]
Further, when ζ and η are the longest character strings that satisfy α = ζγ and β = ηδ and have no part in common with the character strings recorded in the list relating to α and β in the matching information management table, SimDPg(α, β) is obtained by
SimDPg(α, β) = MAX(SimDP(α, δ), SimDP(γ, β), SimDP(γ, δ)) (5)
This formula means that, among the pairs of character strings obtained by removing ζ, η, or both from the two character strings ζγ and ηδ (namely α and δ, γ and β, and γ and δ), the highest similarity is adopted.
[0044]
By applying the above formulas recursively, the partial character strings whose order of appearance matches in both character strings are obtained from the partial character strings recorded in the matching information management table in such a way that the similarity is maximized.
[0045]
The n-gram DP similarity shown above is a similarity calculation method that takes the continuity of characters into account because, as in equation (3), weights are given in consideration of matches of n-grams (character strings of length n) rather than of single characters (character strings of length 1). In addition, matches between the character strings are not considered for all partial character strings but are limited to the n-grams recorded in the matching information management table, and, as shown in equation (5), portions unrelated to the recorded partial character strings need not be checked for matches and can be removed. As a result, however long the character strings whose similarity is to be calculated, only the portions related to the partial character strings recorded in the matching information management table need to be considered, so the similarity can be calculated faster than with the conventional methods that consider matches between character strings.
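The recursion of equations (1) to (5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the description: the dictionary `weights` stands in for the matching information management table and for Score(ξ); the gap step skips one character at a time rather than the longest non-matching strings ζ and η; and the similarity is taken to be 0 as soon as either string is exhausted, whereas equation (1) only states the case in which both are empty. The name `ngram_dp_similarity` is hypothetical.

```python
from functools import lru_cache


def ngram_dp_similarity(x, y, weights):
    """Order-preserving n-gram similarity by memoized recursion.

    weights: dict mapping each selected n-gram to its weight
             (assumed stand-in for the matching information table).
    """
    lengths = sorted({len(g) for g in weights})

    @lru_cache(maxsize=None)
    def sim(i, j):
        # Assumed base case: nothing left to match in x or in y.
        if i >= len(x) or j >= len(y):
            return 0.0
        best = 0.0  # equation (4): no matching n-gram contributes 0.0
        # Equation (3): a recorded n-gram starts at both positions.
        for n in lengths:
            g = x[i:i + n]
            if len(g) == n and g in weights and y[j:j + n] == g:
                best = max(best, weights[g] + sim(i + n, j + n))
        # Equation (5), simplified: drop one character from x, from y, or both.
        best = max(best, sim(i + 1, j), sim(i, j + 1), sim(i + 1, j + 1))
        return best

    return sim(0, 0)


w = {"ab": 2.0, "cd": 1.0, "ef": 3.0}
print(ngram_dp_similarity("abcdef", "abxxef", w))           # "ab" then "ef" match in order
print(ngram_dp_similarity("abef", "efab", {"ab": 2.0, "ef": 3.0}))  # only one can be kept
```

In the second call both n-grams occur in both strings, but in opposite order, so only the heavier one contributes. The recursion depth grows with the combined string length, so an iterative table would be preferable for long inputs.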
[0046]
(First embodiment)
First, an embodiment of a method for calculating the similarity between character strings using n-grams will be described. FIG. 1 shows an example of a document retrieval apparatus that retrieves the document having the highest similarity to an input character string based on the selected partial character strings. The document retrieval apparatus includes a document database 10, a character string input unit 11, a partial character string selection unit 13, a matching information collection unit 14, a similarity calculation unit 15, a similarity calculation control unit 16, and a search result output unit 12.
[0047]
In the document database 10, a plurality of documents 10a, 10b, ..., 10c to be searched are registered. For the search, keywords, words, phrases, sentences, passages, and the like are input (hereinafter referred to collectively as the search sentence). The character string input unit 11 gives the search sentence as the character string X to the partial character string selection unit 13.
[0048]
The partial character string selection unit 13 cuts out partial character strings from the character string X given by the character string input unit 11, calculates their appearance frequency tf, selects a fixed number of them in ascending order of tf, and registers them in the partial character string management table T1. The partial character string selection unit 13 includes a partial character string cutout control unit 31, a partial character string cutout unit 32, a character string appearance frequency calculation unit 33, and a partial character string registration unit 34. The partial character string cutout control unit 31 controls which partial character string the partial character string cutout unit 32 cuts out. The partial character string cutout unit 32 cuts out the partial character string x from the character string X. The character string appearance frequency calculation unit 33 calculates the appearance frequency tf of the partial character string x in the document database 10. The partial character string registration unit 34 selects a fixed number of the cut-out partial character strings in ascending order of tf and registers them in the partial character string management table T1.
[0049]
The matching information collection unit 14 calculates the weight given to each partial character string registered in the partial character string management table T1, detects every location at which each partial character string appears in the document database, and records, as matching information in the matching information management table T2, the appearance location in the document, the appearance location in the input character string, the length of the partial character string, the sequence number indicating the ordinal position of the match in the document, and the weight of the character string. The matching information collection unit 14 includes a matching information collection control unit 41, a character string appearance location search unit 42, a character string weight calculation unit 43, a matching information registration control unit 44, and a matching information registration unit 45. The matching information collection control unit 41 extracts the partial character strings a registered in the partial character string management table T1 one by one and gives them to the character string appearance location search unit 42 and the character string weight calculation unit 43. The character string appearance location search unit 42 obtains, for every location at which the given partial character string a appears in the document database, the appearance location, the length of the partial character string a, and the sequence number indicating the ordinal position of the match in the document. The character string weight calculation unit 43 calculates the weight given to the given partial character string a. The matching information registration control unit 44 selects the appearance locations of the partial character string a one by one and gives the matching information registration unit 45 a set consisting of the appearance location in the document, the appearance location in the input character string, the length of the partial character string a, the sequence number in the document, and the weight of the partial character string a. The matching information registration unit 45 registers each received set as matching information in the list for the corresponding document number in the matching information management table T2.
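The collection step can be pictured with the following sketch. It is an illustration under assumptions rather than the apparatus itself: occurrences are found with plain `str.find` instead of the suffix file, only the first occurrence of each partial character string in the input is recorded, and `weight_of` is a hypothetical callback standing in for the character string weight calculation unit 43, whose formula is not given in this passage. The dictionary keys are likewise names chosen here for illustration.

```python
from collections import defaultdict


def collect_match_info(query, documents, selected, weight_of):
    """Build a table: document number -> list of match records."""
    table = defaultdict(list)
    for sub in selected:
        query_pos = query.find(sub)   # appearance location in the input string
        score = weight_of(sub)        # hypothetical weight function
        for doc_no, doc in enumerate(documents):
            seq_no = 0
            doc_pos = doc.find(sub)
            while doc_pos != -1:
                seq_no += 1           # 1 for the first match in this document
                table[doc_no].append({
                    "doc_pos": doc_pos,
                    "query_pos": query_pos,
                    "length": len(sub),
                    "seq_no": seq_no,
                    "score": score,
                })
                doc_pos = doc.find(sub, doc_pos + 1)
    return table


docs = ["xxabyyab", "cd"]
info = collect_match_info("abcd", docs, ["ab", "cd"], lambda s: 1.0)
for doc_no, records in sorted(info.items()):
    print(doc_no, records)
```

Keeping the records per document number means the later similarity step can process one document's list at a time without touching the document text again.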
[0050]
The similarity calculation control unit 16 extracts the list relating to one document Y from the matching information management table T2 and gives the list to the similarity calculation unit 15.
[0051]
The similarity calculation unit 15 calculates the similarity between X and Y from the given list of matching information. The similarity calculation unit 15 includes a character string weight addition control unit 51 and a character string weight addition unit 52. The character string weight addition control unit 51 selects one match information from the list of match information, and if the sequence number of the match information is 1, gives the character string weight score to the character string weight adder 52. The character string weight addition unit 52 adds the given weight to the similarity Sim (X, Y) between X and Y.
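The addition rule described above, in which a weight is added only when the sequence number is 1, can be sketched as follows. The record type `MatchInfo` and its field names are hypothetical; they mirror the items of matching information listed in the description.

```python
from dataclasses import dataclass


@dataclass
class MatchInfo:
    doc_pos: int      # appearance location in the document
    query_pos: int    # appearance location in the input character string
    length: int       # length of the partial character string
    seq_no: int       # 1 for the first match in the document, 2 for the second, ...
    score: float      # weight of the partial character string


def additive_similarity(match_list):
    """Sum the weights of records whose sequence number is 1."""
    return sum(m.score for m in match_list if m.seq_no == 1)


matches_for_y = [
    MatchInfo(doc_pos=2, query_pos=0, length=2, seq_no=1, score=1.5),
    MatchInfo(doc_pos=6, query_pos=0, length=2, seq_no=2, score=1.5),  # repeat, ignored
    MatchInfo(doc_pos=9, query_pos=2, length=2, seq_no=1, score=2.0),
]
print(additive_similarity(matches_for_y))
```

Filtering on the sequence number is what keeps a partial character string that occurs several times in one document from being counted more than once.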
[0052]
The search result output unit 12 selects and outputs a document having the highest similarity. At this time, documents having similarities of a certain value or higher, and documents from higher ranks to a certain rank may be output together.
[0053]
(Second embodiment)
An embodiment in which the text search according to the present invention is implemented by software will be described below.
[0054]
FIG. 2 shows an example of a computer system used for executing a text search. The computer system includes a display 101, a printer 102, a keyboard 103, a floppy (R) disk device 104, a CD-ROM (Compact Disk-Read Only Memory) device 105, a read-only memory (hereinafter referred to as ROM) 106, a readable and writable random access memory (hereinafter referred to as RAM) 107, a magnetic disk device 108, a central processing unit (hereinafter referred to as CPU) 109, a communication interface 110, and a bus 111 connecting them. The floppy (R) disk device 104 reads and writes the floppy (R) disk 112, and the CD-ROM device 105 reads the CD-ROM 113. The computer system is connected to the communication network 114 by the communication interface 110.
[0055]
A text search program for implementing the present invention is stored in the ROM 106. Alternatively, the text search program may be stored on the floppy (R) disk 112, the CD-ROM 113, or the magnetic disk device 108, transferred to the RAM 107, and then executed by the CPU 109. The CPU 109 executes the text search program using the RAM 107 as a work area. If necessary, the magnetic disk device 108 may also be used as a work area. An instruction to execute the text search program is given from the keyboard 103, and the execution result is output to the display 101 or the printer 102. It goes without saying that the execution of the text search program may instead be instructed from the floppy (R) disk 112 and the execution result written to the floppy (R) disk 112.
[0056]
The document database is stored on the floppy (R) disk 112, the CD-ROM 113, or the magnetic disk device 108. It may be transferred to the RAM 107 for high-speed access, and when it is transferred to the RAM 107 it may be converted into a format that is easy to process. Of course, the text search program, the document database, or an execution instruction may also be input to the computer system via the communication network 114, and the execution result may be output from the computer system via the communication network 114.
[0057]
Further, the present invention is not limited to what is shown in the figure, and it goes without saying that it can be modified into various embodiments, such as performing input to and output from the computer system using various recording media, input means, and output means. Of course, the recording media, input means, and output means may be accessed via a communication network as well as directly by the computer system.
[0058]
FIG. 3 to FIG. 7 show the processing flow of the document search program based on the character string similarity calculated by selecting n-grams to be calculated.
[0059]
FIG. 3 shows a processing flow in which a document database is searched based on a search sentence, and a document having a high similarity is selected and output.
[0060]
First, in step S11 (hereinafter abbreviated as S11), all the documents in the document database are combined to create a suffix file, in preparation for efficiently calculating the number of appearances of a given character string. See S. Yamamoto and K. W. Church, Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, In Proceedings of the 6th Workshop on Very Large Corpora, Ed. Eugene Charniak, Montreal, pp. 28-37, 1998.
[0061]
By using a suffix file, the number of times a character string appears in the document database can be obtained at high speed. The suffix file is implemented by sorting the partial character strings that can occur in all the documents in character-code order and assigning serial numbers (indices) to them. The number of times a character string appears in the document database is obtained by calculating how many character strings in the suffix file match it.
[0062]
Specifically, the minimum value min and the maximum value max of the indices of the matching character strings are first obtained by binary search. If there is no matching character string, the number of appearances in the document database is 0. If min and max are obtained, the number of appearances tf of the character string is given by tf = max − min + 1.
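The calculation of tf from min and max can be sketched as follows. This is a minimal illustration that treats the database as one concatenated string and builds the sorted suffix positions naively; the function names are hypothetical, and a real suffix file would be built far more efficiently.

```python
def build_suffix_array(text):
    """Positions of all suffixes of text, sorted in character-code order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def term_frequency(text, suffix_array, query):
    """Count occurrences of query via two binary searches (tf = max - min + 1)."""
    n = len(query)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + n]

    # Smallest index whose n-character prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    min_index = lo

    # Largest index whose n-character prefix is <= query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    max_index = lo - 1

    if max_index < min_index:
        return 0                      # no matching character string
    return max_index - min_index + 1


corpus = "abracadabra"
sa = build_suffix_array(corpus)
print(term_frequency(corpus, sa, "abra"), term_frequency(corpus, sa, "xyz"))
```

The binary search is valid because truncating lexicographically sorted suffixes to their first n characters keeps them in non-decreasing order, so all matches occupy one contiguous range of indices.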
[0063]
The documents in the document database are distinguished from each other by the document number, and the document number is assigned to the partial character string registered in the suffix file. As a result, a document including a certain partial character string can be efficiently searched. Also, df can be calculated by counting the number of duplicate document numbers and subtracting that number from tf.
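The relation between tf and df described above can be sketched by attaching a document number to every suffix. The sketch below is an assumption-laden illustration: it stores whole suffixes and rebuilds a list of truncated keys on every query, which is simple but not how an efficient suffix file would work, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_file(documents):
    """Sorted list of (suffix, document number) over all documents."""
    entries = []
    for doc_no, doc in enumerate(documents):
        for i in range(len(doc)):
            entries.append((doc[i:], doc_no))
    entries.sort()
    return entries


def tf_and_df(entries, query):
    """Return (tf, df), with df = tf minus the number of duplicate document numbers."""
    # Truncated keys stay sorted, so the matches form one contiguous range.
    # Rebuilding this list per query is for clarity only.
    keys = [suffix[:len(query)] for suffix, _ in entries]
    lo = bisect_left(keys, query)
    hi = bisect_right(keys, query)
    tf = hi - lo
    doc_numbers = [doc_no for _, doc_no in entries[lo:hi]]
    duplicates = tf - len(set(doc_numbers))
    return tf, tf - duplicates


suffix_file = build_suffix_file(["abab", "abc", "xyz"])
print(tf_and_df(suffix_file, "ab"))
```

The extra pass over the matching range to count duplicates is the reason df costs more than tf when a character string is frequent.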
[0064]
In step S12, the search sentence is read into the character string X.
[0065]
In S13, the partial character string cut out from the character string X is selected based on the appearance frequency tf in the document database, and recorded in the partial character string management table as a pair with tf. The process performed in S13 will be described later with reference to FIG.
[0066]
In S14, match information is collected for each partial character string recorded in the partial character string management table and recorded in the match information management table. In the match information management table, a list of match information is recorded for each document number. The process performed in S14 will be described later with reference to FIG.
[0067]
In S15, a list of one document Y is extracted from the match information management table.
[0068]
In step S16, the similarity between X and Y is calculated from the extracted list. The process performed in S16 will be described later with reference to FIG.
[0069]
In S17, the obtained similarity and the document number are paired and registered in the document management table.
[0070]
In S18, it is determined whether the similarity has been calculated for all the lists recorded in the match information management table. If not, the process returns to S15, a list for which the similarity has not yet been calculated is extracted, and the processes up to S17 are repeated. If the similarity has been calculated for all the lists, the entries registered in the document management table are sorted in descending order of similarity in S19.
[0071]
In S20, a process of outputting a document having a high similarity is performed. Various modes are possible such as only one document to be output, a predetermined plural number, or all documents having a predetermined similarity or higher.
[0072]
FIG. 4 shows the flow of processing in which partial character strings are cut out from the character string X into which the search sentence has been read, the partial character strings used for collecting matching information are selected based on the appearance frequency, and the selected partial character strings are recorded in the partial character string management table.
[0073]
First, in S31, a variable num_substring representing the number of partial character strings recorded in the partial character string management table and a variable j representing the length of the partial character string to be extracted are initialized. MinNgramLength is a parameter that determines the minimum length of the partial character string to be cut out.
[0074]
In step S32, one partial character string having a length j is cut out from the character string X, and the appearance frequency tf in the document database is calculated.
[0075]
In S33, it is determined whether or not the value of tf of the extracted partial character string is 0. If tf = 0, the partial character string does not exist in the document database, so it is inappropriate to use it for collecting matching information. Therefore, the process of S34 is skipped and the process proceeds to S35. If tf ≠ 0, the process proceeds to S34.
[0076]
In S34, the extracted partial character string is recorded in the partial character string management table together with tf, and 1 is added to the value of num_substring.
[0077]
In S35, it is determined whether tf has been calculated for all partial character strings of length j cut out from X. If tf has not yet been calculated for some partial character string of length j, the process returns to S32, where such a partial character string is selected and its tf is calculated, and the processing up to S34 is repeated. If tf has been calculated for all the partial character strings of length j, 1 is added to j in S36.
[0078]
In S37, it is determined whether the length j of the partial character string cut out from X is larger than the parameter MaxNgramLength, which determines the maximum length of the partial character string to be cut out. If the value of j is less than or equal to MaxNgramLength, the process returns to S32, and the processing up to S36 is repeated for partial character strings of length j. If the value of j is greater than MaxNgramLength, the calculation of tf has been completed for all partial character strings whose length is greater than or equal to MinNgramLength and less than or equal to MaxNgramLength. The process therefore proceeds to S38, and the partial character strings recorded in the partial character string management table are rearranged in ascending order of tf.
[0079]
In S39, it is determined whether or not the number num_substring of partial character strings registered in the partial character string management table is larger than the parameter SubStringLimit that determines the upper limit value of the number of partial character strings used for collecting matching information. If num_substring is larger than SubStringLimit, the process proceeds to S40, and SubStringLimit partial character strings are extracted in ascending order of tf, and these partial character strings are newly recorded in the partial character string management table. If num_substring is less than or equal to SubStringLimit, S40 is skipped and the process proceeds to S41.
[0080]
S41 is a process of returning the recorded partial character string management table.
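The selection flow of S31 to S41 can be sketched in Python as follows. This is only an illustration under stated assumptions: the document database is modeled as a plain list of strings, tf is counted by a naive scan that allows overlapping occurrences, and a partial character string that occurs more than once in X is recorded only once. The names select_substrings and count_occurrences are hypothetical.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count, pos = 0, text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + 1)
    return count


def select_substrings(x, documents, min_len=2, max_len=3, limit=5):
    """Return up to `limit` (substring, tf) pairs in ascending order of tf."""
    table = {}
    for j in range(min_len, max_len + 1):        # S36/S37: j runs from MinNgramLength to MaxNgramLength
        for start in range(len(x) - j + 1):      # S32: cut out each substring of length j
            sub = x[start:start + j]
            if sub in table:                     # assumption: record a repeated substring once
                continue
            tf = sum(count_occurrences(doc, sub) for doc in documents)
            if tf != 0:                          # S33: skip substrings absent from the database
                table[sub] = tf                  # S34: record the substring together with tf
    ranked = sorted(table.items(), key=lambda item: item[1])  # S38: ascending tf
    return ranked[:limit]                        # S39/S40: keep at most SubStringLimit entries
```

Sorting in ascending order of tf means that, when the limit applies, the rarest partial character strings in the database are the ones kept.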
[0081]
FIG. 5 shows a flow of processing for collecting matching information between each partial character string recorded in the partial character string management table and each document in the document database and recording the information in the matching information management table.
[0082]
First, in S51, the variable p0fit_sum is initialized to zero. The variable p0fit_sum is used to reduce the amount of calculation when the similarity is obtained by adding the weights of the matched n-grams; it is an offset of the similarity that applies to every document in the document database.
[0083]
In S52, one partial character string is selected from the partial character string management table and read into a.
[0084]
In S53, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated, and p0fit is added to p0fit_sum. p0fit, p1fit, p2fit, p3fit, and p4fit are the weights of a for a document in which a does not appear, appears once, appears twice, appears three times, and appears four or more times, respectively. The calculation method of p0fit, p1fit, p2fit, p3fit, and p4fit will be described later with reference to FIG. 7.
[0085]
In S54, all the places where a appears in the document database are obtained and rearranged in the order of the places where they appear.
[0086]
In S55, the document number of the document including a is obtained for each occurrence location of a. Since the occurrence locations of a are arranged in order of appearance, the obtained document numbers are also arranged in ascending order.
[0087]
In S56, one occurrence location of a is selected, in order of occurrence location.
[0088]
In S57, it is determined whether or not the selected occurrence location of a is the foremost occurrence location in the document that contains it. That is, if the document at the selected occurrence location differs from the document at the previous occurrence location, it is the first occurrence location in that document; if it is the same, it is the second or a subsequent occurrence location. If it is the first occurrence location, the process proceeds to S58, where the number of occurrences dtf of a in the document is calculated and the weight of a in the document is determined. In addition, sequence_num is set to 1. sequence_num is a sequence number indicating which occurrence of a in the document the location is.
[0089]
In S59, the sequence number sequence_num in the document, the occurrence location of a in the input character string X (hereinafter, startX), the occurrence location of a in the document (hereinafter, startdoc), the length of a (hereinafter, termlength), and the weight of a (hereinafter, score) are recorded as a set in the match information management table, and 1 is added to sequence_num. However, the value recorded in score is not the weight of a itself but the value obtained by subtracting p0fit from the weight of a. This reduces the amount of calculation when the similarity is obtained by adding the weights of the matched n-grams: instead of adding, for each selected partial character string, its p0fit to the similarity of every document that does not contain it, only the value obtained by subtracting p0fit from the weight is added for the matched n-grams, and the weight offset p0fit_sum is finally added to all similarities.
[0090]
In S60, the values of sequence_num and tf are compared to determine whether or not match information has been recorded for all occurrence locations of a. If match information has not yet been recorded for some occurrence location, the process returns to S56, the next occurrence location of a is selected, and the processing up to S59 is repeated. If match information has been recorded for all the occurrence locations of a, the process proceeds to S61.
[0091]
In S61, it is determined whether or not the matching information has been collected for all the partial character strings in the partial character string management table. If there is a partial character string for which no matching information has been collected, in S52, the partial character string not yet selected is read into a, and the processing up to S60 is repeated. If the collection of matching information has been completed for all the partial character strings, the obtained matching information management table is returned in S62.
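The collection flow of S51 to S62 can be sketched as follows. This is an illustration, not the flowchart itself: the database is modeled as a list of strings scanned document by document, startX is taken as the first occurrence of a in X, and the weights come from a caller-supplied function weights_for, a hypothetical stand-in for the calculation in S53.

```python
from collections import defaultdict


def find_all(text, sub):
    """Yield every start position of sub in text, allowing overlaps."""
    pos = text.find(sub)
    while pos != -1:
        yield pos
        pos = text.find(sub, pos + 1)


def collect_matches(x, substrings, documents, weights_for):
    """Build {document_number: [match information]} and the offset p0fit_sum.

    weights_for(a) is assumed to return (p0fit, p1fit, p2fit, p3fit, p4fit).
    Each entry is (sequence_num, startX, startdoc, termlength, score).
    """
    table = defaultdict(list)
    p0fit_sum = 0.0                                    # S51
    for a in substrings:                               # S52
        fits = weights_for(a)                          # S53
        p0fit_sum += fits[0]
        start_x = x.find(a)
        for doc_no, doc in enumerate(documents):
            positions = list(find_all(doc, a))         # S54/S55
            if not positions:
                continue
            dtf = len(positions)                       # S58: occurrences of a in this document
            score = fits[min(dtf, 4)] - fits[0]        # S59: weight minus p0fit
            for sequence_num, start_doc in enumerate(positions, start=1):
                table[doc_no].append((sequence_num, start_x, start_doc, len(a), score))
    return table, p0fit_sum
```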
[0092]
FIG. 6 is a processing flow for obtaining the similarity between the input sentence X and the document Y by adding the weights of the matched character strings using the list extracted from the matching information management table.
[0093]
First, in S71, the similarity between X and Y (hereinafter, sim) is initialized to zero.
[0094]
In S72, one of the lists related to Y recorded in the coincidence information management table is selected and read into I.
[0095]
In S73, it is determined whether the sequence_num of the read I is 1. This is a process for not adding the score of the same partial character string redundantly to sim. If sequence_num is not 1, S74 is skipped and the process proceeds to S75. If sequence_num = 1, the score of I is added to sim in S74.
[0096]
In S75, it is determined whether or not all the matching information recorded in the matching information list regarding Y has been checked. If all the matching information is checked, the offset p0fit_sum of the weight of the entire document is added to sim in S76. If there is matching information that has not been checked yet, the matching information that has not been checked yet is selected and read into I in S72, and the processing up to S74 is repeated.
[0097]
S77 is processing for returning the obtained sim as the similarity between X and Y.
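The addition of S71 to S77 can be sketched in a few lines of Python. The tuple layout (sequence_num, startX, startdoc, termlength, score) and the function name similarity_by_addition are assumptions made for this illustration.

```python
def similarity_by_addition(match_list, p0fit_sum):
    """Add the score of each matched substring once, then add the offset."""
    sim = 0.0                                      # S71
    for sequence_num, _start_x, _start_doc, _length, score in match_list:
        if sequence_num == 1:                      # S73: count each substring only once
            sim += score                           # S74
    return sim + p0fit_sum                         # S76
```

As a worked example, suppose two partial character strings each have p0fit = 0.5, and a document contains the first one twice (weight 2.0) and the second one not at all. Adding the weights directly gives 2.0 + 0.5 = 2.5. With the offset method, the recorded score is 2.0 - 0.5 = 1.5 and p0fit_sum is 1.0, which again gives 1.5 + 1.0 = 2.5.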
[0098]
FIG. 7 shows the flow of calculating p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5.
[0099]
First, in S81, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to zero.
[0100]
In S82, df of the partial character string a is calculated, and in S83, idf is calculated from df and the total number N of documents in the document database. This idf is a value based on the amount of information in the field of information theory, and a method using this value as the weight of the partial character string is well known.
[0101]
In S84, a threshold value tf_threshold for determining whether or not the partial character string a is a partial character string effective for search is calculated.
[0102]
In S85, it is determined whether or not the partial character string a is valid for the search from the values of tf and df. If tf / df> tf_threshold, it is determined that the partial character string is valid for search, and the process proceeds to S86. Otherwise, it is determined that the search is not effective, S86 and S87 are skipped, and p0fit, p1fit, p2fit, p3fit and p4fit are returned in S88. That is, 0 is returned for all values of p0fit, p1fit, p2fit, p3fit, and p4fit.
[0103]
In S86, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.
[0104]
The functions MAX and MIN in S87 return the maximum value and the minimum value, respectively, of the numerical values given as arguments. With these functions, the values of p0fit, p1fit, p2fit, p3fit, and p4fit are restricted to the range from LB to UB. LB and UB are parameters that limit the distribution of p0fit, p1fit, p2fit, p3fit, and p4fit. S88 is a process of returning p0fit, p1fit, p2fit, p3fit, and p4fit.
[0105]
As can be seen from the above description, the weights p0fit, p1fit, p2fit, p3fit, and p4fit are obtained as functions of tf, df, and idf. The coefficients used in S84 and S86 were determined from observed values of document data, based on the theory disclosed in Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, pp. 529-574, 1999. These coefficients are not limited to the numerical values shown and may be set to appropriate values according to the purpose.
[0106]
As the calculation processing flow of p0fit, p1fit, p2fit, p3fit, and p4fit in S53 of FIG. 5, the flow shown in FIG. 15 can be applied instead of the flow shown in FIG. The calculation processing flow in FIG. 15 will be described below.
[0107]
First, in S181, p0fit, p1fit, p2fit, p3fit, and p4fit are all initialized to zero.
[0108]
In S182, the df of the partial character string a and the number of documents in the document database in which the partial character string a appears twice or more (hereinafter referred to as df2) are calculated. In step S183, idf is calculated from df and the total number N of documents in the document database.
[0109]
In S184, a threshold df2_threshold for determining whether or not the partial character string a is a partial character string effective for search is set to 0.22.
[0110]
In S185, it is determined whether or not the partial character string a is valid for the search from the values of df and df2. If df2 / df> df2_threshold, it is determined that the partial character string is valid for the search, and the process proceeds to S186. Otherwise, it is determined that the search is not effective, S186 and S187 are skipped, and p0fit, p1fit, p2fit, p3fit and p4fit are returned in S188. That is, 0 is returned for all values of p0fit, p1fit, p2fit, p3fit, and p4fit.
[0111]
In S186, p0fit, p1fit, p2fit, p3fit, and p4fit are calculated.
[0112]
S187 restricts the values of p0fit, p1fit, p2fit, p3fit, and p4fit to the range from LB to UB in the same manner as S87 in FIG. 7. LB and UB are parameters that limit the distribution of p0fit, p1fit, p2fit, p3fit, and p4fit. S188 is processing for returning p0fit, p1fit, p2fit, p3fit, and p4fit.
[0113]
As can be seen from the above description, the calculation processing flow of FIG. 15 uses df2, instead of the tf used in the calculation processing flow of FIG. 7, as the criterion for determining whether the partial character string a is valid for the search. df2 / df represents how unevenly a partial character string is distributed, that is, the degree to which it appears concentrated in specific documents. Selecting partial character strings using this information improves the search accuracy.
[0114]
The threshold value df2_threshold in S184 and the coefficients used in S186 are not limited to the numerical values shown and can be set to appropriate values according to the purpose.
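The validity test of S185 and the range restriction of S187 can be sketched as follows. The weight formulas of S186 are not reproduced in this text, so the sketch takes the unrestricted weights as an input, raw_fits, which is a hypothetical parameter; the function name and the handling of df = 0 are also assumptions.

```python
def gated_clamped_weights(df, df2, raw_fits, lb, ub, df2_threshold=0.22):
    """Return [p0fit, ..., p4fit] after the S185 test and the S187 restriction.

    raw_fits stands in for the values computed in S186 (formulas not shown here).
    """
    # S185: the substring is treated as valid only if df2 / df exceeds the threshold.
    if df == 0 or df2 / df <= df2_threshold:
        return [0.0] * 5                       # S188 with all weights left at zero
    # S187: restrict each weight to the range LB..UB using MAX and MIN.
    return [min(max(w, lb), ub) for w in raw_fits]
```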
[0115]
FIG. 8 shows a configuration diagram of the match information management table. The match information management table consists of a list of match information for each document number. In FIG. 8, match information 1 and match information 5 are recorded in document number 0002, match information 2 is recorded in document number 0100, match information 3 and match information 6 are recorded, and match information 4 and match information 7 are recorded in document number 0111, each as a list. Each piece of match information stores the sequence number sequence_num of the partial character string in the document, the occurrence location of the partial character string in the input character string X (startX), the occurrence location of the partial character string in the document (startdoc), the length of the partial character string (termlength), and the weight attached to the partial character string (score).
[0116]
When match information 8 related to document number 0002 is newly obtained, as shown in FIG. 8, the pointer to the head of the list, which until then pointed to match information 5, is changed to point to match information 8, and a pointer to match information 5 is set in match information 8. Match information 8 is thus recorded at the top of the list of document number 0002.
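The per-document list with insertion at the head can be sketched in Python as a singly linked list. The class names MatchInfo and MatchTable and the method names are hypothetical; only the five stored fields come from the description of FIG. 8.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchInfo:
    sequence_num: int
    start_x: int
    start_doc: int
    term_length: int
    score: float
    next: Optional["MatchInfo"] = None   # pointer to the previously recorded entry


class MatchTable:
    """One singly linked list of match information per document number."""

    def __init__(self):
        self.heads = {}

    def add(self, doc_no, info):
        # The new entry points to the old head, then becomes the head.
        info.next = self.heads.get(doc_no)
        self.heads[doc_no] = info

    def entries(self, doc_no):
        node = self.heads.get(doc_no)
        while node is not None:
            yield node
            node = node.next
```

Inserting at the head takes constant time per entry, so the list for a document holds the most recently recorded match information first.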
[0117]
(Third embodiment)
Next, FIG. 9 shows an embodiment of a document search apparatus that searches for the document having the highest similarity to the input character string based on the n-gram DP similarity. The document search apparatus includes a document database 10, a character string input unit 11, a partial character string selection unit 13 (internal illustration omitted), a matching information collection unit 14 (internal illustration omitted), a similarity calculation unit 17, a similarity calculation control unit 18, a recursive execution control unit 19, and a search result output unit 12.
[0118]
The document database 10, the character string input unit 11, the partial character string selection unit 13, the matching information collection unit 14, and the search result output unit 12 have the same functions and configurations as the parts denoted by the same reference numerals in the first embodiment, so their description is omitted.
[0119]
The similarity calculation control unit 18 takes out a list related to a certain document Y from the coincidence information management table T2, and provides the similarity calculation unit 17 together with the character strings X and Y.
[0120]
The similarity calculation unit 17 calculates the similarity between X and Y based on the formula (1) or the formula (2) from the given list of matching information. In the middle of calculating the similarity, it is necessary to similarly determine the similarity for a part of character strings. This is implemented by repeatedly using the similarity calculation unit 17 by the recursive execution control unit 19. The similarity calculation unit 17 includes a matched character string similarity calculation unit 61, an arbitrary character string similarity calculation unit 62, and a maximum value selection unit 63. The matched character string similarity calculation unit 61 calculates SimDPs (α, β) in Expression (3). The arbitrary character string similarity calculation unit 62 calculates SimDPg (α, β) in equation (5). The maximum value selection unit 63 calculates SimDP (α, β) in Expression (2) by performing the function MAX on these. When both the character strings α and β received by the similarity calculation unit 17 are empty characters, the recursive execution control unit 19 sets SimDP (α, β) = 0.0. At this time, the matched character string similarity calculation unit 61, the arbitrary character string similarity calculation unit 62, and the maximum value selection unit 63 are not operated. Needless to say, this value of SimDP (α, β) = 0.0 implements the equation (1).
[0121]
The matched character string similarity calculation unit 61 is implemented by a character string separation control unit 71, a character string separation unit 72, a similarity calculation unit 73, an addition unit 74, and a maximum value selection unit 75, and calculates SimDPs (α, β) of Expression (3). SimDPs (α, β) is calculated only when the leading character string on which α and β match is a character string recorded in the match information management table T2.
[0122]
First, when the character strings α (= ξγ) and β (= ξδ) received by the matched character string similarity calculation unit 61 have no matching character string ξ, that is, when the matching character string ξ is an empty character string, the character string separation control unit 71 sets SimDPs (α, β) = 0.0 according to Equation (1).
[0123]
Next, for every matching character string ξ in the character strings α (= ξγ) and β (= ξδ) received by the matched character string similarity calculation unit 61, the character string separation control unit 71 operates the character string separation unit 72, the similarity calculation unit 73, and the addition unit 74 to calculate Score (ξ) + SimDP (γ, δ) included in Equation (3). The maximum value is then selected by the maximum value selection unit 75. As a result, SimDPs (α, β) shown in Equation (3) is obtained.
[0124]
The character string separation unit 72 separates the character string α into ξ and γ and the character string β into ξ and δ, then refers to the match information management table T2 and gives the weight Score (ξ) of ξ to the addition unit 74. γ and δ are given to the similarity calculation unit 73. The similarity calculation unit 73 calculates SimDP (γ, δ) in Expression (3). The similarity calculation unit 73 is actually implemented by the recursive execution control unit 19 applying the similarity calculation unit 17 to γ and δ. The addition unit 74 performs the addition of Expression (3).
[0125]
The arbitrary character string similarity calculation unit 62 is implemented by the similarity calculation units 81 to 83 and the maximum value selection unit 84, and calculates SimDPg (α, β) in Expression (2). The arbitrary character string similarity calculation unit 62 is executed when the leading character strings are different, or when the leading character strings match but are not registered in the match information management table. Depending on whether the first characters ζ and η of the received character strings α (= ζγ) and β (= ηδ) are present, the similarity calculation units 81, 82, and 83 obtain SimDP (α, δ), SimDP (γ, β), and SimDP (γ, δ), respectively. The similarity calculation units 81 to 83 are actually implemented by the recursive execution control unit 19 applying the similarity calculation unit 17 to α and δ, γ and β, and γ and δ, respectively. The maximum value selection unit 84 performs the function MAX of Expression (5).
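The recursive structure described for units 61 to 63 can be sketched in Python as follows. This is an illustration based only on the prose above, since Expressions (1) to (5) are not reproduced here: score is assumed to be a dictionary from registered partial character strings to their weights, the recursion works on suffix positions rather than on copied strings, and memoization with lru_cache is an addition for efficiency. The name sim_dp is hypothetical.

```python
from functools import lru_cache


def sim_dp(x, y, score):
    """Order-aware similarity of x and y from registered substring weights."""

    @lru_cache(maxsize=None)
    def rec(i, j):
        a, b = x[i:], y[j:]
        if not a and not b:                      # both strings empty: similarity 0.0
            return 0.0
        # Matched-prefix case: every registered common leading string xi.
        best_s = 0.0
        for xi, weight in score.items():
            if xi and a.startswith(xi) and b.startswith(xi):
                best_s = max(best_s, weight + rec(i + len(xi), j + len(xi)))
        # Arbitrary case: drop the first character of b, of a, or of both.
        candidates = []
        if b:
            candidates.append(rec(i, j + 1))
        if a:
            candidates.append(rec(i + 1, j))
        if a and b:
            candidates.append(rec(i + 1, j + 1))
        return max(best_s, max(candidates))

    return rec(0, 0)
```

For example, with weights {"ab": 1.0, "cd": 2.0}, the strings "abXcd" and "cdYab" share both bigrams but in opposite order, so only one of them can contribute and the result is 2.0, whereas simple addition of matched weights would give 3.0. Python's recursion limit restricts this sketch to short strings.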
[0126]
(Fourth embodiment)
A processing flow for obtaining the n-gram DP similarity between the character string X and the character string Y by software is shown in FIGS. 10 and 11. This process can be used instead of the process described in FIG. 6 as the internal process of S16 of FIG. 3. For execution, the computer system shown in the second embodiment is used.
[0127]
First, in S91, for each of X and Y, the position of the foremost first character (minX, minY) and the position of the rearmost last character (maxX, maxY) are obtained from among the first and last characters of the partial character strings recorded in the list. An array X_index of length maxX-minX + 1 and an array Y_index of length maxY-minY + 1 are prepared, and all elements are initialized to -1. These arrays correspond to the characters of X from minX to maxX and the characters of Y from minY to maxY.
[0128]
In the processing from S92 to S94, among the characters from minX to maxX in X and from minY to maxY in Y, 0 is assigned to the element of X_index or Y_index corresponding to each character that is the first or last character of a partial character string recorded in the list.
[0129]
In the processing from S95 to S99, serial numbers 0, 1, 2,..., X_index_num-1 are assigned in order from the front to i where X_index [i] = 0, and the numbers are substituted into X_index [i]. Therefore, X_index_num is the number of i for which X_index [i] ≠ −1.
[0130]
In S100 to S104, the same processing is performed on Y_index. A serial number such as 0, 1, 2,..., Y_index_num-1 is assigned to j where Y_index [j] = 0 in order from the front, and the number is substituted into Y_index [j]. Therefore, Y_index_num is the number of j where Y_index [j] ≠ −1.
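The index preparation of S91 to S104 can be sketched for one string as follows; the same function would be applied once to X and once to Y. The function name compress_positions and the (start, length) input format are assumptions made for this illustration, and the list of spans is assumed to be non-empty.

```python
def compress_positions(spans):
    """Number the characters that begin or end a matched substring.

    spans: non-empty list of (start, length) pairs for one string.
    Returns (min_pos, index, index_num).
    """
    lo = min(start for start, _ in spans)
    hi = max(start + length - 1 for start, length in spans)
    index = [-1] * (hi - lo + 1)                 # S91: all elements initialized to -1
    for start, length in spans:                  # S92-S94: mark first and last characters
        index[start - lo] = 0
        index[start + length - 1 - lo] = 0
    serial = 0
    for i, value in enumerate(index):            # S95-S99: assign serial numbers from the front
        if value == 0:
            index[i] = serial
            serial += 1
    return lo, index, serial
```

Only the marked characters receive a row or column in the score table, which is what keeps the table small compared with a table over every character.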
[0131]
In S105, the match information of X and Y recorded in the list is first rearranged in the order in which the partial character strings appear in X. Next, match information with the same appearance position in X is rearranged in the order in which the partial character strings appear in Y, and the number of pieces of match information is read into m.
[0132]
Next, as preparation for efficiently obtaining the similarity by DP, a score table scoretable of (X_index_num + 2) rows and (Y_index_num + 2) columns is created, and all elements of the table are initialized to zero. In this table, the vertical direction corresponds to the characters in the character string X that are the first or last character of a partial character string recorded in the list, and the horizontal direction corresponds to the characters in the character string Y that are the first or last character of a partial character string recorded in the list.
[0133]
In S106, k and i are initialized to k = 1 and i = 0.
[0134]
In S107, j is initialized to j = 0. The variable k indicates that attention is currently focused on the k-th match information from the beginning, and i and j are variables indicating which character is currently focused on among the characters in X and Y, respectively, that are the first or last character of a partial character string recorded in the list. In S108, scoretable [i] [j] is assigned to currentscore as the current score.
[0135]
Here, for convenience of explanation, the kth matching information from the top of the list is represented as startX (k), startdoc (k), termlength (k), and score (k), respectively.
[0136]
In S109, it is determined whether or not the location indicated by i and j matches the location where the partial character string of the kth matching information from the front appears. If they match, the process proceeds to S110, and if they do not match, the process proceeds to S114.
[0137]
In S110, a line number and a column number corresponding to the final character of the matched partial character string are obtained in the score table, and are substituted into target_i and target_j, respectively.
[0138]
In S111, the score currently held in scoretable [target_i] [target_j] is compared with the score obtained by adding the weight score (k) of the matched partial character string to currentscore. If currentscore + score (k) is larger than scoretable [target_i] [target_j], currentscore + score (k) is substituted into scoretable [target_i] [target_j] in S112. Otherwise, S112 is skipped and the process proceeds to S113.
[0139]
In S113, 1 is added to the value of k so that attention moves to the next match information in the list, and the process returns to S109. In S109, if the location indicated by i and j also matches the next match information, the processing up to S113 is repeated; if not, the process proceeds to S114.
[0140]
In the processing from S114 to S119, the current score currentscore is compared with the right, lower, and lower right scores of the score table, and if the currentscore is larger, the currentscore is substituted.
[0141]
In S120, it is determined whether j has reached the right end of the score table. If it has not come to the right end, 1 is added to j in S121, the process returns to S108, and the processes up to S119 are repeated. If it has come to the right end, it will progress to S122.
[0142]
In S122, it is determined whether i has reached the lower end of the score table. If it has not come to the lower end, 1 is added to i in S123, the process returns to S107, and the processes up to S120 are repeated. If it has reached the lower end, the process proceeds to S124, and scoretable [X_index_num + 1] [Y_index_num + 1] is returned as the similarity between X and Y.
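The table-filling idea of S106 to S124 can be sketched in Python as follows. This is a simplified illustration, not the flow of FIGS. 10 and 11: it indexes the table by raw character positions instead of the compressed X_index and Y_index numbers, and it looks up matches by their start positions in a dictionary instead of walking a sorted list with k. The function name and the (start_x, start_y, length, score) input format are assumptions.

```python
def ngram_dp_similarity(len_x, len_y, matches):
    """Forward DP over collected match information.

    matches: list of (start_x, start_y, length, score) with
    start_x + length <= len_x and start_y + length <= len_y.
    """
    table = [[0.0] * (len_y + 1) for _ in range(len_x + 1)]
    starts = {}
    for sx, sy, length, score in matches:
        starts.setdefault((sx, sy), []).append((length, score))

    for i in range(len_x + 1):
        for j in range(len_y + 1):
            current = table[i][j]                       # cf. S108
            # cf. S109-S113: jump past each match that starts here.
            for length, score in starts.get((i, j), []):
                ti, tj = i + length, j + length
                if current + score > table[ti][tj]:
                    table[ti][tj] = current + score
            # cf. S114-S119: carry the current score right, down, and diagonally.
            for di, dj in ((0, 1), (1, 0), (1, 1)):
                ni, nj = i + di, j + dj
                if ni <= len_x and nj <= len_y and current > table[ni][nj]:
                    table[ni][nj] = current
    return table[len_x][len_y]                          # cf. S124
```

Because every update targets a cell to the right of or below the current one, each cell already holds its final value when the row-by-row scan reaches it.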
[0143]
[Effects of the invention]
According to the first aspect of the present invention, the similarity can be obtained by selecting, from the partial character strings cut out from the input character string, those that are effective in calculating the similarity to the documents in the document database. That is, the similarity can be obtained using only character strings that are highly effective for the search.
[0144]
According to the second aspect of the present invention, it is possible to obtain the similarity by paying attention to the partial character strings that match the order in each of the two character strings. That is, the similarity considering the appearance order of the character strings is obtained.
[0145]
According to the third aspect of the present invention, it is possible to obtain the similarity by sorting the character strings in descending order of the effect on the similarity calculation. As a result, the determined similarity becomes a more appropriate value.
[0146]
According to the fourth aspect of the present invention, not only a method of adding weights of matched partial character strings but also other similarity calculation methods can be combined. This further improves the search accuracy.
[0147]
According to the fifth aspect of the present invention, it is possible to realize an apparatus that selects, from the partial character strings cut out from the input character string, those that are effective in calculating the similarity to the documents in the document database, and obtains the similarity using only character strings that are highly effective for the search.
[0148]
According to the sixth aspect of the present invention, it is possible to realize an apparatus that obtains a similarity that takes the appearance order of character strings into account, by paying attention to common partial character strings whose order matches in the two character strings.
[0149]
According to the seventh aspect of the present invention, it is possible to provide a program that selects, from the partial character strings cut out from the input character string, those that are effective in calculating the similarity to the documents in the document database, and obtains the similarity using only character strings that are highly effective for the search.
[0150]
According to the eighth aspect of the invention, it is possible to provide a program that obtains a similarity that takes the appearance order of character strings into account, by paying attention to common partial character strings whose order matches in the two character strings.
[0151]
According to the ninth aspect of the present invention, the effects described above can be realized on various computers.
[0152]
Next, the search performance regarding the first and second embodiments will be described. An evaluation experiment was conducted to confirm the effect of selecting a partial character string cut out from the input character string and the effect of using tf instead of df as a selection criterion for the partial character string.
[0153]
FIG. 12 shows the relationship between the number of selected partial character strings and the search accuracy when tf and df are used as selection criteria, and FIG. 13 shows the relationship between the number of selected partial character strings and the search time. A value called the 11pt average accuracy was used as the search accuracy. The 11pt average accuracy is an evaluation index obtained by taking the precision at each of the 11 recall levels from 0 to 1 in 0.1 increments and averaging these values. For details, see G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, pp. 174-181, McGraw-Hill Book Co., New York, 1983.
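An 11-point average of this kind can be computed as in the following Python sketch. It assumes the commonly used interpolated form, in which the precision at a recall level is the highest precision observed at that recall or higher; the text above states only that the 11 values are averaged, so the interpolation rule and the function name are assumptions.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Average the interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance: list of booleans, one per returned document in rank order.
    total_relevant: number of relevant documents for the query (must be > 0).
    """
    points = []                       # (recall, precision) after each rank
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    levels = [k / 10 for k in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / 11
```

For a ranking that returns relevant, non-relevant, relevant, non-relevant with two relevant documents in total, the precision is 1.0 for the six recall levels up to 0.5 and 2/3 for the five levels above it, so the average is (6 + 10/3) / 11, roughly 0.848.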
[0154]
The length of the partial character strings cut out from the input character string is 2, that is, only bigrams are used. The parameters LB and UB used in S87 of FIG. 7 are set to LB = 0 and UB = idf, respectively. The document data used is a representative set of test data for similar information search called the NTCIR1 test collection, which includes about 300,000 documents and 53 search sentences.
[0155]
In FIGS. 12 and 13, the horizontal axis indicates the number of selected partial character strings, that is, the value of SubStringLimit. The vertical axis in FIG. 12 indicates the search accuracy, and the vertical axis in FIG. 13 indicates the time taken to search for the 53 search sentences.
[0156]
Referring to FIG. 12, for both tf and df, the search accuracy improves as SubStringLimit is increased while its value is small, but once SubStringLimit has been increased to a certain extent, the search accuracy changes little. Therefore, if SubStringLimit is set to an appropriate value, a decrease in search accuracy can be suppressed even when partial character strings are selected. Comparing tf and df, the search accuracy is higher when the selection is based on df, but once the value of SubStringLimit is increased to some extent, selection based on tf gives almost the same search accuracy.
[0157]
Referring to FIG. 13, for both tf and df, the search time becomes shorter as the value of SubStringLimit decreases, which confirms the speed-up obtained by selecting partial character strings. Comparing tf and df, the search time is shorter with tf, and the difference becomes larger as the value of SubStringLimit becomes smaller. When SubStringLimit is set to 22, a value that gives high search accuracy for both tf and df, the search time is 264.6 seconds for tf and 367.1 seconds for df, so tf is about 1.4 times faster. Thus, the search time can be improved by the present invention.
[0158]
Next, search performance when the methods according to the third and fourth embodiments are applied to the calculation of the similarity in S16 of FIG. 3 will be described. The search accuracy is compared between the case of using the addition method of FIG. 6 and the method of DP of FIGS. 10 and 11 as the similarity calculation method, and the results are shown in FIG. The search accuracy was 11 pt average accuracy. The length of the partial character string cut out from the input character string is 2, that is, only bigram, and the parameters LB and UB used in S87 of FIG. 7 are LB = 0 and UB = idf, respectively. The document data used is the NTCIR1 test collection.
[0159]
The horizontal axis of FIG. 14 indicates the number of selected partial character strings, that is, the value of SubStringLimit, and the vertical axis indicates the search accuracy. As shown in FIG. 14, when SubStringLimit is 15 or less, there is no large difference in search accuracy, but when it exceeds 15, the method using DP is more accurate than the method using addition. When the search time was also measured, the DP method of the present invention required approximately twice the search time of the addition method, but was several tens of times faster than the existing DP method.
[0160]
Thus, according to the present invention, applying DP to the collected match information improves the search accuracy compared with the addition method, and allows the similarity to be calculated faster than with the existing DP method.
[Brief description of the drawings]
FIG. 1 is a diagram showing an embodiment of a document search apparatus according to the present invention.
FIG. 2 is a diagram showing an embodiment of a computer system for performing a text search according to the present invention.
FIG. 3 is a flowchart of a text search process according to an embodiment of the present invention.
FIG. 4 is a flowchart of processing for selecting a partial character string cut out from a character string.
FIG. 5 is a flowchart of a process for collecting and recording coincidence information.
FIG. 6 is a flowchart of processing for obtaining similarity by addition from coincidence information.
FIG. 7 is a flowchart of processing for obtaining a weight of a partial character string.
FIG. 8 is a configuration diagram of a matching information management table.
FIG. 9 is a diagram showing another embodiment of the document search apparatus according to the present invention.
FIG. 10 is a flowchart of the first half of processing for obtaining n-gram DP similarity from coincidence information.
FIG. 11 is a flowchart of the latter half of the process of obtaining n-gram DP similarity from coincidence information.
FIG. 12 is a diagram showing the relationship between the number of selected partial character strings and search accuracy when tf and df are used as selection criteria.
FIG. 13 is a diagram showing the relationship between the number of selected partial character strings and search time when tf and df are used as selection criteria.
FIG. 14 is a diagram showing a comparison of search accuracy between a similarity calculation method using DP and a similarity calculation method by addition in the present invention.
FIG. 15 is a flowchart of another process for obtaining the weight of a partial character string.
[Explanation of symbols]
10: Document database
10a, 10b, 10c: Document
11: Character string input part
12: Search result output part
13: Partial character string selector
14: Matching information collection unit
15, 17, 73, 81, 82, 83: Similarity calculation unit
16, 18: Similarity calculation control unit
19: Recursive execution control unit
31: Partial character string segmentation control unit
32: Partial character string cutout part
33: Character string appearance frequency calculation unit
34: Partial character string registration part
41: Matching information collection control unit
42: Character string appearance location search part
43: Character string weight calculation unit
44: Matching information registration control unit
45: Match information registration section
51: Character string weight addition control unit
52: Character string weight addition unit
61: Matched character string similarity calculation unit
62: Arbitrary character string similarity calculation unit
63, 75, 84: Maximum value selection section
71: Character string separation control unit
72: Character string separator
74: Adder
101: Display
102: Printer
103: Keyboard
104: Floppy (R) disk device
105: CD-ROM device
106: Read-only memory (ROM)
107: Random access memory (RAM)
108: Magnetic disk device
109: Central processing unit (CPU)
110: Communication interface
111: Bus
112: Floppy (R) disk
113: CD-ROM
114: Communication network

Claims

1. A character string similarity calculation device for calculating a similarity between an input character string and a character string in a document included in a document database, wherein
the document database includes at least one document, the device comprising:
cutting means for sequentially cutting out partial character strings of a predetermined length from the input character string;
appearance frequency calculating means for calculating, for each of the sequentially cut-out partial character strings, its appearance frequency in the document database;
partial character string selection means for selecting, based on the appearance frequency of each partial character string, a predetermined number of partial character strings having a relatively low appearance frequency from among the sequentially cut-out partial character strings, as a selected character string group;
search means for searching, for each partial character string included in the selected character string group, for its appearance locations in the documents included in the document database, and for outputting the appearance frequency, in each document, of each partial character string included in the selected character string group;
weight calculating means for calculating, for each partial character string included in the selected character string group, a weight in each document using the one function that corresponds to the appearance frequency of that partial character string in the document, from among a plurality of functions defined in advance in association with values of the appearance frequency in a document; and
similarity calculation means for calculating, for a document included in the document database, the similarity by adding the weights associated with each of the partial character strings of the selected character string group that appear in the document.

2. The character string similarity calculation device according to claim 1, wherein the weight calculation means determines, for each of the partial character strings included in the selected character string group, whether the ratio of the number of documents that include the partial character string two or more times to the number of documents that include the partial character string at least once exceeds a predetermined threshold value, and sets the weight of the partial character string to zero if the ratio is less than the predetermined threshold value.

3. The character string similarity calculation device according to claim 1, wherein the weight calculation means determines, for each of the partial character strings included in the selected character string group, whether the ratio of the appearance frequency of the partial character string in the document database to the number of documents that include the partial character string at least once exceeds a predetermined threshold value, and sets the weight of the partial character string to zero if the ratio is less than the predetermined threshold value.
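The two ratio tests described in the preceding claims can be illustrated with a minimal sketch. The function names are hypothetical, the threshold is left as a caller-supplied parameter because its value is not given here, and returning a ratio of 0.0 for a string that appears in no document is an assumption.

```python
def count_occurrences(text: str, gram: str) -> int:
    """Count occurrences of gram in text, including overlapping ones."""
    return sum(text.startswith(gram, i) for i in range(len(text) - len(gram) + 1))


def repeat_ratio(gram, documents):
    """Documents containing gram two or more times / documents containing it at all."""
    counts = [count_occurrences(doc, gram) for doc in documents]
    df = sum(1 for c in counts if c >= 1)
    df2 = sum(1 for c in counts if c >= 2)
    return df2 / df if df else 0.0  # assumption: absent strings get ratio 0.0


def occurrences_per_document(gram, documents):
    """Total occurrences of gram in the database / documents containing it."""
    counts = [count_occurrences(doc, gram) for doc in documents]
    df = sum(1 for c in counts if c >= 1)
    return sum(counts) / df if df else 0.0  # assumption: absent strings get 0.0


def gate_weight(weight, ratio, threshold):
    """Set the weight to zero when the ratio is below the threshold."""
    return 0.0 if ratio < threshold else weight
```

For the documents "abab", "ab" and "cd", the string "ab" has a repeat ratio of 1/2 and 3/2 occurrences per containing document, so a threshold of 0.6 on the first ratio would zero its weight.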

4. The character string similarity calculation device according to claim 3, wherein the predetermined threshold value is calculated based on the number of documents that include the corresponding partial character string at least once and the total number of documents included in the document database.

5. The character string similarity calculation device according to any one of claims 1 to 4, wherein each of the plurality of predefined functions is a function including a value idf that is based on an amount of information,
the value idf being expressed by the following equation:
idf = log₂(N / df)
where N is the number of documents included in the document database, and df is the number of documents, among the documents included in the document database, that include the corresponding partial character string.
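As a worked illustration of the equation above, a partial character string found in 4 of 1,024 documents has idf = log₂(1024 / 4) = log₂(256) = 8, while a string found in every document has idf = log₂(1) = 0. A minimal sketch follows; the function name and the error handling for df = 0 are assumptions.

```python
import math


def idf(num_documents: int, document_frequency: int) -> float:
    """idf = log2(N / df), where N is the database size and df the number of
    documents containing the partial character string."""
    if document_frequency <= 0 or document_frequency > num_documents:
        raise ValueError("df must satisfy 0 < df <= N")
    return math.log2(num_documents / document_frequency)
```

The rarer a string is across the database, the larger its idf, which matches the idea that strings appearing in many documents contribute little to distinguishing them.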

6. The character string similarity calculation device according to any one of claims 1 to 5, further comprising output means for selecting and outputting, from among the documents included in the document database, documents for which the calculated similarity is relatively high.

7. A character string similarity calculation program that causes a computer to function as a character string similarity calculation device for calculating a similarity between an input character string and a character string in a document included in a document database, wherein
the document database includes at least one document, and
the program causes the computer to function as:
cutting means for sequentially cutting out partial character strings of a predetermined length from the input character string;
appearance frequency calculating means for calculating, for each of the sequentially cut-out partial character strings, its appearance frequency in the document database;
partial character string selection means for selecting, based on the appearance frequency of each partial character string, a predetermined number of partial character strings having a relatively low appearance frequency from among the sequentially cut-out partial character strings, as a selected character string group;
search means for searching, for each partial character string included in the selected character string group, for its appearance locations in the documents included in the document database, and for outputting the appearance frequency, in each document, of each partial character string included in the selected character string group;
weight calculating means for calculating, for each partial character string included in the selected character string group, a weight in each document using the one function that corresponds to the appearance frequency of that partial character string in the document, from among a plurality of functions defined in advance in association with values of the appearance frequency in a document; and
similarity calculation means for calculating, for a document included in the document database, the similarity by adding the weights associated with each of the partial character strings of the selected character string group that appear in the document.

8. A computer-readable recording medium on which the character string similarity calculation program according to claim 7 is recorded.

9. A character string similarity calculation method for calculating, using a computer including a processing device, a similarity between an input character string and a character string in a document included in a document database, wherein
the document database includes at least one document, and
the character string similarity calculation method comprises:
a step in which the processing device sequentially cuts out partial character strings of a predetermined length from the input character string;
a step in which the processing device calculates, for each of the sequentially cut-out partial character strings, its appearance frequency in the document database;
a step in which the processing device selects, based on the appearance frequency of each partial character string, a predetermined number of partial character strings having a relatively low appearance frequency from among the sequentially cut-out partial character strings, as a selected character string group;
a step in which the processing device searches, for each partial character string included in the selected character string group, for its appearance locations in the documents included in the document database, and outputs the appearance frequency, in each document, of each partial character string included in the selected character string group;
a step in which the processing device calculates, for each partial character string included in the selected character string group, a weight in each document using the one function that corresponds to the appearance frequency of that partial character string in the document, from among a plurality of functions defined in advance in association with values of the appearance frequency in a document; and
a step of calculating, for a document included in the document database, the similarity by adding the weights associated with each of the partial character strings of the selected character string group that appear in the document.
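The sequence of steps recited above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the appearance frequency used for selection is taken to be the document frequency, n-grams that never occur are skipped, and the table PLACEHOLDER_WEIGHTS, which picks a function of idf according to the in-document frequency, contains invented placeholder functions because the actual functions are not given in this section.

```python
import math


def count_occurrences(text: str, gram: str) -> int:
    """Count occurrences of gram in text, including overlapping ones."""
    return sum(text.startswith(gram, i) for i in range(len(text) - len(gram) + 1))


# Placeholder functions (NOT from the source), keyed by in-document frequency.
# Frequencies above the largest key reuse the function for the largest key.
PLACEHOLDER_WEIGHTS = {1: lambda idf: idf, 2: lambda idf: 1.5 * idf}


def score_documents(query, documents, substring_limit, n=2,
                    weight_functions=PLACEHOLDER_WEIGHTS):
    """Return one similarity score per document, in database order."""
    # Step 1: cut out the n-grams of the query (repeats dropped, order kept).
    grams = list(dict.fromkeys(query[i:i + n] for i in range(len(query) - n + 1)))
    # Step 2: appearance frequency in the database (assumed: document frequency).
    df = {g: sum(1 for doc in documents if g in doc) for g in grams}
    # Step 3: keep the substring_limit rarest n-grams that occur at all.
    selected = sorted((g for g in grams if df[g] > 0), key=lambda g: df[g])
    selected = selected[:substring_limit]
    scores = []
    for doc in documents:
        score = 0.0
        for gram in selected:
            # Step 4: appearance frequency of the n-gram in this document.
            tf = count_occurrences(doc, gram)
            if tf == 0:
                continue
            # Step 5: choose the weight function that corresponds to tf.
            idf = math.log2(len(documents) / df[gram])
            weight = weight_functions[min(tf, max(weight_functions))](idf)
            # Step 6: add the weights of all selected n-grams found in the document.
            score += weight
        scores.append(score)
    return scores
```

Sorting the documents by these scores in descending order and returning the top entries would correspond to outputting the documents with relatively high similarity.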