JP4073015B2 - Similarity calculation method, apparatus, program, and recording medium storing the program

Info

Publication number: JP4073015B2
Application number: JP2003058542A
Authority: JP
Inventors: 潤鈴木; 英作前田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-03-05
Filing date: 2003-03-05
Publication date: 2008-04-09
Anticipated expiration: 2023-03-05
Also published as: JP2004272352A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力されたテキスト間の類似度を計算する方法及び装置に関する。
【０００２】
【従来の技術】
あるテキストと別のあるテキストとが構造的、意味的、内容的に相互にどの程度類似性があるかを効率的に計算する手法に、関心が集まっている。例えば、テキスト分類タスクは、計算機を用いて、特徴が類似しているテキストを一つのカテゴリとしてまとめ上げることを目的としている。つまり、各テキストがどの程度「似ているか」という類似度指標がテキスト分類において重要な要素である、と考えることができる。また、テキストによる質問応答技術でも、被検索対象となるテキスト集合から質問との類似度が高いテキストを抽出することを目的としていることから、テキスト間の類似度の計算が重要な役割を果たす。このように、テキスト処理の分野では、テキスト間の類似度を必要とするアプリケーションは数多く存在する。
【０００３】
テキストの特徴を表現する方法として、テキスト中の各出現単語をベクトルの一つの要素と考え、単語の出現回数を値とし、与えられたテキストをｎ次元ベクトル（ｎ；単語数）に変換する方法がある。このような出現単語を要素として、テキストの特徴をベクトルで表わす方法は、「bag of words」と呼ばれる。つまり、テキストは単語の集合で特徴付けられると考える方法である。このような単語ベクトルでテキストを表現する方法は、テキスト分類などの類似度計算時に、現在、最もよく用いられている方法である（非特許文献１）。
【０００４】
また、テキスト間の類似度を計算する方法として、最も一般的かつ効率的な方法は、テキストから得られたｎ次元の単語ベクトルの内積あるいはコサイン（余弦）距離を計算する方法である。具体的な計算式を以下に示す。図１に単語ベクトルのコサイン距離による類似度計算方法を示す。また、具体的な計算式は次式で表わされる。
【０００５】
【数１】

【０００６】
図１に示したものは、それぞれ「私は車を買った」、「私の買った車」及び「私は家を買う」であるテキストＴ１〜Ｔ３に対し、各テキスト内の単語（表層）について原形と品詞とを求め、単語ベクトルによって類似度を計算したものである。ここで、原形だけを用いて単語ベクトルを作成した場合には、Ｔ１とＴ２との類似度（Ｔ１＊Ｔ２）は０．７３０であり、同様に、Ｔ２＊Ｔ３として０．４が、Ｔ１＊Ｔ３として０．７３０が得られている。また、原形と品詞とを用いて類似度を計算すると、Ｔ１＊Ｔ２＝０．８６６、Ｔ２＊Ｔ３＝０．６９４及びＴ１＊Ｔ３＝０．８６８が得られている。
【０００７】
一般的に、テキスト中には構造が存在すると考えられている。また、その構造がテキストの意味を形成すると考えられている。最も大きな構造としては、段落、文、文節、形態素といったテキスト内の文字の意味のあるまとまりが考えられ、それ以外にも、文節の依存関係や、照応関係、単語の出現順序などが考えられる。図２は、テキストの構造の一例を示している。
【０００８】
前述のように、単語ベクトルを用いた方法では、テキスト中の各出現単語単体に着目するが、単語がテキスト中でどのように出現したかは考慮されない。つまり、対象とする単語が、どの単語の前に出現したか、どの単語の後に出現したか、どの単語と係り受けの関係にあるか、といったテキスト内に含まれる意味的、構造的な特徴は考慮されない。このような方法では、テキストの特徴をよく反映したテキストの類似度とはいえない。単語が表わす性質は、語の依存関係や、意味的な構造から語自体の意味や性質が決定することが多い。具体的な例として、いわゆる多義語は、構造を無視した時点で、その語がどのような意味でテキスト中に出現したかを判定することは、困難になる。
【０００９】
テキスト中の構造をベクトル表現に置き換える方法も考えられるが、例えば、単語の組み合わせをベクトルの要素にする方法を考えた場合、単語数は数万語であり、かつ、その組み合わせ数は指数関数的に増加することから、数え上げるのは現実的とはいえない。以上の理由から、この方法は、実際に用いられることはない。また、部分的な構造として、限定した組み合わせのみをベクトル中の要素として扱う方法もあるが、結局、複雑なテキストの構造を的確に扱うことはできない。
【００１０】
このように、単語ベクトルを用いる方法、あるいはその拡張としてある程度の構造を考慮する方法は、非常に容易かつ効率的ではあるが、テキストの特徴を十分に反映した類似尺度であるとは言いがたい。
【００１１】
【非特許文献１】
Salton, G., Wong, A. and Yang, C.: "A Vector Space Model for Automatic Indexing," Communications of the ACM, Vol. 18, No. 11, pp. 613-620 (1975)
【００１２】
【発明が解決しようとする課題】
以上説明したように、現状では、テキストの類似度の評価に際してテキストの構造そのものを計算機で扱うのは難しいという課題と、さらに、構造を考慮した類似度を計算するためには、計算が複雑かつ時間がかかるという課題がある。
【００１３】
そこで本発明の目的は、テキストの構造を反映してテキスト間の類似度を簡単に算出できる方法および装置を提供し、さらに、構造を考慮した類似度の計算における計算を簡単かつ計算量の少ないものとすることができる方法及び装置を提供することにある。
【００１４】
【課題を解決するための手段】
１点目の課題であるテキストの構造は扱いが困難であるという課題を解決するために、本発明は、テキストが持つ構造を非循環有向グラフとみなす方法を提案する。
【００１５】
すなわち本発明の類似度計算方法は、テキスト入力部が類似度計算対象のテキストを入力するステップと、非循環有向グラフ生成手段が、入力したテキストをそれぞれ階層を許した非循環有向グラフとして表現するステップと、ソート処理部が、類似度計算対象である１対のテキストにそれぞれ対応する非循環有向グラフのノードを計算順序でソートするステップと、再帰計算部が、ソートされた順番にしたがって再帰式を計算するステップと、を有し、類似度計算対象である１対のテキストに対応する２つの非循環有向グラフにおける全部分パス中の一致する部分パス数の総和として与えられる２つの非循環有向グラフの相互の類似度を再帰計算によって計算し、類似度をテキスト間の類似度として出力することを特徴とする。
【００１６】
また本発明の類似度計算装置は、テキストを入力するテキスト入力部と、入力したテキストをそれぞれ階層を許した非循環有向グラフとして表現する非循環有向グラフ生成手段と、非循環有向グラフを格納するグラフ格納部と、類似度計算対象である１対のテキストにそれぞれ対応する非循環有向グラフのノードを計算順序でソートするソート処理部と、ソートされた順番にしたがって再帰式を計算する再帰計算部と、を有し、類似度計算対象である１対のテキストにそれぞれ対応する非循環有向グラフにおける全部分パス中の一致する部分パス数の総和として与えられる第１および第２の非循環有向グラフの相互の類似度を再帰計算によって計算し、その類似度をテキスト間の類似度として出力することを特徴とする。
【００１７】
このような本発明においては、テキストの構造を考慮した類似度の計算を、非循環有向グラフ同士の類似度を計算することと等価であると考えることができる。また、テキストの構造は複雑であるが、非循環有向グラフの制約内で十分記述可能である。このように非循環有向グラフとして考えることにより、ノードを単語、リンクを構造というように、自然にテキストの構造を記述することができる。つまり、従来の手法のようにベクトル表現のような一次元の配列に置き換えるわけではないため、より直観的、直接的にテキストの構造を表現することができる。図２からも、テキストを、階層を許した非循環有向グラフで捉えることができることが分かる。
【００１８】
２点目のテキストの構造を考慮した類似度を計算しようとすると計算が複雑かつ時間がかかるという課題を解決するために、本発明では、テキスト間の類似度の計算式を再帰的に定義し、全部分を陽に計算することなく全体の類似度を計算する方法を提案する。本発明では、テキストを非循環有向グラフとみなすため、２つの非循環有向グラフ中の全部分パス中で一致する部分パス数の重み付き総和を、２つの非循環有向グラフの類似度として扱う。図３は、本発明で用いる階層を許した非循環有向グラフの部分パスの一例を示している。ここでは、部分パスとしてノード単体も含むことにする。また、パス同士の一致度を計算する際に、始点及び終点以外の中間ノードの差異、伸縮を許した一致も数え上げる。ただしこの場合は、ペナルティλを与えて類似度を計算する。このペナルティが類似度計算時の重みとなる。これにより、完全一致でない柔軟な類似度を計算することが可能になる。
【００１９】
このように、ラベルの差異なども許した全部分パスの一致数を陽に数え上げるのは、ノード数が多くなった場合には、非常に困難であると推測できる。しかし実際の計算では、全部分パスを陽に数え上げてその中から一致するパスを数え上げるのではなく、再帰計算式を定義することで、効率的に一致する部分パスの総和を計算することが可能である。また、類似度を計算する対象を２つに限定して計算することで、効率的な計算が可能となる。これは、カーネル法によるカーネルトリックと呼ばれている計算方法の一種と考えられるものであって、ある２つのテキスト間の内積形を定義することで、実際に全ての要素を陽に展開することなく高次元の計算を低次元の内積計算に置き換える方法である。この方法と同様に、本発明では、対象となるテキストの部分パスを陽に全展開することなく、再帰式で定義される計算式から、効率的に計算することが可能である。
【００２０】
階層を許した２つの非循環有向グラフの類似度は、以下の式で与えられる。
【００２１】
【数２】

【００２２】
つまり、各ノードの関数Ｋ(・,・)の値の総和で与えられる。関数Ｋ(・,・)は、以下の再帰式で定義することができる。
【００２３】
【数３】

【００２４】
次に、
【００２５】
【数４】

【００２６】
の定義を以下のように与える。
【００２７】
【数５】

【００２８】
ノード中にグラフを含んでいない場合には(6-1)式となり、グラフを含む場合には(6-2)式となる。
【００２９】
最後に、
【００３０】
【数６】

【００３１】
を計算するための関数Ｋ″(・,・)は、以下の式で定義される。
【００３２】
【数７】

【００３３】
これらの再帰式を計算することにより、結果的には階層を許した非循環有向グラフの全部分パスの一致した数を計算するとの等しい結果が得られる。また、計算量はＯ(|Ｇ₁||Ｇ₂|)となり、各々のグラフに含まれるノード数の積に比例した計算量で計算することが可能である。
【００３４】
【発明の実施の形態】
次に、本発明の好ましい実施の形態について、図面を参照して説明する。
【００３５】
図４は、本発明の実施の一形態の類似度計算装置の構成を示すブロック図であり、図５は、本実施形態での類似度の計算手順を示すフローチャートである。
【００３６】
図４に示した装置は、入力テキストが入力するテキスト入力部１１と、入力テキストを格納するテキスト格納部１２と、テキスト格納部１２内のテキストに対して形態素解析を行う形態素解析部１４と、形態素解析部１４での形態素解析の結果に基づいて文節へのまとめ上げを行う文節解析部１５と、文節解析部１５で得られた文節に関して依存関係を決定する依存関係解析部１６と、形態素解析の結果、文節へのまとめ上げの結果および依存関係の解析結果に基づいて、処理対象のテキストに対応する階層を許した非循環有向グラフを生成する非循環有向グラフ生成部１７と、生成した非循環有向グラフを格納するグラフ格納部１８と、グラフ格納部１８に格納されたそれぞれ異なるテキストに対応する２つの非循環有向グラフを取り出して、これら２つの非循環有向グラフ間の類似度を計算して出力する類似度計算部１９と、を備えている。類似度計算部１９は、上述した再帰式を用いた計算手法によって類似度を計算するために、非循環有向グラフのノードを計算順序でソートするソート処理部２１と、ソートされた順序にしたがって再帰計算を行うための再帰計算部２２とを備えている。
【００３７】
次に、本実施形態におけるテキスト間の類似度を計算する手順について、説明する。本実施形態の計算手順は、大まかに言うと、
（１）比較対象となる２つのテキストを選択
（２）階層を許した非循環有向グラフにテキストを変換
（３）選択されたテキスト間の類似度を計算
（３．１）各非循環有向グラフを計算順序にソート
（３．２）ソートされた順序にしたがって再帰式を計算
の各手順からなっている。最も効率的に再帰的に計算するために、手順（３．１）において、階層を許した非循環有向グラフ中の各ノードの計算順序を決める必要がある。ただし、非循環有向グラフの性質として、既に半順序が決定している。半順序が保たれていれば、効率的に計算することができる。
【００３８】
ここで、２つの入力テキスト（テキストＡとテキストＢ）の間の類似度を計算するものとすると、図５に示すように、ステップ１０１においてテキスト入力部１１においてテキストＡ及びテキストＢを受け取ってテキスト格納部１２に格納し、各テキストごとに、ステップ１０２において形態素解析部１４によって形態素解析を行い、ステップ１０３において文節解析部１５によって形態素を文節にまとめ上げ、ステップ１０４において依存関係解析部１６によって文節間の依存関係を決定し、ステップ１０５において非循環有向グラフ生成部１７によってテキストから階層を許した非循環有向グラフへの変換を行って、生成した非循環有向グラフをグラフ格納部１８に格納する。このようにして、テキストＡ及びテキストＢにそれぞれ対応する非循環有向グラフＡ及び非循環有向グラフＢがグラフ格納部１８に格納されると、類似度計算部１９は、グラフ格納部１８からこれら非循環有向グラフＡ，Ｂを取り出して、両方の非循環有向グラフ間の類似度を計算する。この計算に際しては、ソート処理部２１が、非循環有向グラフＡ，Ｂのノードを計算順序でソートし、再帰計算部２２が、ソートされた順序にしたがって上述したように再帰計算を行う。
【００３９】
【実施例】
以下、実例を挙げて、本発明によるテキスト間類似度の計算を説明する。ここでは、「私は車を買った」、「私の買った車」、「私は家を買う」という３つの入力テキストＴ₁〜Ｔ₃について、任意の２者間の類似度を計算する場合を例に挙げて説明する。
【００４０】
最初に、「私は車を買った」、「私の買った車」、「私は家を買う」の各テキストＴ₁〜Ｔ₃を、階層を許した非循環有向グラフで記述する方法を説明する。ここで「階層を許した」とは、上述したように、（下位階層の）非循環有向グラフがグラフのノードとして許されることを意味する。
【００４１】
まず、これらの入力テキストに対して形態素解析を行い、品詞を付与する（図５のステップ１０２）。その結果、図６に示すような結果が得られる。なお、活用語はその終止形で示されている。例えば「私は車を買った」のテキストＴ₁は、「私（名詞）＋は（助詞）＋車（名詞）＋を（助詞）＋買う（動詞）＋た（助動詞）」のように品詞が付与される。
【００４２】
次に、これらの形態素を文節単位にまとめ上げる（図５のステップ１０３）。その結果、図７に示すような結果が得られる。文節も、テキスト内での意味的なまとまりである。文節を［・］で表わすものとすると、「私は車を買った」のテキストであれば、「［私（名詞）＋は（助詞）］＋［車（名詞）＋を（助詞）］＋［買う（動詞）＋た（助動詞）］」と文節にまとめあげられる。これらが、階層を許した非循環有向グラフのノード及びその属性を構成する。具体的には、各文節がそれぞれ非循環有向グラフのノードとなるとともに、文節ごとに、形態素（単語）をノードとする下位階層の非循環有向グラフが構成されることになる。
【００４３】
次に、ここで作成した形態素及び文節（ノード）間の依存関係を決定することによって（図５のステップ１０４）、入力テキストに対する階層を許した非循環有向グラフを生成する（図５のステップ１０５）。ここでは、文節間依存情報を用いた例を示す。上述した入力テキストに対応する、階層を許した非循環有向グラフとして、最終的に図８に示した非循環有向グラフＧ₁〜Ｇ₃が得られる。これらの各非循環有向グラフＧ₁〜Ｇ₃において、自己に対する類似度すなわち、Ｋ(Ｇ₁,Ｇ₁)，Ｋ(Ｇ₂,Ｇ₂)，Ｋ(Ｇ₃,Ｇ₃)は、それぞれ、９９，１０７．８７５，５０である。なお、形態素解析による品詞の付与と、文節へのまとめあげと、形態素及び文節間の依存関係の決定とは、この技術分野において周知の技術であるから、その詳細な手順については説明しない。
【００４４】
「私は車を買った」のテキストであれば、文節「［私（名詞）＋は（助詞）］がノードｎ１、その文節中の「私」、「は」がそれぞれノードｎ２，ｎ３、文節［車（名詞）＋を（助詞）］がノードｎ４、その文節中の「車」、「を」がそれぞれノードｎ５，ｎ６、文節［買う（動詞）＋た（助動詞）］がノードｎ７、その文節中の「買う」、「た」がそれぞれノードｎ８，ｎ９となっている。そして、これらノード間に、ｎ１はｎ７に係り、ｎ４もｎ７に係り、ｎ２はｎ３に係り、ｎ５はｎ６に係り、ｎ８はｎ９に係る、という文法的関係が存在し、それらが階層を許した非循環有向グラフとして表わされる。
【００４５】
次に、このようにして入力テキスト「私は車を買った」、「私の買った車」、「私は家を買う」のそれぞれに対する、階層を許した非循環有向グラフＧ₁〜Ｇ₃が得られたとして、これらの入力テキスト間の類似度を計算する例を説明する。ここで、表１〜表３は、各ノードの
【００４６】
【数８】

【００４７】
の値を示す。実際の計算では、ソートされたノード順に計算することで、効率的に、再帰式をすることが可能となる。また、ペナルティλ＝０．５として計算を行った。
【００４８】
【表１】

【００４９】
【表２】

【００５０】
【表３】

【００５１】
その結果、各入力テキスト間相互の類似度として、
【００５２】
【数９】

【００５３】
が得られる。
【００５４】
以上本発明の好ましい実施の形態について説明したが、本発明に基づく類似度計算装置は、一般には、コンピュータおよびその上で動作するソフトウェアによって実現される。すなわち、上述した類似度計算装置を実現するためのプログラムを、コンピュータに読込ませ、そのプログラムを実行させることによって、本発明による類似度計算装置が実現され、また本発明の類似度計算方法が実行される。これらのプログラムは、磁気テープやＣＤ−ＲＯＭなどの記録媒体によって、あるいはネットワークを介して、コンピュータに読込まれるものである。
【００５５】
以上説明した実施の形態では、狭義のテキスト、すなわち自然言語における単語の意味のあるつながりを、テキストとして扱っている。しかしながら本発明の適用先はこれに限られるものではない。すなわち本発明は、広義のテキスト、すなわち非循環有向グラフで記述可能な離散データに対して広く適用することが可能である。このような離散データで表現されるオブジェクトとは、例えば、文書、遺伝子配列、タンパク質におけるアミノ酸配列、量子化後の音声データ、画像、様々な形式のデータベースなど、そのものが何かの意味を持つ対象（オブジェクト）である。本発明は、これら一つ一つのオブジェクトを表わす情報を、階層を許した非循環有向グラフで記述可能であれば、適用することが可能である。逆に言えば、例で示したオブジェクト以外でも、階層を許した非循環有向グラフで記述可能な構造をもったオブジェクトであれば、どのようなオブジェクトに対しても類似度を計算することが可能である。
【００５６】
【発明の効果】
以上説明したように本発明は、階層を許した非循環有向グラフを用いることにより、これまで扱いが困難であるとされていた構造を考慮した対象の比較を容易に、かつ高速に計算することが可能となる。また、テキストの構造を非循環有向グラフで表わすことにより、自然な形でテキストの構造を記述することができる。本発明を用いることにより、構造を考慮したテキストの類似度計算を高速かつ実用的な時間で計測することが可能となり、実用システムへの応用が実質的に可能となる。
【図面の簡単な説明】
【図１】単語ベクトルのコサイン距離による類似度計算方法を説明する図である。
【図２】テキストにおける構造の一例を示す図である。
【図３】階層を許した非循環有向グラフの部分パスの一例を示す図である。
【図４】本発明の実施の一形態の類似度計算装置の構成を示すブロック図である。
【図５】テキスト間の類似度の計算手順を示すフローチャートである。
【図６】形態素解析の結果を示す図である。
【図７】文節単位へのまとめ上げの結果を示す図である。
【図８】得られた非循環有向グラフを示す図である。
【符号の説明】
１１テキスト入力部
１２テキスト格納部
１４形態素解析部
１５文節解析部
１６依存関係解析部
１７非循環有向グラフ生成部
１８グラフ格納部
１９類似度計算部
２１ソート処理部
２２再帰計算部
１０１〜１０７ステップ
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a method and apparatus for calculating similarity between input texts.
[0002]
[Prior art]
There is growing interest in techniques that efficiently calculate how similar two texts are in structure, meaning, and content. For example, the text classification task aims to use a computer to group texts with similar characteristics into a single category. In other words, a similarity measure indicating how “alike” the texts are can be regarded as a key element of text classification. Likewise, text-based question answering aims to extract, from the set of texts being searched, the texts most similar to a question, so calculating the similarity between texts plays an important role there as well. Thus, in the field of text processing, many applications require a measure of similarity between texts.
[0003]
One method of representing the characteristics of a text is to treat each word occurring in the text as one element of a vector, use the number of occurrences of the word as its value, and thereby convert the given text into an n-dimensional vector (n: number of words). This method of representing the features of a text as a vector whose elements are the occurring words is called “bag of words”. In other words, it assumes that a text is characterized by the set of words it contains. Representing a text by such a word vector is currently the most commonly used method for similarity calculations such as text classification (Non-patent Document 1).
[0004]
The most common and efficient method for calculating the similarity between texts is to calculate the inner product or the cosine distance of the n-dimensional word vectors obtained from the texts. FIG. 1 shows the similarity calculation method based on the cosine distance of word vectors. The specific calculation formula is given by the following equation.
[0005]
[Equation 1]

[0006]
FIG. 1 shows, for the texts T1 to T3, namely “私は車を買った” (I bought a car), “私の買った車” (the car I bought), and “私は家を買う” (I buy a house), the base form and part of speech obtained for each word (surface form) in the text, together with the similarities calculated from the word vectors. When the word vectors are built from base forms only, the similarity between T1 and T2 (T1 * T2) is 0.730; likewise, T2 * T3 = 0.4 and T1 * T3 = 0.730. When the similarity is calculated using both base forms and parts of speech, T1 * T2 = 0.866, T2 * T3 = 0.694, and T1 * T3 = 0.868.
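The base-form-only figures can be checked with a short sketch. The segmentation of T1 follows the morphological analysis given later in the example; the segmentations of T2 and T3, and the names `TEXTS` and `cosine`, are assumptions made for illustration, since FIG. 1 itself is not reproduced here.

```python
import math
from collections import Counter

# Base forms of the three example texts, segmented by hand for illustration.
# A real system would obtain them from a morphological analyzer.
TEXTS = {
    "T1": ["私", "は", "車", "を", "買う", "た"],  # 私は車を買った
    "T2": ["私", "の", "買う", "た", "車"],        # 私の買った車 (assumed segmentation)
    "T3": ["私", "は", "家", "を", "買う"],        # 私は家を買う (assumed segmentation)
}


def cosine(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


for x, y in [("T1", "T2"), ("T2", "T3"), ("T1", "T3")]:
    print(x, y, round(cosine(TEXTS[x], TEXTS[y]), 3))
```

Under this segmentation T1 and T2 share four base forms, so their similarity is 4/√30 ≈ 0.730, even though one is a sentence and the other a noun phrase. This is the insensitivity to structure that the following paragraphs discuss.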
[0007]
It is generally considered that text has structure, and that this structure forms the meaning of the text. The largest structures are meaningful groupings of characters within the text, such as paragraphs, sentences, phrases, and morphemes; other structures include dependency relations between phrases, anaphoric relations, and the order in which words appear. FIG. 2 shows an example of text structure.
[0008]
As described above, the method using word vectors looks at each word occurring in the text in isolation and does not consider how the word appears in the text. That is, semantic and structural features contained in the text, such as which word the target word precedes, which word it follows, and which word it has a dependency relation with, are not considered. A similarity obtained in this way cannot be said to reflect the characteristics of the text well. The meaning and nature of a word are often determined by its dependency relations and by the semantic structure around it. As a concrete example, once structure is ignored, it becomes difficult to determine in which sense a polysemous word is used in the text.
[0009]
The structure in a text could also be converted into a vector representation. However, if, for example, combinations of words are used as vector elements, the vocabulary is tens of thousands of words and the number of combinations grows exponentially, so enumerating them is not realistic. For this reason, this method is not used in practice. There is also a method that treats only a limited set of combinations as vector elements to capture partial structure, but in the end it cannot handle complex text structure accurately.
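To give a rough sense of scale, the sketch below counts how many vector elements would be needed if every unordered combination of k words were its own element. The vocabulary size of 50,000 is an assumed stand-in for "tens of thousands", and `feature_dimensions` is a hypothetical helper name.

```python
import math

VOCABULARY_SIZE = 50_000  # assumed stand-in for "tens of thousands" of words


def feature_dimensions(vocabulary_size, k):
    """Number of unordered k-word combinations, i.e. vector elements needed."""
    return math.comb(vocabulary_size, k)


for k in (1, 2, 3):
    print(k, feature_dimensions(VOCABULARY_SIZE, k))
```

Word pairs alone already require more than a billion elements, and each additional word in the combination multiplies the count again by roughly the vocabulary size.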
[0010]
As described above, methods that use word vectors, or that extend them to consider structure to some extent, are very simple and efficient, but they can hardly be called similarity measures that sufficiently reflect the characteristics of text.
[0011]
[Non-Patent Document 1]
Salton, G., Wong, A. and Yang, C.: "A Vector Space Model for Automatic Indexing," Communications of the ACM, Vol. 18, No. 11, pp. 613-620 (1975)
[0012]
[Problems to be solved by the invention]
As explained above, at present there are two problems: it is difficult for a computer to handle the structure of text itself when evaluating text similarity, and calculating a similarity that takes structure into account is complicated and time consuming.
[0013]
Accordingly, an object of the present invention is to provide a method and apparatus that can easily calculate the similarity between texts while reflecting the structure of the texts, and furthermore to provide a method and apparatus that make the calculation of structure-aware similarity simple and computationally inexpensive.
[0014]
[Means for Solving the Problems]
To solve the first problem, namely that the structure of text is difficult to handle, the present invention proposes treating the structure of a text as an acyclic directed graph.
[0015]
That is, the similarity calculation method of the present invention includes: a step in which a text input unit inputs the texts for which similarity is to be calculated; a step in which acyclic directed graph generation means represents each input text as an acyclic directed graph that allows hierarchy; a step in which a sort processing unit sorts, into calculation order, the nodes of the acyclic directed graphs respectively corresponding to the pair of texts that are the similarity calculation targets; and a step in which a recursive calculation unit calculates recursive formulas in the sorted order. The mutual similarity of the two acyclic directed graphs corresponding to the pair of texts, given as the sum of the number of matching partial paths among all partial paths in the two graphs, is calculated by recursive calculation, and this similarity is output as the similarity between the texts.
[0016]
The similarity calculation device of the present invention includes: a text input unit that inputs texts; acyclic directed graph generation means that represents each input text as an acyclic directed graph that allows hierarchy; a graph storage unit that stores the acyclic directed graphs; a sort processing unit that sorts, into calculation order, the nodes of the acyclic directed graphs respectively corresponding to the pair of texts that are the similarity calculation targets; and a recursive calculation unit that calculates recursive formulas in the sorted order. The mutual similarity of the first and second acyclic directed graphs, given as the sum of the number of matching partial paths among all partial paths in the acyclic directed graphs respectively corresponding to the pair of texts, is calculated by recursive calculation, and this similarity is output as the similarity between the texts.
[0017]
In the present invention, the calculation of the similarity considering the text structure can be considered to be equivalent to calculating the similarity between the acyclic directed graphs. Moreover, although the structure of the text is complex, it can be described sufficiently within the constraints of the acyclic directed graph. By considering it as an acyclic directed graph in this way, it is possible to describe the text structure naturally, such as nodes as words and links as structures. That is, since it is not replaced with a one-dimensional array such as a vector expression unlike the conventional method, the structure of the text can be expressed more intuitively and directly. As can be seen from FIG. 2, the text can be captured by an acyclic directed graph that allows hierarchy.
[0018]
To solve the second problem, namely that calculating a similarity that considers text structure is complicated and time consuming, the present invention defines the formula for the similarity between texts recursively and proposes a method of calculating the overall similarity without explicitly calculating every part. Because the present invention regards a text as an acyclic directed graph, the weighted sum of the number of matching partial paths among all partial paths in two acyclic directed graphs is treated as the similarity between the two graphs. FIG. 3 shows an example of the partial paths of an acyclic directed graph that allows hierarchy, as used in the present invention. Here, a single node also counts as a partial path. In addition, when calculating the degree of match between paths, matches that tolerate differences in, or expansion and contraction of, the intermediate nodes other than the start and end points are also counted. In such cases, however, a penalty λ is applied when calculating the similarity; this penalty serves as the weight in the similarity calculation. This makes it possible to calculate a flexible similarity that does not require exact matches.
[0019]
Explicitly counting the matches among all partial paths, while also tolerating label differences and the like, can be expected to become very difficult as the number of nodes grows. In the actual calculation, however, the sum of matching partial paths can be calculated efficiently by defining recursive formulas, rather than explicitly enumerating all partial paths and then counting the matching ones. Restricting the calculation to two objects at a time also makes efficient calculation possible. This can be regarded as a form of the calculation technique known as the kernel trick in kernel methods: by defining an inner product between two texts, a high-dimensional calculation is replaced by a low-dimensional inner product calculation without explicitly expanding all the elements. In the same way, the present invention can calculate the similarity efficiently from formulas defined recursively, without explicitly expanding all partial paths of the target texts.
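The idea of counting matching paths without enumerating them can be illustrated with a deliberately simplified sketch. It is not the recursive formulas of this patent: it counts only contiguous paths with exactly equal labels, and it omits the hierarchy, the tolerance for differing intermediate nodes, and the penalty λ. The graph encoding and all names (`G1`, `G2`, `similarity`, `brute_force`) are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import product

# A graph is a pair (labels, edges): labels maps node id -> label,
# edges maps node id -> list of successor node ids. Graphs must be acyclic.
G1 = ({1: "a", 2: "b", 3: "c"}, {1: [2], 2: [3], 3: []})
G2 = ({1: "a", 2: "b", 3: "d", 4: "c"}, {1: [2], 2: [3, 4], 3: [], 4: []})


def similarity(g1, g2):
    """Count pairs of identically labeled paths, one from each graph, recursively."""
    labels1, edges1 = g1
    labels2, edges2 = g2

    @lru_cache(maxsize=None)
    def k(q, r):
        # Number of matching path pairs that start at q in g1 and r in g2.
        if labels1[q] != labels2[r]:
            return 0
        return 1 + sum(k(q2, r2) for q2, r2 in product(edges1[q], edges2[r]))

    return sum(k(q, r) for q, r in product(labels1, labels2))


def paths(g):
    """Explicitly list the label sequence of every path (used only as a check)."""
    labels, edges = g

    def walk(n):
        yield (labels[n],)
        for m in edges[n]:
            for tail in walk(m):
                yield (labels[n],) + tail

    return [p for n in labels for p in walk(n)]


def brute_force(g1, g2):
    """Count matching path pairs by explicit enumeration."""
    p2 = paths(g2)
    return sum(p == q for p in paths(g1) for q in p2)


print(similarity(G1, G2), brute_force(G1, G2))
```

Because each node pair is evaluated once and cached, the recursive version never materializes the path lists that `brute_force` builds, which is the point of defining the similarity recursively.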
[0020]
The similarity between two acyclic directed graphs that allow hierarchies is given by the following equation.
[0021]
[Equation 2]

[0022]
That is, it is given as the sum of the values of the function K(·,·) over the nodes. The function K(·,·) can be defined by the following recursive formula.
[0023]
[Equation 3]

[0024]
Next,
[0025]
[Equation 4]

[0026]
is defined as follows.
[0027]
[Equation 5]

[0028]
If a node does not contain a graph, equation (6-1) applies; if it does contain a graph, equation (6-2) applies.
[0029]
Finally,
[0030]
[Equation 6]

[0031]
is calculated using the function K″(·,·), which is defined by the following equation.
[0032]
[Equation 7]

[0033]
Calculating these recursive formulas yields the same result as counting the matches among all partial paths of the acyclic directed graphs that allow hierarchy. The computational complexity is O(|G₁||G₂|), so the calculation takes time proportional to the product of the numbers of nodes in the two graphs.
[0034]
DETAILED DESCRIPTION OF THE INVENTION
Next, a preferred embodiment of the present invention will be described with reference to the drawings.
[0035]
FIG. 4 is a block diagram showing the configuration of the similarity calculation apparatus according to the embodiment of the present invention, and FIG. 5 is a flowchart showing the similarity calculation procedure in the present embodiment.
[0036]
The apparatus shown in FIG. 4 includes: a text input unit 11 that receives input text; a text storage unit 12 that stores the input text; a morpheme analysis unit 14 that performs morphological analysis on the text in the text storage unit 12; a phrase analysis unit 15 that groups morphemes into phrases based on the result of the morphological analysis by the morpheme analysis unit 14; a dependency analysis unit 16 that determines the dependency relations among the phrases obtained by the phrase analysis unit 15; an acyclic directed graph generation unit 17 that generates an acyclic directed graph that allows hierarchy, corresponding to the text being processed, based on the results of the morphological analysis, the grouping into phrases, and the dependency analysis; a graph storage unit 18 that stores the generated acyclic directed graphs; and a similarity calculation unit 19 that retrieves from the graph storage unit 18 two acyclic directed graphs corresponding to different texts, calculates the similarity between them, and outputs it. To calculate the similarity by the recursive-formula technique described above, the similarity calculation unit 19 includes a sort processing unit 21 that sorts the nodes of the acyclic directed graphs into calculation order and a recursive calculation unit 22 that performs the recursive calculation in the sorted order.
[0037]
Next, the procedure for calculating the similarity between texts in this embodiment will be described. Roughly speaking, the calculation procedure of this embodiment consists of the following steps:
(1) Select the two texts to be compared
(2) Convert each text into an acyclic directed graph that allows hierarchy
(3) Calculate the similarity between the selected texts
(3.1) Sort the nodes of each acyclic directed graph into calculation order
(3.2) Calculate the recursive formulas in the sorted order
For the recursive calculation to be as efficient as possible, the calculation order of the nodes in each acyclic directed graph that allows hierarchy must be determined in step (3.1). However, by the nature of an acyclic directed graph, a partial order is already determined. As long as this partial order is preserved, the calculation can be performed efficiently.
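One way to obtain an order that preserves the partial order in step (3.1) is a topological sort. The text does not name a particular sorting algorithm, so the sketch below is only one possible choice, using Python's standard `graphlib` module; the small `edges` graph and the function name `calculation_order` are assumptions for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges of a small example graph: node -> nodes it points to.
edges = {"n1": ["n7"], "n4": ["n7"], "n7": []}


def calculation_order(edges):
    """Return the nodes so that each node comes after every node it points to."""
    # TopologicalSorter expects node -> predecessors. Passing node -> successors
    # therefore yields successors first, which is the order needed by a
    # recursion that looks up values already computed for successor nodes.
    return list(TopologicalSorter(edges).static_order())


order = calculation_order(edges)
print(order)
```

With the nodes visited in this order, a value for each node can be filled in from values already computed for the nodes it points to, so no node has to be evaluated twice.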
[0038]
Suppose the similarity between two input texts (text A and text B) is to be calculated. As shown in FIG. 5, in step 101 the text input unit 11 receives text A and text B and stores them in the text storage unit 12. Then, for each text, the morpheme analysis unit 14 performs morphological analysis in step 102, the phrase analysis unit 15 groups the morphemes into phrases in step 103, the dependency analysis unit 16 determines the dependency relations between phrases in step 104, and the acyclic directed graph generation unit 17 converts the text into an acyclic directed graph that allows hierarchy in step 105 and stores the generated graph in the graph storage unit 18. Once acyclic directed graph A and acyclic directed graph B, corresponding to text A and text B respectively, have been stored in the graph storage unit 18 in this way, the similarity calculation unit 19 retrieves them from the graph storage unit 18 and calculates the similarity between the two graphs. In this calculation, the sort processing unit 21 sorts the nodes of acyclic directed graphs A and B into calculation order, and the recursive calculation unit 22 performs the recursive calculation described above in the sorted order.
[0039]
[Example]
The calculation of similarity between texts according to the present invention is described below using a concrete example. Here, the case of calculating the similarity between any two of the three input texts T₁ to T₃, namely “私は車を買った” (I bought a car), “私の買った車” (the car I bought), and “私は家を買う” (I buy a house), is taken as an example.
[0040]
First, the method of describing each of the texts T₁ to T₃, “私は車を買った” (I bought a car), “私の買った車” (the car I bought), and “私は家を買う” (I buy a house), as an acyclic directed graph that allows hierarchy is explained. Here, “allows hierarchy” means, as described above, that an acyclic directed graph (of a lower layer) is permitted as a node of a graph.
[0041]
First, morphological analysis is performed on these input texts and parts of speech are assigned (step 102 in FIG. 5). The result is shown in FIG. 6. Conjugated words are shown in their dictionary (terminal) form. For example, the text T₁, “私は車を買った” (I bought a car), is assigned parts of speech as “私 (noun) + は (particle) + 車 (noun) + を (particle) + 買う (verb) + た (auxiliary verb)”.
[0042]
Next, these morphemes are grouped into phrases (step 103 in FIG. 5). The result is shown in FIG. 7. A phrase is also a meaningful unit within the text. Writing a phrase as [·], the text “私は車を買った” (I bought a car) is grouped into phrases as “[私 (noun) + は (particle)] + [車 (noun) + を (particle)] + [買う (verb) + た (auxiliary verb)]”. These make up the nodes of the acyclic directed graph that allows hierarchy, together with their attributes. Specifically, each phrase becomes a node of the acyclic directed graph, and for each phrase a lower-layer acyclic directed graph is formed whose nodes are the morphemes (words).
[0043]
Next, by determining the dependency relations among the morphemes and phrases (nodes) created here (step 104 in FIG. 5), an acyclic directed graph that allows hierarchy is generated for each input text (step 105 in FIG. 5). This example uses inter-phrase dependency information. The acyclic directed graphs G₁ to G₃ shown in FIG. 8 are finally obtained as the acyclic directed graphs that allow hierarchy corresponding to the input texts above. For these graphs, the self-similarities K(G₁,G₁), K(G₂,G₂), and K(G₃,G₃) are 99, 107.875, and 50, respectively. Assigning parts of speech by morphological analysis, grouping morphemes into phrases, and determining dependency relations among morphemes and phrases are well-known techniques in this technical field, so their detailed procedures are not described.
[0044]
For the text “私は車を買った” (I bought a car), the phrase [私 (noun) + は (particle)] is node n1, and “私” and “は” within it are nodes n2 and n3; the phrase [車 (noun) + を (particle)] is node n4, and “車” and “を” within it are nodes n5 and n6; the phrase [買う (verb) + た (auxiliary verb)] is node n7, and “買う” and “た” within it are nodes n8 and n9. Among these nodes there are the grammatical relations that n1 depends on n7, n4 also depends on n7, n2 depends on n3, n5 depends on n6, and n8 depends on n9, and these relations are represented as an acyclic directed graph that allows hierarchy.
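The node layout just described can be written down as a small data structure. The sketch below is one possible encoding and not the representation used by the apparatus: the `Node` class, its field names, and the helper functions `phrase` and `all_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    label: str = ""   # base form for morpheme nodes, empty for phrase nodes
    pos: str = ""     # part of speech, or "phrase"
    children: list = field(default_factory=list)  # lower-layer graph (hierarchy)
    heads: list = field(default_factory=list)     # nodes this node depends on


def phrase(name, morphemes):
    """Build a phrase node whose lower-layer graph chains its morphemes."""
    kids = [Node(n, label, pos) for n, label, pos in morphemes]
    for a, b in zip(kids, kids[1:]):
        a.heads.append(b)  # e.g. n2 depends on n3
    return Node(name, pos="phrase", children=kids)


n1 = phrase("n1", [("n2", "私", "noun"), ("n3", "は", "particle")])
n4 = phrase("n4", [("n5", "車", "noun"), ("n6", "を", "particle")])
n7 = phrase("n7", [("n8", "買う", "verb"), ("n9", "た", "auxiliary verb")])
n1.heads.append(n7)  # n1 depends on n7
n4.heads.append(n7)  # n4 depends on n7

G1 = [n1, n4, n7]  # top-layer nodes of the graph for 私は車を買った


def all_nodes(nodes):
    """Yield every node, descending into the lower-layer graphs."""
    for node in nodes:
        yield node
        yield from all_nodes(node.children)


print([n.name for n in all_nodes(G1)])
```

Keeping the lower-layer graph inside each phrase node mirrors the statement that a graph is itself permitted as a node, while the `heads` lists carry the dependency links within each layer.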
[0045]
Next, assuming that the acyclic directed graphs G₁ to G₃ that allow hierarchy have been obtained in this way for the input texts “私は車を買った” (I bought a car), “私の買った車” (the car I bought), and “私は家を買う” (I buy a house), an example of calculating the similarity between these input texts is described. Here, Tables 1 to 3 show, for each node, the value of
[0046]
[Equation 8]

[0047]
In the actual calculation, the recursive formulas can be evaluated efficiently by processing the nodes in sorted order. The calculation was performed with the penalty λ = 0.5.
[0048]
[Table 1]

[0049]
[Table 2]

[0050]
[Table 3]

[0051]
As a result, as the mutual similarity between the input texts,
[0052]
[Equation 9]

[0053]
is obtained.
[0054]
Preferred embodiments of the present invention have been described above. The similarity calculation device according to the present invention is generally realized by a computer and software running on it. That is, by having a computer read a program for realizing the similarity calculation device described above and execute it, the similarity calculation device according to the present invention is realized and the similarity calculation method of the present invention is carried out. Such a program is loaded into the computer from a recording medium such as a magnetic tape or a CD-ROM, or via a network.
[0055]
The embodiments described above treat text in the narrow sense, that is, meaningful sequences of words in a natural language. However, the application of the present invention is not limited to this. The present invention can be applied widely to text in the broad sense, that is, to discrete data that can be described by an acyclic directed graph. Objects represented by such discrete data are things that carry some meaning in themselves, for example documents, gene sequences, amino acid sequences in proteins, quantized audio data, images, and databases of various formats. The present invention can be applied as long as the information representing each such object can be described by an acyclic directed graph that allows hierarchy. Conversely, similarity can be calculated for any object, including ones not listed above, provided it has a structure that can be described by an acyclic directed graph that allows hierarchy.
[0056]
[Effect of the Invention]
As described above, by using acyclic directed graphs that allow hierarchy, the present invention makes it possible to compare objects easily and quickly while taking into account their structure, which has been considered difficult to handle. Representing the structure of a text as an acyclic directed graph also allows that structure to be described in a natural form. With the present invention, the similarity of texts can be calculated with structure taken into account at high speed and within a practical time, which makes application to practical systems feasible.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining a similarity calculation method based on a cosine distance of a word vector.
FIG. 2 is a diagram illustrating an example of a structure in text.
FIG. 3 is a diagram illustrating an example of partial paths of an acyclic directed graph that allows a hierarchy.
FIG. 4 is a block diagram showing a configuration of a similarity calculation apparatus according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a calculation procedure of similarity between texts.
FIG. 6 is a diagram illustrating a result of morphological analysis.
FIG. 7 is a diagram illustrating a result of grouping into phrases.
FIG. 8 is a diagram showing the obtained acyclic directed graph.
[Explanation of symbols]
11 Text input unit
12 Text storage unit
14 Morphological analysis unit
15 Phrase analysis unit
16 Dependency analysis unit
17 Acyclic directed graph generation unit
18 Graph storage unit
19 Similarity calculation unit
21 Sort processing unit
22 Recursive calculation unit
101-107 Steps

Claims

A similarity calculation device for calculating the similarity between texts,
A text input unit for inputting text;
Acyclic directed graph generation means for representing the input text as an acyclic directed graph that allows a hierarchy;
A graph storage unit for storing the acyclic directed graph;
A sort processing unit that sorts the nodes of the acyclic directed graph corresponding to the pair of texts that are the similarity calculation targets, in the calculation order;
A recursive calculation unit that calculates a recursive expression according to the sorted order,
Wherein the mutual similarity of the two acyclic directed graphs, given as the sum of the number of matching partial paths among all partial paths in the two acyclic directed graphs corresponding respectively to the pair of texts that are the similarity calculation targets, is calculated by recursive calculation, and that similarity is taken as the similarity between the texts.

The similarity calculation device according to claim 1, wherein the acyclic directed graph generation means includes a morpheme analysis unit that performs morphological analysis on the input text, a phrase analysis unit that groups morphemes into phrases based on the analysis result of the morpheme analysis unit, and a dependency analysis unit that determines dependency relationships between the phrases grouped by the phrase analysis unit, and represents the acyclic directed graph based on the dependency relationships.

A similarity calculation method for calculating the similarity between texts,
A step in which the text input unit inputs text for similarity calculation;
A step in which the acyclic directed graph generation means represents the input text as an acyclic directed graph that allows a hierarchy;
A step of sorting the nodes of the acyclic directed graph respectively corresponding to a pair of texts that are targets of similarity calculation in a calculation order;
A step in which the recursive calculation unit calculates a recursive expression according to the sorted order; and
A step of calculating, by recursive calculation, the mutual similarity of the two acyclic directed graphs, given as the sum of the number of matching partial paths among all partial paths in the two acyclic directed graphs corresponding to the pair of texts whose similarity is to be calculated, and outputting that similarity as the similarity between the texts.

The similarity calculation method according to claim 3, wherein the representing step includes a step in which the morpheme analysis unit performs morphological analysis on the input text, a step in which the phrase analysis unit groups morphemes into phrases based on the result of the morphological analysis, and a step in which the dependency analysis unit determines dependency relationships between the grouped phrases, and each acyclic directed graph is represented based on the dependency relationships.

A program for causing a computer to function as each unit of the similarity calculation device according to claim 1.

A computer-readable recording medium that stores the program according to claim 5.