JP3925418B2

JP3925418B2 - Topic boundary determination apparatus and program

Info

Publication number: JP3925418B2
Application number: JP2003024476A
Authority: JP
Inventors: 克人別所
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-01-31
Filing date: 2003-01-31
Publication date: 2007-06-06
Anticipated expiration: 2023-01-31
Also published as: JP2004234512A

Description

[0001]
[Technical field of the invention]
The present invention relates to a topic boundary determination apparatus and program, and in particular to an apparatus and program that divide a text consisting of a plurality of sentences or words into topic sections, which are units of semantic coherence, and determine the boundaries between topics.
[0002]
[Prior art]
As a conventional technique, there is the Hearst method, which divides a text into topic units (for example, see Non-Patent Document 1 and Non-Patent Document 2). In the Hearst method, the text is divided into words and unnecessary words are removed. A window consisting of a fixed number of words is then taken before and after each word boundary, the occurrence frequency vector of the words in each window is computed, and the cosine measure between the vectors of the preceding and following windows is taken as the cohesion degree of that word boundary. A word boundary where the cohesion degree is a local minimum, or the sentence boundary nearest to it, is recognized as a topic boundary.
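The windowed cohesion computation described above can be sketched as follows. This is a minimal illustration, not the Hearst method as published: the function names `cohesion_scores` and `local_minima`, the window size, and the toy word list are all assumptions made for the example, and stop-word removal is taken as already done.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine measure between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cohesion_scores(words, window):
    """Cohesion at each word boundary i (between words[i-1] and words[i]).

    Only boundaries with a full window on both sides are scored.
    """
    scores = {}
    for i in range(window, len(words) - window + 1):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        scores[i] = cosine(left, right)
    return scores


def local_minima(scores):
    """Boundaries whose cohesion is strictly below both neighbours."""
    keys = sorted(scores)
    return [k for prev, k, nxt in zip(keys, keys[1:], keys[2:])
            if scores[k] < scores[prev] and scores[k] < scores[nxt]]


words = ["cat", "dog", "cat", "dog", "car", "bus", "car", "bus"]
scores = cohesion_scores(words, window=2)
boundaries = local_minima(scores)
```

With this toy input the only local minimum is the boundary between the fourth and fifth words, where the vocabulary changes. Because the window size is fixed, a topic shorter than the window cannot produce a clean minimum, which is the limitation the invention addresses.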
[0003]
There is also a method that acquires a vector for each word obtained by morphological analysis, takes word strings consisting of a fixed number of words before and after each word boundary, and computes, from the vectors of the words making up each string, a word-string cohesion degree that is either a similarity measure or a distance measure between the preceding and following word strings. The boundaries of the semantic paragraphs of the text are then taken to be the word boundaries where the cohesion degree is a local minimum if it is a similarity measure, or a local maximum if it is a distance measure (see Patent Document 1).
[0004]
[Patent Document 1]
Japanese Patent Laid-Open No. 2002-342324
[0005]
[Non-Patent Document 1]
Hearst, M.A.: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16 (1994).
[0006]
[Non-Patent Document 2]
Hearst, M.A.: TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages, Computational Linguistics, Vol. 23, No. 1, pp. 33-64 (1997).
[0007]
[Problems to be solved by the invention]
However, the conventional Hearst method and the method using a cohesion degree based on word vectors take windows of a fixed size before and after each word boundary. When a topic section is smaller than the window width, the vector for the window does not properly represent the meaning of that topic, and the resulting cohesion degree is not appropriate either, so small topic sections are difficult to detect.
[0008]
Moreover, because these methods judge topic changes within a local range of the text, a large topic section may be fragmented and detected as several topic sections, so large topic sections are difficult to detect as well.
[0009]
The present invention has been made in view of the above points, and an object of the present invention is to provide a topic boundary determination apparatus and program capable of improving the accuracy of detection of both small topic sections and large topic sections.
[0010]
[Means for Solving the Problems]
FIG. 1 is a principle configuration diagram of the present invention.
[0011]
The present invention (Claim 1) is a topic boundary determination apparatus that divides a text into topics, which are units of semantic coherence, and determines the boundaries between the topics, the apparatus comprising:
morphological analysis means for morphologically analyzing the text and dividing it into words;
a concept base, which is storage means in which vectors expressing the meanings of words are stored;
word vector acquisition means for acquiring, by searching the concept base, the vector corresponding to each word obtained by the morphological analysis means; and
minimum cost division acquisition means for obtaining, for each section of the word vector series, a section being the series of word vectors running from a certain word or sentence to a later word or sentence, a cost including the sum of the squared Euclidean distances between the centroid vector of the word vectors in the section and each of those word vectors, and for obtaining, for an arbitrary number of divisions, the minimum value of the sum of the costs of the sections constituting a division with that number of divisions, together with the division that attains the minimum.
[0013]
The present invention (Claim 2) further comprises optimal division acquisition means that, on the theoretical basis that the minimum value $E_j$ of the sum of the section costs for a given number of divisions decreases monotonically in the number of divisions $j$, obtains the ratio of the minimum value $E_{j-1}$ to the minimum value $E_j$, obtains the largest number of divisions $j$ for which this ratio is equal to or greater than a predetermined value, and recognizes the division corresponding to that largest $j$ as the optimal division.
[0014]
The present invention (Claim 3) is a topic boundary determination program for causing a computer to function as each means constituting the topic boundary determination apparatus according to Claim 1 or Claim 2.
[0015]
As described above, the present invention makes it possible to determine the most likely division on the basis of how valid an arbitrary section sequence is as a group of clusters. Because the invention considers arbitrary sections and judges by looking at local and global ranges of the text at the same time, the accuracy of detecting both small and large topic sections is improved.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0017]
FIG. 2 shows the configuration of a topic boundary determination apparatus according to an embodiment of the present invention. The apparatus shown in the figure comprises a morphological analysis unit 10, a word vector acquisition unit 20, a concept base 30, a minimum cost division acquisition unit 40, and an optimal division acquisition unit 50.
[0018]
The morphological analysis unit 10, the word vector acquisition unit 20, the minimum cost division acquisition unit 40, and the optimal division acquisition unit 50 are implemented by control means such as a CPU. The word vectors associated with the words, and the calculated costs associated with the corresponding combinations of divisions (sections), are stored in storage means.
[0019]
The morphological analysis unit 10 morphologically analyzes the input text and breaks it down into words tagged with parts of speech. Among the words obtained, function words (attached words) and the like are considered irrelevant to topic boundary recognition, so such unnecessary words may be deleted after the morphological analysis.
[0020]
The word vector acquisition unit 20 acquires a word vector for each word produced by the morphological analysis unit 10 by searching the concept base 30, in which a vector value representing the meaning of each word is stored.
[0021]
FIG. 3 shows an example of a concept base in an embodiment of the present invention. The concept base 30 shown in the figure is stored in storage means such as a hard disk, and an f-dimensional vector value is assigned to each word. The words in the concept base 30 are independent words such as nouns, verbs, and adjectives. The word vectors in the concept base 30 are set such that the distance between words that are semantically similar is closer, and the distance between words that are not semantically similar is longer.
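The roles of the morphological analysis unit 10 and the word vector acquisition unit 20 can be sketched with a plain dictionary standing in for the concept base 30. The part-of-speech labels, the toy vectors, the function name `lookup_word_vectors`, and the choice to skip words missing from the concept base are assumptions made for this illustration; the source does not specify how missing words are handled.

```python
# Assumed part-of-speech labels for independent (content) words.
CONTENT_POS = {"noun", "verb", "adjective"}


def lookup_word_vectors(tagged_words, concept_base):
    """Drop function words, then fetch a vector for each remaining word.

    tagged_words: list of (word, part_of_speech) pairs, as a morphological
                  analyzer might return them.
    concept_base: dict mapping a word to its f-dimensional vector.
    Words absent from the concept base are skipped (an assumption).
    """
    vectors = []
    for word, pos in tagged_words:
        if pos not in CONTENT_POS:
            continue  # unnecessary word, e.g. a particle
        vec = concept_base.get(word)
        if vec is not None:
            vectors.append((word, vec))
    return vectors


# Hypothetical 3-dimensional concept base.
concept_base = {
    "train": [0.9, 0.1, 0.0],
    "station": [0.8, 0.2, 0.1],
    "soup": [0.0, 0.1, 0.9],
}
tagged = [("train", "noun"), ("at", "particle"),
          ("station", "noun"), ("zzz", "noun")]
result = lookup_word_vectors(tagged, concept_base)
```

Here "at" is removed as a function word and "zzz" is skipped because it has no entry, leaving the vectors for "train" and "station".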
[0022]
Examples of the concept base include the databases disclosed in “Similarity Discriminating Device” of Japanese Patent Laid-Open No. 6-103315 and “Data Refining Method for Similarity Discrimination and Device for Implementing this Method” of Japanese Patent Laid-Open No. 7-302265.
[0023]
In addition, Deerwester's paper (Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R.: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, pp. 391-407 (1990)) converts a word-document co-occurrence matrix, which records the frequency of each word in each document, into a matrix of reduced dimensionality by singular value decomposition; this converted matrix is also an example of a concept base. Schutze's paper (Schutze, H.: Dimensions of Meaning, Proc. of Supercomputing '92, pp. 786-796 (1992)) converts a word-word co-occurrence matrix, which records the co-occurrence frequencies between words in a corpus, into a matrix of reduced dimensionality by singular value decomposition; this converted matrix is also an example of a concept base.
[0024]
The minimum cost division acquisition unit 40 considers each section of the word vector series, a section being the series of word vectors running from a certain word or sentence to a later word or sentence, and obtains for it a cost such as the sum of the squared Euclidean distances between the centroid vector of the word vectors in the section and each word vector. For an arbitrary number of divisions, it then obtains the minimum value of the sum of the costs of the sections constituting a division with that number of divisions, together with the division that attains the minimum. The minimum under a certain constraint condition may be used in place of the true minimum.
[0025]
The optimal division acquisition unit 50 recognizes as the optimal division a division for which the number of divisions and the values related to the minimum cost for each number of divisions, including the ratio between the minimum cost values for different numbers of divisions, satisfy a certain condition.
[0026]
Next, the operation in the above configuration will be described.
[0027]
FIG. 4 is a flowchart of topic boundary determination in one embodiment of the present invention.
[0028]
Step 101) The morphological analysis unit 10 morphologically analyzes the input text and divides it into words. Among the words obtained, function words (attached words) and the like are considered irrelevant to topic boundary recognition, so such unnecessary words may be deleted after the morphological analysis.
[0029]
Step 102) The word vector acquisition unit 20 searches the concept base 30, in which vectors expressing the meanings of words are stored, and acquires the vector corresponding to each word obtained by the morphological analysis unit 10.
[0030]
Step 103) The minimum cost division acquisition unit 40 considers each section of the word vector series, a section being the series of word vectors running from a certain word or sentence to a later word or sentence, and obtains for it a cost such as the sum of the squared Euclidean distances between the centroid vector of the word vectors in the section and each word vector. For an arbitrary number of divisions, it then obtains the minimum value, or the minimum value under a certain constraint condition, of the sum of the costs of the sections constituting a division with that number of divisions, together with the division that attains that value. For example, let the series of word vectors of the input text be
$w_1, w_2, \ldots, w_x$ ... (Equation 1)
[0031]
and let the series of sentences of the input text be
$s_1, s_2, \ldots, s_g$ ... (Equation 2)
[0032]
Assuming that a topic boundary is always a sentence boundary, any section sequence dividing the input text is a division of the sentence numbers $1 \le h \le g$ of the form
$T_1 = (1, 2, \ldots, n_2 - 1)$,
$T_2 = (n_2, n_2 + 1, \ldots, n_3 - 1)$,
・
・
$T_i = (n_i, n_i + 1, \ldots, n_{i+1} - 1)$,
・
・
$T_j = (n_j, n_j + 1, \ldots, g)$ ... (Equation 3)
Each sentence $s_h$ is assumed to consist of the following series of word vectors.
[0033]
[Expression 1]
[0034]
Here, when it is assumed that the topic boundary is not necessarily a sentence boundary, each sentence may be regarded as a word in the above (Equation 2).
[0035]
The cost $c(n_i, n_{i+1} - 1)$ of the section $T_i = (n_i, n_i + 1, \ldots, n_{i+1} - 1)$ is defined as
[0036]
[Expression 2]

[0037]
[Expression 3]

Here,
[0038]
[Expression 4]

is the centroid vector of the word vectors in the section $T_i$, and $c(n_i, n_{i+1} - 1)$ is the sum of the squared Euclidean distances between that centroid vector and each word vector.
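The formula images in paragraphs [0036] to [0038] are not reproduced in this text. One form consistent with the surrounding description is sketched below. The symbols $m_h$ (the number of word vectors in sentence $s_h$), $w_{h,k}$ (the $k$-th word vector of $s_h$), $N_i$ (the number of word vectors in section $T_i$), and $\bar{w}_i$ (the centroid) are assumed notation and are not taken from the original figures.

```latex
N_i = \sum_{h=n_i}^{n_{i+1}-1} m_h, \qquad
\bar{w}_i = \frac{1}{N_i} \sum_{h=n_i}^{n_{i+1}-1} \sum_{k=1}^{m_h} w_{h,k}, \qquad
c(n_i,\, n_{i+1}-1) = \sum_{h=n_i}^{n_{i+1}-1} \sum_{k=1}^{m_h}
  \left\lVert w_{h,k} - \bar{w}_i \right\rVert^2
```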
[0039]
The cost $e(T_1, T_2, \ldots, T_j)$ of the division $T_1, T_2, \ldots, T_j$ is defined as
[0040]
[Expression 5]

When each section is regarded as a cluster, $e(T_1, T_2, \ldots, T_j)$ is the so-called intra-cluster variation.
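The section cost and the division cost, taken as the sum of the section costs as the text states, can be sketched as follows. The function names and the data layout (a sentence is a list of word vectors, and a division is given by the 1-based first sentence number of each section) are assumptions made for the example.

```python
def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def section_cost(vectors):
    """Sum of squared Euclidean distances from each vector to the centroid."""
    c = centroid(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors)


def division_cost(sentences, starts):
    """Cost of a division: the sum of the costs of its sections.

    sentences: list of sentences, each a list of word vectors.
    starts: 1-based first sentence number of each section (starts[0] == 1).
    """
    bounds = [n - 1 for n in starts] + [len(sentences)]
    total = 0.0
    for a, b in zip(bounds, bounds[1:]):
        words = [v for s in sentences[a:b] for v in s]
        total += section_cost(words)
    return total


sentences = [
    [[0.0, 0.0], [2.0, 0.0]],  # sentence 1: two word vectors
    [[10.0, 0.0]],             # sentence 2
    [[12.0, 0.0]],             # sentence 3
]
one_section = division_cost(sentences, [1])      # everything in one section
two_sections = division_cost(sentences, [1, 2])  # sentence 1 | sentences 2-3
```

In this toy case the single-section cost is 104 and splitting after the first sentence lowers it to 4, which illustrates why the cost alone always favours more sections.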
[0041]
The sum of intra-cluster variation and inter-cluster variation is always equal to the total variation. This will be described here.
[0042]
Calculation of total variation and inter-cluster variation is not necessarily required in the present invention.
[0043]
The total variation $A$ of the word vector series (Equation 1) of the input text is defined as
[0044]
[Expression 6]

[0045]
[Expression 7]

[0046]
The inter-cluster variation $B(T_1, T_2, \ldots, T_j)$ of the section sequence (Equation 3) is defined as
[0047]
[Expression 8]
[0048]
For any section sequence (Equation 3),
$A = B(T_1, T_2, \ldots, T_j) + e(T_1, T_2, \ldots, T_j)$
... (Equation 10)
holds.
[0049]
Since the total variation $A$ is constant, the smaller the intra-cluster variation $e(T_1, T_2, \ldots, T_j)$, the larger the inter-cluster variation $B(T_1, T_2, \ldots, T_j)$, and the better the sections are separated from one another as clusters.
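The decomposition of the total variation into inter-cluster and intra-cluster variation can be checked numerically. The sketch below assumes the standard definitions: the inter-cluster variation is the size-weighted sum of squared distances from each cluster centroid to the overall centroid. The source's own formula images for $A$ and $B$ are not reproduced here, so this definition and the function names are assumptions.

```python
def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def intra_cluster(clusters):
    """e: squared distances of each vector to its own cluster centroid."""
    return sum(sq_dist(v, mean(c)) for c in clusters for v in c)


def inter_cluster(clusters):
    """B: cluster size times squared distance of its centroid to the overall centroid."""
    overall = mean([v for c in clusters for v in c])
    return sum(len(c) * sq_dist(mean(c), overall) for c in clusters)


def total_variation(clusters):
    """A: squared distances of every vector to the overall centroid."""
    everything = [v for c in clusters for v in c]
    overall = mean(everything)
    return sum(sq_dist(v, overall) for v in everything)


clusters = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 0.0], [12.0, 0.0]]]
A = total_variation(clusters)
B = inter_cluster(clusters)
e = intra_cluster(clusters)
```

For these two clusters, A is 104, B is 100, and e is 4, so A equals B plus e as the identity requires.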
[0050]
It follows from (Equation 10) that the cost of a section sequence obtained by further subdividing a given section sequence is no greater than the cost before subdivision. The intra-cluster variation $e(T_1, T_2, \ldots, T_j)$ is largest when there is a single section ($j = 1$) and smallest when every section is a single sentence ($j = g$). Judging whether a group of clusters is valid is therefore meaningful only when the number of divisions is fixed.
[0051]
For an arbitrary number of divisions $j$, the minimum cost over all divisions into $j$ sections, and the division that attains it, are obtained as follows.
[0052]
Among the divisions of the sentence series $s_1, s_2, \ldots, s_h$ ($1 \le h \le g$) into $q$ sections, the one with the minimum cost is denoted $P(h, q)$.
[0053]
If
$P(h, q)$: $T_1 = (1, 2, \ldots, n_2 - 1)$, ...,
$T_{q-1} = (n_{q-1}, n_{q-1} + 1, \ldots, n_q - 1)$,
$T_q = (n_q, n_q + 1, \ldots, h)$ ... (Equation 11)
then, because the cost of a section sequence is the sum of the costs of its sections,
$P(n_q - 1, q - 1)$: $T_1 = (1, 2, \ldots, n_2 - 1)$, ...,
$T_{q-1} = (n_{q-1}, n_{q-1} + 1, \ldots, n_q - 1)$
... (Equation 12)
holds.
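The property above says that removing the last section from a minimum-cost division of the first $h$ sentences into $q$ sections leaves a minimum-cost division of the preceding sentences into $q - 1$ sections. A recurrence of the following form follows from it. This is a sketch consistent with (Equation 17) and (Equation 18) later in the text rather than a quotation of the formula images; $t$, the candidate first sentence number of the last section, is assumed notation, and $C(r, s)$ is the cost of the section $(r, \ldots, s)$.

```latex
e(P(h,1)) = C(1,h), \qquad
e(P(h,q)) = \min_{q \le t \le h} \bigl\{ e(P(t-1,\,q-1)) + C(t,h) \bigr\}, \qquad
t_{h,q} = \operatorname*{arg\,min}_{q \le t \le h} \bigl\{ e(P(t-1,\,q-1)) + C(t,h) \bigr\}
\quad (2 \le q \le h \le g)
```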
[0054]
Using this property, $P(g, j)$ and $e(P(g, j))$ can be obtained efficiently for an arbitrary number of divisions $j$ by the following dynamic programming.
(1) For all $r, s$ with $1 \le r \le s \le g$, compute the cost $C(r, s)$ of the section $(r, r + 1, \ldots, s)$.
[0055]
(2) Obtain $e(P(h, 2))$ ($2 \le h \le g$) as
[0056]
[Expression 9]

and store the following.
[0057]
[Expression 10]

[0058]
(3) For each number of divisions $3 \le q \le g$, obtain $e(P(h, q))$ ($q \le h \le g$) as
[0059]
[Expression 11]

and store the following.
[0060]
[Expression 12]

[0061]
(4) For each number of divisions $2 \le j \le g$, obtain the division $P(g, j)$.
[0062]
From
$P(h, 2) = (1, \ldots, t_{h,2} - 1), (t_{h,2}, \ldots, h)$ ($2 \le h \le g$)
... (Equation 17)
$P(h, q) = P(t_{h,q} - 1, q - 1), (t_{h,q}, \ldots, h)$
($3 \le q \le g$, $q \le h \le g$)
... (Equation 18)
the first sentence number of the last section of the desired section sequence can be obtained, so the section sequence of the division $P(g, j)$ can be recovered by tracing back from the last section.
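Steps (1) to (4) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `min_cost_divisions`, the table layout, and the tie-breaking rule (the smallest start position wins) are choices made for the example, and the section cost is recomputed from scratch for simplicity rather than efficiency.

```python
def section_cost(vectors):
    """Sum of squared Euclidean distances to the centroid."""
    n = len(vectors)
    c = [sum(col) / n for col in zip(*vectors)]
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors)


def min_cost_divisions(sentences):
    """Return {j: (E_j, starts)} for j = 1..g.

    sentences: list of sentences, each a list of word vectors.
    starts: 1-based first sentence number of each section of P(g, j).
    """
    g = len(sentences)
    # Step (1): C[r][s] = cost of the section (r, ..., s), 1-based.
    C = [[0.0] * (g + 1) for _ in range(g + 1)]
    for r in range(1, g + 1):
        words = []
        for s in range(r, g + 1):
            words.extend(sentences[s - 1])
            C[r][s] = section_cost(words)
    # e[h][q] = minimum cost of dividing sentences 1..h into q sections;
    # t[h][q] = first sentence number of the last section of that division.
    inf = float("inf")
    e = [[inf] * (g + 1) for _ in range(g + 1)]
    t = [[0] * (g + 1) for _ in range(g + 1)]
    for h in range(1, g + 1):
        e[h][1], t[h][1] = C[1][h], 1
    # Steps (2) and (3): fill the table for q = 2..g.
    for q in range(2, g + 1):
        for h in range(q, g + 1):
            for start in range(q, h + 1):
                cand = e[start - 1][q - 1] + C[start][h]
                if cand < e[h][q]:
                    e[h][q], t[h][q] = cand, start
    # Step (4): trace back the last-section starts to recover P(g, j).
    result = {}
    for j in range(1, g + 1):
        starts, h = [], g
        for q in range(j, 0, -1):
            starts.append(t[h][q])
            h = t[h][q] - 1
        result[j] = (e[g][j], starts[::-1])
    return result


# Four one-word sentences with one-dimensional vectors.
sentences = [[[0.0]], [[1.0]], [[10.0]], [[12.0]]]
divisions = min_cost_divisions(sentences)
```

For this input the best two-section division starts its sections at sentences 1 and 3, with cost 2.5, and the best three-section division starts them at sentences 1, 3, and 4, with cost 0.5.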
[0063]
Step 104) The optimal division acquisition unit 50 recognizes as the optimal division a division for which the number of divisions and the values related to the minimum cost for each number of divisions, including the ratio between the minimum cost values for different numbers of divisions, satisfy a certain condition.
[0064]
The minimum cost $e(P(g, j))$ for the number of divisions $j$ is denoted $E_j$. By (Equation 10) above, $E_j$ decreases monotonically as $j$ increases.
[0065]
As one example of optimal division recognition, let $\mu$ be the mean and $\sigma$ the standard deviation of the $E_j$, take the threshold $\alpha = \mu + z\sigma$ ($z$: for example, 1.5), and, among the $j$ satisfying
[0066]
[Expression 13]

recognize the division $P(g, j)$ corresponding to the largest such $j$ as the optimal division.
[0067]
The ratio between minimum cost values
[0068]
[Expression 14]

generally tends to decrease monotonically as $j$ increases. When a division almost coincides with the topic division, increasing the number of divisions further hardly reduces the cost, and the ratio stays close to 1. As another example of optimal division recognition, a threshold $\alpha$ (for example, 1.0005) is taken, and among the $j$ satisfying $R_j \ge \alpha$, the division $P(g, j)$ corresponding to the largest $j$ is recognized as the optimal division.
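The ratio-based selection can be sketched as follows, taking $R_j = E_{j-1} / E_j$ as stated in Claim 2. The function name `select_by_ratio`, the fallback to a single section when no ratio reaches the threshold, and the handling of a zero cost are assumptions made for this illustration; the source does not cover those cases.

```python
import math


def select_by_ratio(E, alpha=1.0005):
    """Return the largest j with R_j = E[j-1] / E[j] >= alpha.

    E: dict mapping the number of divisions j to the minimum cost E_j.
    Falls back to 1 (no split) if no ratio reaches alpha (an assumption).
    """
    best = 1
    for j in sorted(E):
        if j - 1 not in E:
            continue  # R_j is undefined without E_{j-1}
        if E[j] == 0:
            # Assumed convention: an exact zero cost counts as an
            # unbounded improvement unless the previous cost was zero too.
            ratio = math.inf if E[j - 1] > 0 else 1.0
        else:
            ratio = E[j - 1] / E[j]
        if ratio >= alpha:
            best = j
    return best


# Hypothetical minimum costs: a large drop at j = 2, then almost none.
E = {1: 100.0, 2: 10.0, 3: 9.999, 4: 9.998}
chosen = select_by_ratio(E)
```

With these values only $R_2 = 10$ reaches the threshold, because $R_3$ and $R_4$ are about 1.0001, so two sections are selected.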
[0069]
As yet another example, let $\mu$ be the mean and $\sigma$ the standard deviation of the $R_j$, take the threshold $\alpha = \mu + z\sigma$ ($z$: for example, 0.8), and, among the $j$ satisfying
[0070]
[Expression 15]

recognize the division $P(g, j)$ corresponding to the largest such $j$ as the optimal division.
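A threshold derived from the mean and standard deviation of the ratios can be sketched as follows. The formula image giving the exact condition is not reproduced in this text, so the condition $R_j \ge \alpha$ used here is an assumption, as are the function name `select_by_mean_std` and the use of the population standard deviation.

```python
import statistics


def select_by_mean_std(R, z=0.8):
    """Pick the largest j whose ratio R_j reaches alpha = mean + z * std.

    R: dict mapping j to the ratio R_j.
    Returns (j, alpha); j is None if no ratio reaches the threshold.
    The condition R_j >= alpha and the population standard deviation
    are assumptions; the source does not spell them out.
    """
    values = list(R.values())
    alpha = statistics.mean(values) + z * statistics.pstdev(values)
    candidates = [j for j, r in R.items() if r >= alpha]
    return (max(candidates) if candidates else None), alpha


# Hypothetical ratios: one large drop in cost at j = 2.
R = {2: 10.0, 3: 1.0, 4: 1.0, 5: 1.0}
chosen, alpha = select_by_mean_std(R)
```

Here the mean is 3.25 and the standard deviation is about 3.9, giving a threshold of about 6.37, which only $R_2$ exceeds. Unlike a fixed threshold, this one adapts to the spread of the ratios in the particular text.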
[0071]
This completes the description of the series of steps. In step 103 above, the cost $C(n_i, n_{i+1} - 1)$ of the section $T_i = (n_i, n_i + 1, \ldots, n_{i+1} - 1)$ may instead be taken, as follows, to be the sum of the Manhattan distances between the median of the word vectors in the section (the vector obtained by taking the median of each coordinate) and each word vector, where the Manhattan distance is the sum over all coordinates of the absolute differences of the coordinate values.
[0072]
[Expression 16]

[0073]
[Expression 17]
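The median-based alternative cost can be sketched as follows. The function names are assumptions, and `statistics.median` averages the two middle values when a section holds an even number of vectors, which is one possible convention that the source does not specify.

```python
import statistics


def median_vector(vectors):
    """Coordinate-wise median of a non-empty list of equal-length vectors."""
    return [statistics.median(col) for col in zip(*vectors)]


def manhattan_cost(vectors):
    """Sum of Manhattan distances from each vector to the median vector."""
    m = median_vector(vectors)
    return sum(sum(abs(x - c) for x, c in zip(v, m)) for v in vectors)


vectors = [[0.0, 0.0], [1.0, 5.0], [10.0, 1.0]]
med = median_vector(vectors)
cost = manhattan_cost(vectors)
```

For these three vectors the median vector is (1, 1) and the cost is 2 + 4 + 9 = 15. The median and absolute differences make this cost less sensitive to a single outlying word vector than the squared Euclidean version.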

Also in step 103 above, for an arbitrary number of divisions, a division with that number of sections whose cost is not the true minimum but is minimal under a certain constraint condition, together with its cost, can be obtained in less time than with the dynamic programming described earlier. Methods (1) and (2) below are examples.
[0074]
(1) Find the minimum-cost division of the sentence numbers $1 \le h \le g$ into two sections. Next, keeping that division position fixed, find the minimum-cost division of the sentence numbers into three sections. Continue in the same way, each time keeping the division positions obtained so far fixed and adding the one further division position that minimizes the cost.
[0075]
(2) Start with the sentence numbers $1 \le h \le g$ divided into $g$ clusters of one sentence each. Merge two adjacent clusters into one so that the resulting division has the minimum cost. Continue in the same way, never splitting the clusters obtained so far and each time merging the two adjacent clusters whose merger gives the minimum cost.
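The two constrained methods above can be sketched as greedy procedures. The function names `greedy_top_down` and `greedy_bottom_up`, the half-open 0-based bounds used internally, and the tie-breaking rule (the first candidate found wins) are assumptions made for the example; both functions return the 1-based first sentence number of each section.

```python
def section_cost(vectors):
    """Sum of squared Euclidean distances to the centroid."""
    n = len(vectors)
    c = [sum(col) / n for col in zip(*vectors)]
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors)


def span_cost(sentences, a, b):
    """Cost of the section holding sentences[a:b] (0-based, half-open)."""
    return section_cost([v for s in sentences[a:b] for v in s])


def greedy_top_down(sentences, j):
    """Method (1): add one boundary at a time, keeping earlier ones fixed."""
    bounds = [0, len(sentences)]
    while len(bounds) - 1 < j:
        best = None  # (cost reduction, cut position)
        for a, b in zip(bounds, bounds[1:]):
            whole = span_cost(sentences, a, b)
            for cut in range(a + 1, b):
                gain = (whole - span_cost(sentences, a, cut)
                        - span_cost(sentences, cut, b))
                if best is None or gain > best[0]:
                    best = (gain, cut)
        if best is None:
            break  # every section is already a single sentence
        bounds = sorted(bounds + [best[1]])
    return [b + 1 for b in bounds[:-1]]


def greedy_bottom_up(sentences, j):
    """Method (2): start from single sentences, merge adjacent clusters."""
    bounds = list(range(len(sentences) + 1))
    while len(bounds) - 1 > max(j, 1):
        best = None  # (cost increase, index of the boundary to remove)
        for k in range(1, len(bounds) - 1):
            a, m, b = bounds[k - 1], bounds[k], bounds[k + 1]
            increase = (span_cost(sentences, a, b)
                        - span_cost(sentences, a, m)
                        - span_cost(sentences, m, b))
            if best is None or increase < best[0]:
                best = (increase, k)
        del bounds[best[1]]
    return [b + 1 for b in bounds[:-1]]


sentences = [[[0.0]], [[1.0]], [[10.0]], [[12.0]]]
top2 = greedy_top_down(sentences, 2)
top3 = greedy_top_down(sentences, 3)
bottom2 = greedy_bottom_up(sentences, 2)
```

On this input both procedures agree with the exact minimum-cost result, with sections starting at sentences 1 and 3 for two sections. In general they need not, because boundaries fixed or clusters merged early are never revisited.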
[0076]
The processing described above, including morphological analysis, word vector acquisition, cost calculation, and division acquisition, can be built as a program, installed from a communication line or a storage medium, and executed by control means such as a CPU.
[0077]
The present invention is not limited to the above-described embodiment, and various modifications and applications can be made within the scope of the claims.
[0078]
[Effects of the invention]
As described above, according to the present invention, the most likely division is determined on the basis of how valid an arbitrary section sequence is as a group of clusters. Because arbitrary sections are considered and the judgment looks at local and global ranges of the text at the same time, the accuracy of detecting both small and large topic sections is improved.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a configuration diagram of a topic boundary determination device according to an embodiment of the present invention.
FIG. 3 is an example of a concept base in an embodiment of the present invention.
FIG. 4 is a flowchart of topic boundary determination in an embodiment of the present invention.
[Explanation of symbols]
10 Morphological analysis means, morphological analysis unit
20 Word vector acquisition means, word vector acquisition unit
30 Concept base
40 Minimum cost division acquisition means, minimum cost division acquisition unit
50 Optimal division acquisition means, optimal division acquisition unit

Claims

A topic boundary determination apparatus that divides a text into topics, which are units of semantic coherence, and determines the boundaries between the topics, the apparatus comprising:
morphological analysis means for morphologically analyzing the text and dividing it into words;
a concept base, which is storage means in which vectors expressing the meanings of words are stored;
word vector acquisition means for acquiring, by searching the concept base, the vector corresponding to each word obtained by the morphological analysis means; and
minimum cost division acquisition means for obtaining, for each section of the word vector series, a section being the series of word vectors running from a certain word or sentence to a later word or sentence, a cost including the sum of the squared Euclidean distances between the centroid vector of the word vectors in the section and each of those word vectors, and for obtaining, for an arbitrary number of divisions, the minimum value of the sum of the costs of the sections constituting a division with that number of divisions, together with the division that attains the minimum.

The topic boundary determination apparatus according to claim 1, further comprising optimal division acquisition means that, on the theoretical basis that the minimum value $E_j$ of the sum of the section costs for a given number of divisions decreases monotonically in the number of divisions $j$, obtains the ratio of the minimum value $E_{j-1}$ to the minimum value $E_j$, obtains the largest number of divisions $j$ for which the ratio is equal to or greater than a predetermined value, and recognizes the division corresponding to that largest $j$ as the optimal division.

A topic boundary determination program for causing a computer to function as each of the means constituting the topic boundary determination apparatus according to claim 1.