JP3790187B2

JP3790187B2 - Text summarization method, apparatus, and text summarization program

Info

Publication number: JP3790187B2
Application number: JP2002147497A
Authority: JP
Inventors: 伸章廣嶋; 隆明長谷川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-05-22
Filing date: 2002-05-22
Publication date: 2006-06-28
Anticipated expiration: 2022-05-22
Also published as: JP2003337821A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストから要約文を作成するテキスト要約方法および装置に関する。
【０００２】
【従来の技術】
テキスト要約技術として従来から行われている代表的な方法は重要文抽出方法である。重要文抽出方法では、テキスト中の各文に対して、特定の重要な表現を含むかどうかなどを調べ、各文に重要度を付与する。その重要度が高い数文を選択し、それを要約とするものである。そのほかの方法としては、テキスト中の各文に対し、文中の不要な表現を削除することによりそれぞれの文を圧縮したものを要約とする文圧縮方法がある。
【０００３】
【発明が解決しようとする課題】
しかしながら、重要文抽出方法では、複数の文に重要な内容が存在するときであっても、その中から数文を選択して要約とするため、内容の一部が網羅されないという問題が生じる。また、文圧縮方法では、不要な語句を削除するという操作に限界があるため、単語数の少ない要約文を作成することができない。
【０００４】
本発明の目的は、テキストに重要な単語が少ないときには文としての読みやすさを優先し、重要な単語が多いときにはテキストの内容を優先するような、バランスを考慮した任意の単語数からなる要約文を作成できるテキスト要約方法、装置、およびプログラムを提供することにある。
【０００５】
【課題を解決するための手段】
本発明のテキスト要約装置は単語分割部と重要度付与部と部分単語列生成部とＮグラム確率付与部と要約文確率算出部と要約文出力部を有する。
【０００６】
単語分割部により、テキストを単語に分割し、各単語の品詞を決定する。重要度付与部により、各単語に当該単語の品詞に基づいて重要度を付与する。部分単語列生成部により、重要度が付与されたテキスト中の単語から任意の部分単語列を生成する。Ｎグラム確率付与部により、Ｎ個の単語が連続して出現する確率であるＮグラム確率を保持するＮグラム確率テーブルを記憶した記憶手段のＮグラム確率テーブルを参照して、前記生成された部分単語列に含まれる全単語に対してＮグラム確率を付与する。要約文確率算出部により、部分単語列の要約文らしさを表す要約文確率を、該部分単語列に含まれる各単語のＮグラム確率と、該部分単語列に含まれる各単語の重要度とから計算する。最後に、要約文出力部により、各部分単語列を構成する単語または文字の数が所定の条件に合致する部分単語列のうち要約文確率が最大となる部分単語列を要約文として選択し、出力する。
【０００７】
テキスト中の複数の文に含まれる重要な内容の単語を、単語間のつながりを考慮しながらつなげ合わせることによって、テキストに重要な単語が少ないときには文としての読みやすさを優先し、重要な単語が多いときにはテキストの内容を優先するような、バランスを考慮した任意の単語数からなる要約文を作成できる。
【０００８】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【０００９】
図１は本発明の一実施形態のテキスト要約装置の構成図である。本実施形態のテキスト要約装置は単語分割部１と重要度付与部２と部分単語列生成部３とＮグラム確率付与部４とＮグラム確率テーブルデータベース５と要約文確率算出部６と要約文出力部７で構成されている。
【００１０】
単語分割部１はテキストをキーボード等から入力し、形態素解析によりテキスト中の各文を単語に切り分け、それぞれの単語に品詞を付与する（参考文献：特開平７−２７１７９２号公報）。重要度付与部２は、品詞付きの単語に対し、単語やその品詞に応じた重要度を決定する。部分単語列生成部３は、重要度が付与されたテキスト中の単語から部分単語列を生成する。Ｎグラム確率付与部４は、部分単語列に対し、その部分単語列に含まれる連続した全単語に対して、任意の単語列が連続して出現する確率（Ｎグラム確率）を保持するＮグラム確率テーブルデータベース５を参照してＮグラム確率を付与する。要約文確率算出部６は、部分単語列の要約文らしさを表す要約文確率を、該部分単語列のＮグラム確率と、該部分単語列に含まれる単語の重要度から算出し、部分単語列に要約文確率を付与する。要約文出力部７は、複数の部分単語列の中から部分単語列を構成する単語または文字の数が所定の条件に合致する部分単語列で、要約文確率が最も高い部分単語列を要約文として選択し、ディスプレイまたはプリンタに出力する。
【００１１】
図２は本実施形態のテキスト要約装置の処理を示すフローチャートである。まず、単語分割部１により、テキストを品詞つきの単語に分割する（ステップ１１）。次に、重要度付与部２により、分割された各単語の品詞が特定の内容を表す語（名詞、動詞、形容動詞のような自立語）、つまり内容語かどうかの判定を行う（ステップ１２）。内容語であれば、ＴＦ・ＩＤＦ法などの既存の手法によって重要度を得、得られた重要度を当該単語に付与し（ステップ１３）、内容語でなければ一定の低い重要度（より具体的に少なくともいかなる内容語の重要度よりも低い値、例えば内容語の平均重要度の１／１０００程度を用いる）を当該単語に付与する（ステップ１４）。ＴＦ・ＩＤＦ法では文書内での単語の出現頻度（文書内頻度，ＴＦ）と文書集合の中でその単語が含まれる文書の数（文書間頻度，ＤＦ）を手がかりにして、ＴＦとＤＦの逆数（ＩＤＦ）との積を重要度として計算する。この計算は予め任意の文書について行ない、自立語単語ごとに重要度と対応づけられた記憶手段に記憶したテーブルを用い、入力文中の自立語単語に対応するものを与える。その場合には複数の文書について計算対象として同様に文書ごとに自立語単語ごとの重要度および出現頻度を計算しておき、記憶手段に各々保持しておく。そこで、特定の文書について他の文書よりも出現頻度の格段に高い単語について当該特定の文書において当該他の文書よりも高くなるように重要度を与えてもよい。また、入力文を構成する文書を分析して当該文書に基づいた重要度を与えてもよい。なお、重要度付与の方法は、ＴＦ・ＩＤＦ法に限定されるものではない。以上の操作をすべての単語について行う（ステップ１５）。次に、部分単語列生成部３により、Ｌ個の単語からなる部分単語列を生成する（ステップ１７）。部分単語列の語順については限定しない。次に、Ｎグラム確率付与部４により、生成された部分単語列に対して要約文確率を求める（ステップ１９）。要約文確率は、部分単語列に含まれる単語の重要度の積と、部分単語列に含まれる連続したＮ単語のＮグラム確率の積を掛け合わせたものとなる。このとき、Ｌ−１番目までの単語までが全く同一の単語で構成され、Ｌ番目の単語だけが異なる２つの部分単語列を考えると、これらの要約文確率を別々に計算するのは非常に効率が悪い。Ｌ−１番目までの単語で構成された部分単語列の要約文確率をあらかじめ求めていれば、Ｌ個の単語からなる部分単語列の要約文確率は、Ｌ−１番目までの要約文確率に、Ｌ番目の単語の重要度と、Ｌ−Ｎ＋１番目からＬ番目までの連続したＮ個の単語のＮグラム確率を掛けることによって求めることができ、計算効率がよい。例えば、Ｎを３として、部分単語列「Ａ／Ｂ／Ｃ／Ｄ／Ｅ」の要約文確率と部分単語列「Ａ／Ｂ／Ｃ／Ｄ／Ｆ」の要約文確率を計算することを考える。これらの確率を個々に計算すると、同じ乗算をする部分がでてきて効率が悪い。そこで、予め部分単語列「Ａ／Ｂ／Ｃ／Ｄ」の要約文確率を求めておく。すると「Ａ／Ｂ／Ｃ／Ｄ／Ｅ」の要約文確率は、「Ａ／Ｂ／Ｃ／Ｄ」の要約文確率と、単語「Ｅ」の重要度と、単語列「Ｃ／Ｄ／Ｅ」のＮグラム確率とを掛け合わせることにより計算できる。「Ａ／Ｂ／Ｃ／Ｄ／Ｆ」の場合も同様にして計算でき、効率がよい。このようにして考えると、Ｌ単語からなる部分単語列の要約文確率を直接求めるのではなく、１単語からなる部分単語列から順に要約文確率を求めていくという方法をとるのがよい。この例では、最終的な要約文の単語数Ｌを５、Ｎグラム確率のＮを３とし、部分単語列を構成する単語数Ｍを１とする（ステップ１６）。Ｍを１としたのは、１単語からなる部分単語列から順に要約文確率を求めていくことによる。Ｍ単語からなる部分単語列を生成し（ステップ１７）、Ｍ−Ｎ＋１番目からＭ番目までの連続したＮ個の単語のＮグラム確率を求める（ステップ１８）。一例として、Ｌ単語からなる文からＮ個の単語を取り出し、取り出された単語で単語列を構成し、その候補として_LＰ_N通りの単語列候補を構成すること、そのうち入力文と語順の一致する_LＣ_N通りを採用してもよい。Ｎグラム確率を求めるステップ１８では、Ｎグラム確率テーブルを記録したデータベース５から、上記で求められた単語列候補に対応するものを読み出す。ここで、−Ｎ＋２番目から０番目までは文頭を表す特殊な単語であるとする。この結果を用いて、要約文確率を計算する（ステップ１９）。Ｍが１の場合は、単語の重要度に、その単語が先頭に出現するＮグラム確率を掛けた値が要約文確率となり、その他の場合は、Ｍ−１番目の単語からなる部分単語列の要約文確率に、Ｍ番目の単語の重要度と、先ほどのＮグラム確率を掛け合わせた値が要約文確率となる。この
要約文確率の計算を全ての部分単語列について行う（ステップ２０）。ここで、Ｍの値が指定したＬより小さいかどうかをチェックし（ステップ２１）、小さければ部分単語列を構成する単語数Ｍの値を１増やして（ステップ２２）、要約文確率を求めるという操作を繰り返す。Ｍが指定した長さＬとなった時点で要約文確率の計算は終了し、Ｌ単語からなる部分単語列のうち最も要約文確率が高いものを要約文として出力する（ステップ２３）。
【００１２】
なお、ステップ１８で、１〜Ｎ−１個の単語からなる単語列候補も文頭にＮ−１〜１個に空白が含まれるＮグラムとみなし、Ｎグラム確率テーブルから単語列候補に対応する出現確率を読み出す。一般的に、Ｍ−１個の要約文確率に対し、末尾のＮ−１個の単語とその次の単語からなるＮグラム確率とその次の単語の重要度を掛け合わせてＭ個の要約文確率を求める。
【００１３】
また、上記では文頭から文末へ処理が進む前向きの探索であるが、後向き探索でもよく、それらの組み合わせでもよい。後ろ向き探索の場合には文末の１〜Ｎ−１個の単語からなる単語列候補も文末にＮ−１〜１個に空白が含まれるＮグラムとみなし、テーブルから単語列候補に対応する出現確率を読み出す。
【００１４】
本実施形態では単語数による制限を課しているが、文字数制限を課すことも考えられる。その場合は、一例としては以下のように要約文を求めることができる。特に言及しない限り上記の単語数による制限と共通な手段を用いるので説明を割愛する。
【００１５】
ここで、ステップ１７の後で要約候補文、つまり文頭からの部分単語列までの文字数が予め指定された文字数を越えていないかどうかを確認し、文字数を越えていない候補に対して要約文確率を計算してその部分単語列を含む要約候補文を要約文確率と対応させて記憶手段に記憶しておき、単語数を１増やしてステップ１７に戻るという操作を候補がなくなるまで繰り返す。その後、保存していた要約候補文の中から、１単語あたりの要約文確率、つまり、要約文確率の単語数によるべき乗根である幾何平均値が最も高いものを要約文として選択する。幾何平均をとるのは、要約候補文中の単語数が多くなると各単語または単語列に対応する確率は一般に１未満であることにより要約文確率が低くなる。そのため、単純に要約文確率同士を比較することができなくなることによる。
【００１６】
以下では、具体例を挙げて本実施形態を説明する。生成する要約文の長さＬを５、Ｎグラム確率のＮを３とし、要約の対象とするテキストは図３で示したものとして話を進める。まず、図３のテキストを品詞つきの単語に分割する。表１に分割された単語とその品詞を示す。
【００１７】
【表１】

例に挙げたテキストは９個の単語に分割された。これらの単語に対し、その単語が特定の品詞を持つ内容語であれば、ＴＦ・ＩＤＦ法などのキーワード抽出手法によって重要度を付与する。内容語でなければ、一定の低い値（ここでは０．０１とした）を付与する。この例では、品詞が名詞、動詞、形容詞、形容動詞であるものを内容語とした。重要度を付与した結果を表１に示す。次に、テキスト中の９個の単語のうち５個の単語からなる部分単語列を生成して要約文確率を求める。ここで簡単のため、生成する要約文の語順はもとのテキストの語順と同じであるとする。このように限定しても、なお図４に示したとおり、９個の単語から５個の単語を選ぶ方法は１２６通りもあり、もとのテキストの長さが大きくなると計算量が膨大となる。そこで、Ｍ単語からなる部分単語列の要約文確率を求めるステップで、Ｍ番目の各単語について最も要約文確率の高い部分単語列だけを候補として残すことにより候補を絞り込むという方法をとる。１単語からなる部分単語列から順に要約文確率を求める。１単語の場合の要約文確率は、単語の重要度に、その単語が文頭に出現するＮグラム確率を掛けた値となる。表２にＮグラム確率テーブルを示す。Ｎグラム確率を求める方法については、“北研二著「確率的言語モデル」東京大学出版会”などに詳しく記載されている。このテーブルに存在しないものについてはそのＮグラム確率を０．０１とする。
【００１８】
【表２】

１単語からなる部分単語列の要約文確率をそれぞれ求めると、「昨夜」の要約文確率は、「昨夜」の重要度０．１２に、「（文頭）／（文頭）／昨夜」のＮグラム確率０．２０を掛けた０．０２４となる。同様にして、「彼女」「は」「路上」「で」の要約文確率はそれぞれ０．０３６、０．０００１、０．００５、０．０００１となる。ここまでの結果を図５に示す。括弧の中の値は部分単語列における要約文確率を表す。次に、２単語からなる部分単語列の要約文確率を求める。２番目の各単語について最も要約文確率の高い部分単語列だけを候補として残す。２番目の単語が「彼女」の場合は、図４より「昨夜／彼女」しか部分単語列は考えられないので、この部分単語列を候補として残す。要約文確率は、「昨夜」の要約文確率０．０２４に「彼女」の重要度０．２４と「（文頭）／昨夜／彼女」のＮグラム確率０．１５を掛けた８．７×１０^-4となる。２番目の単語が「は」の場合は、図４より「昨夜／は」の場合と「彼女／は」の場合の２通りが考えられるので、この２つの部分単語列の要約文確率が高いほうを候補として残す。「昨夜／は」の要約文確率は３．６×１０^-5、「彼女／は」は要約文確率は１．６×１０^-4となり、「彼女／は」のほうが要約文確率が高くなるのでこれを候補として残す。２番目の単語が「路上」の場合は、「昨夜／路上」「彼女／路上」「は／路上」の３通りがあり、その中で要約文確率が最も高い「彼女／路上」を候補として残す。残りも同様にする。ここまでの結果を図６に示す。次に、３単語からなる部分単語列の要約文確率を求める。３番目の単語が「は」の場合は、「昨夜／彼女／は」しか部分単語列は考えられないので、この部分単語列を候補として残す。３番目の単語が「路上」の場合は、図４と図６より、「昨夜／彼女／路上」の場合と「彼女／は／路上」の場合の２通りが考えられるので、この２つの部分単語列の要約文確率が高いほうを候補として残す。３番目の単語が「で」の場合は、「昨夜／彼女／で」「彼女／は／で」「彼女／路上／で」の３通りがあり、その中で要約文確率が最も高い「昨夜／彼女／で」を候補として残す。残りも同様にする。ここまでの結果を図７に示す。４単語および５単語からなる部分単語列の要約文確率も同様にして求めると、図８のような結果が得られる。最終的に５個の部分単語列が得られるが、その中で要約文確率の最も高い文は「彼女／は／犬／に／噛まれた」であるので、この部分単語列を要約文として出力する。得られた要約文を図９に示す。
【００１９】
なお、テキスト要約装置の処理は専用のハードウェアにより実現されるもの以外に、その機能を実現するためのプログラムを、コンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行するものであってもよい。コンピュータ読み取り可能な記録媒体とは、フロッピーディスク、光磁気ディスク、ＣＤ−ＲＯＭ等の記録媒体、コンピュータシステムに内蔵されるハードディスク装置等の記憶装置を指す。さらに、コンピュータ読み取り可能な記録媒体は、インターネットを介してプログラムを送信する場合のように、短時間の間、動的にプログラムを保持するもの（伝送媒体もしくは伝送波）、その場合のサーバとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含む。
【００２０】
【発明の効果】
以上説明したように本発明によれば、単語の重要度でもとのテキストの内容を網羅すると同時に、単語列のＮグラム確率で要約文のつながりを維持しながら、テキスト中の単語をつなぎ合わせることにより、テキストの内容と文としての読みやすさをバランスよく考慮した要約文を生成することができる。また、生成する要約文の長さは任意に設定することができ、特に、従来手法では実現が難しい数単語からなる見出しのような要約を生成することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態のテキスト要約装置のブロック図である。
【図２】図１のテキスト要約装置の全体の処理の流れを示すフローチャートである。
【図３】要約の対象とするテキストの例を示す図である。
【図４】部分単語列の候補を示す図である。
【図５】単語数が１の場合の部分単語列の候補とその要約文確率を示す図である。
【図６】単語数が２の場合の部分単語列の候補とその要約文確率を示す図である。
【図７】単語数が３の場合の部分単語列の候補とその要約文確率を示す図である。
【図８】単語数が５の場合の部分単語列の候補とその要約文確率を示す図である。
【図９】得られた要約文を示す図である。
【符号の説明】
１単語分割部
２重要度付与部
３部分単語列生成部
４Ｎグラム確率付与部
５Ｎグラム確率テーブルデータベース
６要約文確率算出部
７要約文出力部
１１〜２３ステップ

[0001]
[Technical field of the invention]
The present invention relates to a text summarization method and apparatus for creating a summary sentence from text.
[0002]
[Prior art]
A typical conventional text summarization technique is important-sentence extraction. In this method, each sentence in the text is checked for, among other things, whether it contains specific important expressions, and an importance score is assigned to each sentence. The few sentences with the highest importance are then selected and used as the summary. Another method is sentence compression, in which each sentence in the text is compressed by deleting unnecessary expressions from it, and the compressed sentences are used as the summary.
[0003]
[Problems to be solved by the invention]
However, with important-sentence extraction, even when important content is spread across many sentences, only a few of those sentences are selected for the summary, so part of the content is not covered. With sentence compression, there is a limit to how much can be removed by deleting unnecessary words and phrases, so a summary sentence with a small number of words cannot be created.
[0004]
An object of the present invention is to provide a text summarization method, apparatus, and program capable of creating a summary sentence of an arbitrary number of words that balances two concerns: readability as a sentence is prioritized when the text contains few important words, and the content of the text is prioritized when it contains many.
[0005]
[Means for Solving the Problems]
The text summarizing apparatus of the present invention includes a word dividing unit, an importance level assigning unit, a partial word string generating unit, an N-gram probability giving unit, a summary sentence probability calculating unit, and a summary sentence output unit.
[0006]
The word dividing unit divides the text into words and determines the part of speech of each word. The importance level assigning unit assigns an importance level to each word based on the part of speech of that word. The partial word string generating unit generates arbitrary partial word strings from the words in the text to which importance levels have been assigned. The N-gram probability giving unit refers to an N-gram probability table, held in storage means, that stores N-gram probabilities (the probability that N words appear consecutively), and assigns N-gram probabilities to all words included in each generated partial word string. The summary sentence probability calculating unit calculates a summary sentence probability, which represents how likely a partial word string is to be a summary sentence, from the N-gram probability of each word included in the partial word string and the importance level of each word included in the partial word string. Finally, the summary sentence output unit selects, from among the partial word strings whose number of words or characters meets a predetermined condition, the partial word string with the maximum summary sentence probability as the summary sentence, and outputs it.
[0007]
By joining together words that carry important content from multiple sentences in the text, while taking the connections between words into account, it is possible to create a summary sentence of an arbitrary number of words that balances the two concerns: readability as a sentence is prioritized when the text contains few important words, and the content of the text is prioritized when it contains many.
[0008]
[Embodiments of the invention]
Next, embodiments of the present invention will be described with reference to the drawings.
[0009]
FIG. 1 is a block diagram of a text summarizing apparatus according to an embodiment of the present invention. The text summarizing apparatus of this embodiment consists of a word dividing unit 1, an importance level assigning unit 2, a partial word string generating unit 3, an N-gram probability giving unit 4, an N-gram probability table database 5, a summary sentence probability calculating unit 6, and a summary sentence output unit 7.
[0010]
The word dividing unit 1 receives text entered from a keyboard or the like, splits each sentence in the text into words by morphological analysis, and assigns a part of speech to each word (reference: Japanese Patent Laid-Open No. 7-271792). The importance level assigning unit 2 determines, for each word with its part of speech, an importance level according to the word and its part of speech. The partial word string generating unit 3 generates partial word strings from the words in the text to which importance levels have been assigned. The N-gram probability giving unit 4 assigns N-gram probabilities to all the consecutive words contained in a partial word string by referring to the N-gram probability table database 5, which holds the probability that a given word string appears consecutively (the N-gram probability). The summary sentence probability calculating unit 6 calculates the summary sentence probability, which represents how likely the partial word string is to be a summary sentence, from the N-gram probabilities of the partial word string and the importance levels of the words it contains, and assigns that probability to the partial word string. The summary sentence output unit 7 selects, from among the partial word strings whose number of words or characters meets a predetermined condition, the one with the highest summary sentence probability as the summary sentence, and outputs it to a display or printer.
[0011]
FIG. 2 is a flowchart showing the processing of the text summarization apparatus of this embodiment. First, the word dividing unit 1 divides the text into words with parts of speech (step 11). Next, the importance level assigning unit 2 determines whether each word is a content word, that is, whether its part of speech marks it as a word expressing specific content (an independent word such as a noun, verb, or adjectival noun) (step 12). If the word is a content word, its importance is obtained by an existing method such as the TF-IDF method and assigned to the word (step 13). If it is not a content word, a fixed low importance is assigned to the word (step 14); more specifically, a value lower than the importance of any content word is used, for example about 1/1000 of the average importance of the content words. The TF-IDF method uses two clues, the frequency with which a word occurs within a document (term frequency, TF) and the number of documents in a document collection that contain the word (document frequency, DF), and calculates the importance as the product of TF and the inverse of DF (IDF). This calculation is performed in advance on arbitrary documents, the results are stored in storage means as a table associating each independent word with its importance, and each independent word in the input text is given the corresponding table entry. In that case, the importance and the occurrence frequency of each independent word are calculated in the same way for each of a plurality of documents and held in the storage means. A word whose frequency in a specific document is markedly higher than in the other documents may then be given a higher importance in that specific document than in the others. Alternatively, the document that makes up the input text may itself be analyzed and importance assigned based on that document. Note that the method of assigning importance is not limited to the TF-IDF method.
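Steps 12 to 14 can be sketched in Python as follows. This is only an illustration under stated assumptions: the part-of-speech tag names, the function name `assign_importance`, and the representation of the document collection as a list of word sets are all hypothetical, and the score is the literal TF × (1/DF) product described above rather than any particular log-scaled TF-IDF variant.

```python
from collections import Counter

# Assumed tag names for content words; the source lists nouns, verbs,
# adjectives, and adjectival nouns as examples.
CONTENT_POS = {"noun", "verb", "adjective", "adjectival_noun"}

def assign_importance(tagged_words, corpus, low_ratio=1e-3):
    """Return (word, pos, importance) triples for one input text.

    tagged_words: list of (word, pos) pairs from the word dividing step.
    corpus: list of sets of words, one set per reference document (assumption).
    low_ratio: function words get mean content importance * low_ratio
               (the source suggests about 1/1000).
    """
    tf = Counter(word for word, _ in tagged_words)
    content_scores = {}
    for word, pos in tagged_words:
        if pos in CONTENT_POS:                            # step 12
            df = sum(1 for doc in corpus if word in doc)
            content_scores[word] = tf[word] / max(df, 1)  # step 13: TF x 1/DF
    mean = sum(content_scores.values()) / len(content_scores) if content_scores else 1.0
    low = mean * low_ratio                                # step 14
    return [
        (word, pos, content_scores[word] if pos in CONTENT_POS else low)
        for word, pos in tagged_words
    ]
```

The fixed low value here follows the 1/1000-of-the-mean example; a stricter variant would also check that it stays below the smallest content-word score.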
The above operation is performed for all words (step 15). Next, the partial word string generating unit 3 generates partial word strings of L words (step 17). The word order of a partial word string is not restricted. Next, the N-gram probability giving unit 4 obtains the summary sentence probability of each generated partial word string (step 19). The summary sentence probability is the product of the importance levels of the words in the partial word string multiplied by the product of the N-gram probabilities of each run of N consecutive words in the partial word string. Now consider two partial word strings that consist of exactly the same words up to the (L−1)-th word and differ only in the L-th word: calculating their summary sentence probabilities separately is very inefficient. If the summary sentence probability of the partial word string up to the (L−1)-th word has already been obtained, the summary sentence probability of the L-word string can be obtained efficiently by multiplying it by the importance of the L-th word and the N-gram probability of the N consecutive words from the (L−N+1)-th to the L-th. For example, let N be 3 and consider calculating the summary sentence probabilities of the partial word strings “A / B / C / D / E” and “A / B / C / D / F”. Calculating them individually repeats the same multiplications, which is inefficient. Instead, the summary sentence probability of “A / B / C / D” is obtained in advance. The summary sentence probability of “A / B / C / D / E” can then be calculated by multiplying together the summary sentence probability of “A / B / C / D”, the importance of the word “E”, and the N-gram probability of the word string “C / D / E”.
The case of “A / B / C / D / F” can be calculated efficiently in the same way. Seen this way, rather than obtaining the summary sentence probability of an L-word partial word string directly, it is better to obtain summary sentence probabilities in order, starting from partial word strings of one word. In this example, the number of words L in the final summary sentence is 5, N for the N-gram probability is 3, and the number of words M making up the partial word string is initially 1 (step 16). M is set to 1 because the summary sentence probabilities are obtained in order starting from one-word partial word strings. A partial word string of M words is generated (step 17), and the N-gram probability of the N consecutive words from the (M−N+1)-th to the M-th is obtained (step 18). As one example, N words may be taken from a sentence of L words to form word strings, giving ${}_L P_N$ word string candidates, of which the ${}_L C_N$ candidates whose word order matches the input sentence may be adopted. In step 18, the entry corresponding to each such word string candidate is read from the database 5 in which the N-gram probability table is recorded. Here, positions −N+2 through 0 are treated as special words representing the beginning of the sentence. The summary sentence probability is calculated using this result (step 19). When M is 1, the summary sentence probability is the importance of the word multiplied by the N-gram probability that the word appears at the beginning of the sentence; otherwise, it is the summary sentence probability of the partial word string up to the (M−1)-th word multiplied by the importance of the M-th word and the N-gram probability just obtained.
This summary sentence probability calculation is performed for all partial word strings (step 20). It is then checked whether the value of M is smaller than the specified L (step 21); if it is, the number of words M making up the partial word string is increased by 1 (step 22) and the summary sentence probabilities are obtained again, and this is repeated. When M reaches the specified length L, the calculation ends, and the L-word partial word string with the highest summary sentence probability is output as the summary sentence (step 23).
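The loop of steps 16 to 23 can be sketched as below. It is a minimal illustration, not the patented implementation: the function name `summarize_exhaustive`, the `<s>` start-of-sentence marker, the dictionary layouts, and the default probability for missing table entries are assumptions, and the word order is restricted to that of the input text for simplicity. Each pass extends every stored candidate by one word, so the probability of a shared prefix is computed only once.

```python
BOS = "<s>"  # assumed marker for the special sentence-start word

def summarize_exhaustive(words, importance, ngram, length, n=3, default=0.01):
    """Return (best word list, its summary sentence probability).

    words: input text as a list of words, with length <= len(words).
    importance: dict mapping word -> importance level.
    ngram: dict mapping an n-tuple of words -> N-gram probability.
    default: probability used for n-grams missing from the table (assumption).
    """
    # Step 16: start from the empty prefix; each candidate is (indices, probability).
    candidates = [((), 1.0)]
    for _ in range(length):                       # steps 17-22, one pass per value of M
        grown = []
        for idx, prob in candidates:
            history = [BOS] * (n - 1) + [words[i] for i in idx]
            context = history[len(history) - (n - 1):]   # last N-1 words
            start = idx[-1] + 1 if idx else 0
            for j in range(start, len(words)):    # keep the original word order
                gram = tuple(context + [words[j]])        # step 18
                p = prob * importance[words[j]] * ngram.get(gram, default)  # step 19
                grown.append((idx + (j,), p))
        candidates = grown
    best_idx, best_prob = max(candidates, key=lambda c: c[1])   # step 23
    return [words[i] for i in best_idx], best_prob
```

Because every candidate is kept, the number of stored strings grows combinatorially with the text length; this version is only practical for very short inputs.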
[0012]
In step 18, a word string candidate consisting of 1 to N−1 words is also regarded as an N-gram that contains N−1 to 1 blanks at the beginning of the sentence, and the occurrence probability corresponding to the word string candidate is read from the N-gram probability table. In general, the summary sentence probability for M words is obtained by multiplying the summary sentence probability for M−1 words by the N-gram probability of the last N−1 words together with the next word, and by the importance of that next word.
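The recurrence described here can be written compactly as follows. The symbols are introduced only for illustration and do not appear in the source: $S_M$ is the summary sentence probability of the first $M$ words, $I(w)$ is the importance of word $w$, $P(\cdot)$ is the N-gram probability read from the table, and $w_k$ for $k \le 0$ stands for the sentence-start marker.

```latex
S_0 = 1, \qquad
S_M(w_1,\dots,w_M) = S_{M-1}(w_1,\dots,w_{M-1}) \cdot I(w_M) \cdot P(w_{M-N+1},\dots,w_M)

S_L(w_1,\dots,w_L) = \prod_{m=1}^{L} I(w_m) \;\cdot\; \prod_{m=1}^{L} P(w_{m-N+1},\dots,w_m)
```

The second line is the first one unrolled, and matches the description of the summary sentence probability as the product of the importance levels multiplied by the product of the N-gram probabilities.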
[0013]
The above describes a forward search in which processing proceeds from the beginning of the sentence to the end, but a backward search, or a combination of the two, may also be used. In a backward search, a word string candidate consisting of 1 to N−1 words at the end of the sentence is likewise regarded as an N-gram that contains N−1 to 1 blanks at the end of the sentence, and the occurrence probability corresponding to the word string candidate is read from the table.
[0014]
This embodiment imposes a limit on the number of words, but a limit on the number of characters is also conceivable. In that case, a summary sentence can be obtained, as one example, as follows. Unless otherwise noted, the same means as for the word-count limit are used, so their description is omitted.
[0015]
After step 17, it is checked whether the number of characters in the summary candidate sentence, that is, the partial word string counted from the beginning of the sentence, exceeds a number of characters specified in advance. For each candidate that does not exceed the limit, the summary sentence probability is calculated and the summary candidate sentence containing that partial word string is stored in the storage means together with its summary sentence probability; the number of words is then increased by 1 and processing returns to step 17. This is repeated until no candidates remain. After that, from among the stored summary candidate sentences, the one with the highest summary sentence probability per word, that is, the highest geometric mean obtained by taking the root of the summary sentence probability with the number of words as the degree, is selected as the summary sentence. The geometric mean is used because the probability corresponding to each word or word string is generally less than 1, so the summary sentence probability falls as the number of words in a candidate grows, and summary sentence probabilities of candidates with different numbers of words cannot simply be compared with one another.
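A small sketch of this selection rule follows. The function names and the candidate representation (a list of words paired with its summary sentence probability) are assumptions made for illustration; only the per-word geometric mean itself comes from the description above.

```python
def within_char_limit(words, max_chars):
    """True if the candidate's total character count does not exceed the limit."""
    return sum(len(w) for w in words) <= max_chars

def geometric_mean_probability(prob, n_words):
    """Per-word summary sentence probability: the n_words-th root of prob."""
    return prob ** (1.0 / n_words)

def pick_by_geometric_mean(candidates):
    """candidates: list of (words, summary_sentence_probability) pairs.

    Returns the pair whose per-word geometric mean is highest, so that
    candidates with different numbers of words can be compared.
    """
    return max(candidates, key=lambda c: geometric_mean_probability(c[1], len(c[0])))
```

For example, a two-word candidate with probability 0.04 has a geometric mean of 0.2, while a four-word candidate with probability 0.01 has a geometric mean of about 0.316, so the longer candidate is selected even though its raw probability is lower.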
[0016]
The present embodiment is described below using a specific example. The length L of the summary sentence to be generated is 5, N for the N-gram probability is 3, and the text to be summarized is the one shown in FIG. 3. First, the text of FIG. 3 is divided into words with parts of speech. Table 1 shows the resulting words and their parts of speech.
[0017]
[Table 1]

The example text was divided into 9 words. For each of these words, if the word is a content word with one of the specified parts of speech, an importance level is assigned by a keyword extraction method such as the TF-IDF method. If it is not a content word, a fixed low value (0.01 here) is assigned. In this example, words whose part of speech is noun, verb, adjective, or adjectival noun are treated as content words. The result of assigning importance is shown in Table 1. Next, partial word strings consisting of 5 of the 9 words in the text are generated and their summary sentence probabilities are obtained. For simplicity, the word order of the generated summary sentence is assumed to be the same as that of the original text. Even with this restriction, as shown in FIG. 4, there are 126 ways to choose 5 words from 9, and the amount of calculation becomes enormous as the original text grows longer. Therefore, in the step that obtains the summary sentence probabilities of M-word partial word strings, the candidates are narrowed down by keeping, for each possible M-th word, only the partial word string with the highest summary sentence probability. Summary sentence probabilities are obtained in order, starting from one-word partial word strings. The summary sentence probability of a one-word string is the importance of the word multiplied by the N-gram probability that the word appears at the beginning of a sentence. Table 2 shows the N-gram probability table. Methods for obtaining N-gram probabilities are described in detail in, for example, Kenji Kita, "Probabilistic Language Models" (確率的言語モデル), University of Tokyo Press. For entries not present in this table, the N-gram probability is taken to be 0.01.
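The candidate counts in this example can be checked with a few lines of Python. The function names are hypothetical. The second function is a derived observation rather than something stated in the source: when word order is preserved, the M-th word of an L-word selection from n words can only occupy n − L + 1 positions, so keeping one candidate per possible M-th word leaves at most that many candidates at each step.

```python
import math

def exhaustive_candidates(n_words: int, length: int) -> int:
    """Number of order-preserving ways to choose `length` words from `n_words`."""
    return math.comb(n_words, length)

def pruned_candidates_per_step(n_words: int, length: int) -> int:
    """Upper bound on candidates kept per step when one is kept per final word."""
    return n_words - length + 1

print(exhaustive_candidates(9, 5))       # 126
print(pruned_candidates_per_step(9, 5))  # 5
```

The value 5 agrees with the worked example, which lists five one-word candidates and ends with five partial word strings.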
[0018]
[Table 2]

When the summary sentence probability of each one-word partial word string is obtained, the summary sentence probability of “last night” (昨夜) is 0.024, which is the importance of “last night”, 0.12, multiplied by the N-gram probability of “(start of sentence) / (start of sentence) / last night”, 0.20. Similarly, the summary sentence probabilities of “she” (彼女), “wa” (は), “street” (路上), and “de” (で) are 0.036, 0.0001, 0.005, and 0.0001, respectively. The results so far are shown in FIG. 5. The values in parentheses are the summary sentence probabilities of the partial word strings. Next, the summary sentence probabilities of the two-word partial word strings are obtained. For each possible second word, only the partial word string with the highest summary sentence probability is kept as a candidate. When the second word is “she”, FIG. 4 shows that “last night / she” is the only possible partial word string, so it is kept as the candidate. Its summary sentence probability is 8.7 × 10^-4, obtained by multiplying the summary sentence probability of “last night”, 0.024, by the importance of “she”, 0.24, and the N-gram probability of “(start of sentence) / last night / she”, 0.15. When the second word is “wa”, FIG. 4 shows two possibilities, “last night / wa” and “she / wa”, so whichever has the higher summary sentence probability is kept. The summary sentence probability of “last night / wa” is 3.6 × 10^-5 and that of “she / wa” is 1.6 × 10^-4; “she / wa” is higher, so it is kept as the candidate. When the second word is “street”, there are three possibilities, “last night / street”, “she / street”, and “wa / street”, of which “she / street”, with the highest summary sentence probability, is kept. The remaining words are handled in the same way. The results so far are shown in FIG. 6.
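The arithmetic of these first two steps can be reproduced directly from the figures quoted in the text. The variable names are illustrative, and the dictionary keys are the original Japanese word strings. Multiplying the rounded inputs for 昨夜/彼女 gives 8.64 × 10^-4, slightly below the 8.7 × 10^-4 stated in the text; the gap may come from rounding in the quoted inputs, which the source does not explain.

```python
# One-word step: importance of 昨夜 (0.12) times P(start / start / 昨夜) (0.20)
p_last_night = 0.12 * 0.20

# Two-word step: previous probability x importance of 彼女 (0.24)
# x P(start / 昨夜 / 彼女) (0.15)
p_last_night_she = p_last_night * 0.24 * 0.15

# Second word は: keep whichever candidate has the higher probability
# (both values are quoted from the text).
second_word_wa = {"昨夜/は": 3.6e-5, "彼女/は": 1.6e-4}
kept = max(second_word_wa, key=second_word_wa.get)

print(round(p_last_night, 6))      # 0.024
print(round(p_last_night_she, 8))  # 0.000864
print(kept)                        # 彼女/は
```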
Next, the summary sentence probabilities of the three-word partial word strings are obtained. When the third word is “wa”, “last night / she / wa” is the only possible partial word string, so it is kept as the candidate. When the third word is “street”, FIG. 4 and FIG. 6 show two possibilities, “last night / she / street” and “she / wa / street”, so whichever has the higher summary sentence probability is kept. When the third word is “de”, there are three possibilities, “last night / she / de”, “she / wa / de”, and “she / street / de”, of which “last night / she / de”, with the highest summary sentence probability, is kept. The remaining words are handled in the same way. The results so far are shown in FIG. 7. When the summary sentence probabilities of the four-word and five-word partial word strings are obtained in the same way, the result shown in FIG. 8 is obtained. Five partial word strings are finally obtained; the one with the highest summary sentence probability is “she / wa / dog / ni / was bitten” (彼女／は／犬／に／噛まれた), so this partial word string is output as the summary sentence. The resulting summary sentence is shown in FIG. 9.
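The narrowing procedure walked through in this example can be sketched as below. This is an illustrative reading, not the definitive implementation: the function name `summarize_pruned`, the `<s>` marker, the dictionary layouts, and the default probability are assumptions. At each step it keeps, for every possible position of the M-th word, only the best partial word string ending there.

```python
BOS = "<s>"  # assumed marker for the special sentence-start word

def summarize_pruned(words, importance, ngram, length, n=3, default=0.01):
    """Return (best word list, probability), keeping one candidate per final word.

    Assumes 1 <= length <= len(words) and that word order follows the input.
    """
    # M = 1: the first word must leave room for the remaining length - 1 words.
    best = {}
    for j in range(len(words) - length + 1):
        gram = tuple([BOS] * (n - 1) + [words[j]])
        best[j] = (importance[words[j]] * ngram.get(gram, default), (j,))
    for m in range(2, length + 1):
        grown = {}
        for i, (prob, idx) in best.items():
            history = [BOS] * (n - 1) + [words[k] for k in idx]
            context = history[len(history) - (n - 1):]   # last N-1 words
            # The m-th word must leave room for the remaining length - m words.
            for j in range(i + 1, len(words) - (length - m)):
                gram = tuple(context + [words[j]])
                p = prob * importance[words[j]] * ngram.get(gram, default)
                if j not in grown or p > grown[j][0]:
                    grown[j] = (p, idx + (j,))           # keep only the best per final word
        best = grown
    prob, idx = max(best.values())
    return [words[k] for k in idx], prob
```

One design point worth noting: with N = 3 the N-gram probability depends on the two preceding words, so keeping a single candidate per final word is a heuristic and can, in principle, discard a prefix that an exhaustive search would have preferred.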
[0019]
The processing of the text summarization apparatus may be realized not only by dedicated hardware but also by recording a program for realizing its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. A computer-readable recording medium here means a recording medium such as a floppy disk, magneto-optical disk, or CD-ROM, or a storage device such as a hard disk drive built into a computer system. It also includes media that hold a program dynamically for a short time (a transmission medium or transmission wave), as when a program is transmitted over the Internet, and media that hold a program for a certain period, such as the volatile memory inside the computer system acting as the server in that case.
[0020]
[Effects of the invention]
As described above, according to the present invention, the words in the text are joined together so that the word importance levels keep the content of the original text covered while the N-gram probabilities of the word strings keep the summary sentence coherent. This makes it possible to generate a summary sentence that balances the content of the text against readability as a sentence. In addition, the length of the generated summary sentence can be set arbitrarily; in particular, a headline-like summary of only a few words, which is difficult to achieve with conventional methods, can be generated.
[Brief description of the drawings]
FIG. 1 is a block diagram of a text summarization apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the overall processing flow of the text summarizing apparatus of FIG. 1.
FIG. 3 is a diagram illustrating an example of text to be summarized.
FIG. 4 is a diagram showing candidates for partial word strings.
FIG. 5 is a diagram showing partial word string candidates and their summary sentence probabilities when the number of words is one.
FIG. 6 is a diagram showing partial word string candidates and their summary sentence probabilities when the number of words is two.
FIG. 7 is a diagram showing partial word string candidates and their summary sentence probabilities when the number of words is three.
FIG. 8 is a diagram showing partial word string candidates and their summary sentence probabilities when the number of words is five.
FIG. 9 is a diagram showing an obtained summary sentence.
[Explanation of symbols]
1 Word dividing unit
2 Importance level assigning unit
3 Partial word string generating unit
4 N-gram probability giving unit
5 N-gram probability table database
6 Summary sentence probability calculating unit
7 Summary sentence output unit
11-23 Steps

Claims

A text summarization method in which a text summarization device using a computer, comprising word dividing means, importance level assigning means, partial word string generating means, N-gram probability giving means, summary sentence probability calculating means, and summary sentence output means, creates a summary sentence from input text, the method comprising:
A word dividing step in which the word dividing means divides the text into words and determines the part of speech of each word;
An importance level assigning step in which the importance level assigning means assigns an importance level to each word based on the part of speech of that word;
A partial word string generating step in which the partial word string generating means generates an arbitrary partial word string from the words in the text to which importance levels have been assigned;
An N-gram probability giving step in which the N-gram probability giving means refers to an N-gram probability table, stored in storage means, that holds N-gram probabilities, each being the probability that given N words appear consecutively, and assigns N-gram probabilities to all words included in the generated partial word string;
A summary sentence probability calculating step in which the summary sentence probability calculating means calculates a summary sentence probability, representing how likely the partial word string is to be a summary sentence, from the N-gram probability of each word included in the partial word string and the importance level of each word included in the partial word string; and
A summary sentence output step in which the summary sentence output means selects, as the summary sentence, the partial word string with the maximum summary sentence probability from among the partial word strings whose number of words or characters meets a predetermined condition, and outputs it.

A text summarization device that creates a summary sentence from input text using a computer, the device comprising:
Word dividing means for dividing the text into words and determining the part of speech of each word;
Importance assigning means for assigning an importance to each word based on the part of speech of the word;
Partial word string generating means for generating an arbitrary partial word string from the words in the text to which importance has been assigned;
N-gram probability assigning means for referring to an N-gram probability table, stored in storage means, that holds N-gram probabilities, each being the probability that N words appear consecutively, and assigning an N-gram probability to every word included in the generated partial word string;
Summary sentence probability calculating means for calculating a summary sentence probability, which represents how likely the partial word string is to be a summary sentence, from the N-gram probability of each word included in the partial word string and the importance of each word included in the partial word string; and
Summary sentence output means for selecting, as the summary sentence, the partial word string having the maximum summary sentence probability from among the partial word strings whose number of constituent words or characters matches a predetermined condition, and outputting the selected partial word string.

A text summarization program for creating a summary sentence from input text by causing a computer to function as a text summarization device comprising word dividing means, importance assigning means, partial word string generating means, N-gram probability assigning means, summary sentence probability calculating means, and summary sentence output means, the program causing the computer to execute:
A word dividing procedure in which the word dividing means divides the text into words and determines the part of speech of each word;
An importance assigning procedure in which the importance assigning means assigns an importance to each word based on the part of speech of the word;
A partial word string generating procedure in which the partial word string generating means generates an arbitrary partial word string from the words in the text to which importance has been assigned;
An N-gram probability assigning procedure in which the N-gram probability assigning means refers to an N-gram probability table, stored in storage means, that holds N-gram probabilities, each being the probability that N words appear consecutively, and assigns an N-gram probability to every word included in the generated partial word string;
A summary sentence probability calculating procedure in which the summary sentence probability calculating means calculates a summary sentence probability, which represents how likely the partial word string is to be a summary sentence, from the N-gram probability of each word included in the partial word string and the importance of each word included in the partial word string; and
A summary selection procedure in which the summary sentence output means selects, as the summary sentence, the partial word string having the maximum summary sentence probability from among the partial word strings whose number of constituent words or characters matches a predetermined condition, and outputs the selected partial word string.
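
For illustration only, the procedures recited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the importance values, the bigram table (N = 2), the floor value for unseen bigrams, and the scoring rule (sum of log N-gram probabilities and log importances) are all hypothetical, since the claims only state that the summary sentence probability is calculated from these two quantities. The names `POS_IMPORTANCE`, `BIGRAM`, `UNSEEN`, `summary_log_prob`, and `summarize` are likewise introduced here for illustration. Word division and part-of-speech tagging are assumed to have been done already.

```python
import math
from itertools import combinations

# Hypothetical importance per part of speech (values are assumptions).
POS_IMPORTANCE = {"noun": 0.9, "verb": 0.6, "particle": 0.2}

# Hypothetical bigram table P(word | previous word); "<s>" marks the start.
BIGRAM = {
    ("<s>", "stocks"): 0.5,
    ("<s>", "Tokyo"): 0.2,
    ("stocks", "in"): 0.3,
    ("stocks", "fell"): 0.4,
    ("in", "Tokyo"): 0.3,
    ("Tokyo", "fell"): 0.2,
}
UNSEEN = 1e-4  # assumed floor for bigrams missing from the table


def summary_log_prob(words):
    """Assumed score: sum of log bigram probabilities and log importances."""
    score, prev = 0.0, "<s>"
    for surface, pos in words:
        score += math.log(BIGRAM.get((prev, surface), UNSEEN))
        score += math.log(POS_IMPORTANCE.get(pos, 0.1))
        prev = surface
    return score


def summarize(tagged_words, length):
    """Pick the best-scoring partial word string with exactly `length` words.

    Candidates keep the original word order; enumeration is exhaustive,
    so this is only practical for short inputs.
    """
    return max(combinations(tagged_words, length), key=summary_log_prob)


# Already divided and tagged input: (word, part of speech).
TAGGED = [("stocks", "noun"), ("in", "particle"),
          ("Tokyo", "noun"), ("fell", "verb")]
print([w for w, _ in summarize(TAGGED, 2)])
```

Here the predetermined condition on the number of words is taken to be an exact word count. Under these assumed tables, a word pair that is both well connected and important scores higher than a pair that is adjacent in the text but contains a low-importance particle.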