JP2004334382A

JP2004334382A - Structured document summarizing apparatus, program, and recording medium

Info

Publication number: JP2004334382A
Application number: JP2003126832A
Authority: JP
Inventors: Atsuyuki Goto; 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-05-02
Filing date: 2003-05-02
Publication date: 2004-11-25

Abstract

PROBLEM TO BE SOLVED: To provide a structured document summarizing apparatus that lets a user efficiently grasp the outline of a document by presenting keywords or a summary sentence extracted from a structured document.

SOLUTION: The structured document summarizing apparatus separates documents of various structures into structured information and text information and converts them into a unified format. It extracts keywords and important sentences from the separated text information using existing technology, then re-evaluates them by referring to the separated structured information. Finally, it combines the re-evaluated keywords and important sentences to generate a summary of the structured document.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、構造化文書要約装置、プログラムおよび記録媒体に関し、特に、構造化文書からキーワードや要約文を提供することによって、ユーザが効率的に文書の概要を掴むことができる技術に関する。
【０００２】
【従来の技術】
ハードウェアの飛躍的進歩によって、コンピュータ上に蓄積される文書は飛躍的に増大している。特に、文書の全文をコンピュータ上のデータとして蓄積することが容易になり、膨大な量のテキストデータが蓄積され始めている。
特に、近年コンピュータネットワークが発達し、ＷＷＷ技術を用いて個人がホームページを通して情報発信を行うことができるようになり、誰もが情報提供者になれるようになった。
このように、膨大な量の文書データの中から、自分の望む内容の文書を素早く探し出す文書検索の問題が大きくクローズアップされてきている。
【０００３】
しかしながら、検索結果の中のどの文書が自分の望む内容の文書であるかを確かめるには、文書の全文を読まなければ分からないが、１文書の量が大きい場合や１文書の量はさほどではないが検索された文書数が多量の場合には、確めるのが困難である。
このような場合には、文書の要約文を作成するか、または文書の中の重要文を抽出して確めることになる。
【０００４】
従来、要約文を作成するときには、文書中に出現する単語の統計情報（出現頻度等）から重要語を抽出し、また、検索式中の単語も重要語と考えて、これらの重要語を含む文を抽出・合成して作成していた。
しかし、このような方法では、検索に使用した単語が要約対象の文書中の全文至るところに含まれていれば、全文が要約文として残ってしまうため、要約文を作成することができない。
【０００５】
また、特許文献１の「文書要約方法および装置と文書要約プログラムおよび該プログラムを記録した記録媒体」では、入力された文書を形態素解析し、要約種別に応じて要約の手がかりとして必要な単語集合を文書から抽出するとともに、文書を複数の意味的なまとまりに分割し、各意味的なまとまりについて単語集合に含まれる単語の出現密度の高い重要部分を算出し、この重要部分から要約率に応じて文を抽出する。
これにより、単語の出現密度を考慮した重要性に基づき精度の高い要約を要約種別に応じて生成することができる。
【０００６】
【特許文献１】
特開２００２−２５９３７１号公報
【特許文献２】
特開平１１−１８４８６５号公報
【特許文献３】
特開２００１−５２０３２号公報
【０００７】
【発明が解決しようとする課題】
一方、近年作成される文書は、ワープロ文書にしろ、ウェッブ上で公開される文書にしろ構造化された文書となっている。
例えば、ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）で作成された文書は、次のようなタグによって文書構造を明示的に表している。
見出しを＜Ｈ１＞〜＜Ｈ６＞により指定、
段落を＜Ｐ＞により指定、
列挙文（データ羅列）を＜ＵＬ＞，＜ＬＩ＞により指定、
表を＜ＴＡＢＬＥ＞，＜ＴＲ＞により指定、
図を＜ＩＭＧ＞，＜ＯＢＪＥＣＴ＞により指定、図の説明を＜ＡＬＴ＞で指定。
【０００８】
また、Ｍｉｃｒｏｓｏｆｔ社のＷｏｒｄのようなワープロソフトで作成されたＲｉｃｈＴｅｘｔ形式の文書は、次のようなコントロールコードによって文書構造を明示的に表している。
見出しは、￥ｐａｒと￥ｌｅｖｅｌＮとフォント属性，
あるいは￥ｓｕｂｄｏｃｕｍｅｎｔにより指定、
表は、￥ｐａｒ，￥ｒｏｗ，￥ｃｅｌｌより指定、
図は、￥ｐｉｃｔにより指定。
【０００９】
一般に、要約文を作成する上で必要となる重要語や重要文は、文書の特定の場所に存在することがあるので、上記のような文書構造情報を使用することにより効果的に抽出することができる。
しかし、従来の要約文作成、例えば、上述の特許文献１の技術では、文書をプレーンテキストとして扱い、文章解析と統計情報を使って作成するだけであり、文書の作成者の意図が現れる文書構造や文字修飾（強調文字、下線、文字の大きさ、字体等）については考慮していない。
【００１０】
これに対して、特許文献２の「文書要約装置」では、ＨＴＭＬ文書からタグを削除したプレーンテキストに対して、従来の技術により重要語を抽出し、この重要語がどのタグ中に出現したかにより重要語の重要度の変化量を計算する。この変化量の計算対象となるタグとしては、タイトルや文字修飾関係のものである。次に、文に含まれる単語の重要度の総和を求めて、文の重要度とし、この文の重要度の高い文を集めて要約文としている。
【００１１】
また、特許文献３の「要約文作成方法及び装置及び要約文作成プログラムを格納した記憶媒体」では、入力された入力テキストのタイトル、本文、表を含む文章やユーザからの入力情報の構造を解析し、出現位置と単語の係り受け関係で単語の重要度を定義した重要単語定義テーブルを用いて単語の出現場所、重要度、重要語、該重要語の属性を抽出し、抽出した重要語に対し、出現場所と重要度で重み付けをして要約で用いる要約語を選択し、選択された要約語の属性に応じて生成する文のテンプレートを選択し、選択されたテンプレートに要約語を埋め込んで要約文を生成している。
これにより、有用な言葉のみで要約文を作成することができる。
【００１２】
しかし、これらの特許文献２および特許文献３の技術では、要約文を作成するときに、文書の構造化情報を使用してはいるが、作成された要約文は、原文書とは別に表示されるものであり、要約文に使われた要約語がどのような文脈で出現しているかを確かめることができない。
また、これらの技術が要約文作成の対象としている文書は、電子メールやＨＴＭＬ文書に限られているので、他の構造をもった構造化文書に対しては、その構造を処理できるようにプログラムを作り直さなければならない。
【００１３】
本発明は、上述の実情を考慮してなされたものであって、要約文を作成するときに文書の構造化情報を有効に使用し、精度の高い要約文を作成できるとともに、作成された要約文が原文書のどのような文脈中に出現しているかを確かめられる構造化文書要約装置、プログラムおよび記録媒体を提供することを目的とする。
さらに、本発明は、異なる構造を持った構造化文書を文書構造に依存しない統一した形式に変換するようにして、要約文作成を統一的に処理できる構造化文書要約装置、プログラムおよび記録媒体を提供することを目的とする。
【００１４】
【課題を解決するための手段】
上記の課題を解決するために、本発明の請求項１は、構造化文書を入力する入力部と、この入力された構造化文書を解析して、キーワード抽出および重要文抽出に適した、構造化情報とテキスト情報に分離する構造化文書パーサーと、前記分離されたテキスト情報のみを対象にキーワードと重要文の抽出を行うキーワード・重要文抽出部と、前記キーワードと重要文とから要約文を生成する要約出力部とを有することを特徴とする。
【００１５】
また、本発明の請求項２は、請求項１に記載の構造化文書要約装置において、前記構造化文書パーサーは、文書構造の異なる種類ごとに用意し、前記入力された構造化文書の種類に応じて適切な構造化文書パーサーが選択されるようにしたことを特徴とする。
【００１６】
また、本発明の請求項３は、請求項１または２に記載の構造化文書要約装置において、前記構造化情報を参照して前記テキスト情報からキーワードや重要文の抽出には無意味なテキストを削除するフィルタリング部を有することを特徴とする。
【００１７】
また、本発明の請求項４は、請求項１、２または３に記載の構造化文書要約装置において、前記キーワード・重要文抽出部は、前記抽出されたキーワードと重要文に対し、前記構造化情報を参照して再評価したキーワードと重要文を選定するようにしたことを特徴とする。
【００１８】
また、本発明の請求項５は、請求項１乃至４のいずれかに記載の構造化文書要約装置において、前記要約出力部は、前記抽出された重要文と、この重要文が属する構造の見出しとを要約文として出力するようにしたことを特徴とする。
【００１９】
また、本発明の請求項６は、請求項５に記載の構造化文書要約装置において、前記要約出力部は、前記入力した構造化文書の中で前記重要文に該当する文字列を他と区別して出力するようにしたことを特徴とする。
【００２０】
また、本発明の請求項７は、請求項５に記載の構造化文書要約装置において、前記見出しが存在しないときには、前記抽出された重要文との共起率が高いキーワードを選定することを特徴とする。
【００２１】
また、本発明の請求項８は、コンピュータに、請求項１乃至７のいずれかに記載の構造化文書要約装置の機能を実行させるためのプログラムである。
また、本発明の請求項９は、請求項８に記載のプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００２２】
したがって、構造化文書の構造化情報を有効に使用して、精度の高い要約文を作成できる。
また、作成された要約文を原文書と一緒に表示するようにしたので、要約文が原文書のどのような文脈中に出現しているかを確かめることができる。
【００２３】
また、異なる構造を持った構造化文書を文書構造化情報とテキスト情報（プレーンテキスト）に分離して文書構造に依存しない統一した形式に変換するようにしたので、要約文を統一的に処理できるようになった。
この分離した文書構造化情報をもとに、分離されたテキスト情報からキーワードや重要文の抽出には無意味なテキストを削除することによって、キーワードや重要文の抽出精度が高くなり、より精度の高い要約文を作成できる。
【００２４】
また、構造化文書からテキスト情報を分離してあるので、キーワードや重要文の抽出には分離したテキスト情報（プレーンテキスト）を対象とする既存のキーワード抽出技術や要約文抽出技術を使うことができ、ここで求められたキーワードや重要文に対して文書構造化情報をもとに再評価して抽出精度をより高くすることができる。
【００２５】
【発明の実施の形態】
以下、図面を参照して、本発明の構造化文書要約装置の好適な実施形態について説明する。
図１は、本発明の実施形態に係る構造化文書要約装置の機能構成を示すブロック図であり、同図において、構造化文書要約装置は、構造化文書パーサー１０、フィルタリング処理部２０、言語処理部３０、キーワード・重要文抽出部４０、キーワード・重要文再評価部５０、要約出力部６０、構造化情報記憶部７０を少なくとも含んでいる。
【００２６】
（Ａ）構造化文書パーサーと構造化情報記憶部
構造化文書パーサー１０は、入力される文書構造の種類ごと（ＸＭＬ文書、ＨＴＭＬ文書、ワープロ文書、プレーンテキスト文書等）に用意し、入力された文書構造の種類により自動的に切り替わる。
文書構造の種類に対応した構造化文書パーサー１０は、入力された文書の構造化情報を解析して、文書構造、文字修飾情報およびこれらの情報を削除した抽出テキスト（プレーンテキスト）とに分け、文書の種類に依存しない統一したデータ構造をもつ構造化情報記憶部７０を生成する。
また、構造化情報記憶部７０は、フィルタリング処理部２０とキーワード・重要文再評価部５０から参照される。
以下、文書構造に関する情報および文字列の修飾情報とを含めて構造化情報という。
【００２７】
以下の説明では、入力する文書構造の種類をＨＴＭＬとしているが、これに限らず文書構造を持った文書であって、構造化情報記憶部７０のデータ構造が生成できるものであれば本発明を適用することができる。
プレーンテキストの場合は、１つの段落をもつＨＴＭＬ文書と考えれば、本発明を適用することができる。
【００２８】
構造化情報記憶部７０は、図２のようなＢｌｏｃｋＩｎｆｏデータ７１、構造化情報リスト７４、ＴｅｘｔＩｎｆｏデータ７３、文字情報リスト７５、抽出テキスト７２とからなるデータ構造を持っている。
抽出テキスト７２は、入力された文書から構造化情報を削除した、プレーンテキストからなり、メモリ空間上にリニアに格納される。また、この抽出テキスト７２の文字列は、このまま言語処理部３０、キーワード・重要文抽出部４０に渡される。
【００２９】
ＢｌｏｃｋＩｎｆｏデータ７１は、構造化タグ情報のｉｎｄｅｘ、Ｔｙｐｅへのポインタ、テキストｏｆｆｓｅｔ、テキスト長の項目で、１つの文・段落（以下、ブロックという）を管理する。
構造化タグ情報のｉｎｄｅｘは、構造化タグの文字列が登録されているタグテーブルへのインデックスである。例えば、ＨＴＭＬの場合、＜ｂｏｄｙ＞、＜ｐ＞、＜ｌｉ＞、＜ｈｒ＞、＜ｔｄ＞、＜ｈ１＞のようなタグの文字列をタグテーブルとして登録しておき、ＢｌｏｃｋＩｎｆｏデータ７１には構造化文書中に出現したタグの文字列の代わりに、タグテーブルへのインデックスを登録する。
テキストｏｆｆｓｅｔは、１つのブロックの始まりが抽出テキスト７２の何文字目から始まるかを示し、テキスト長はこのブロックの文字列の長さ（文字数）を表す。
ブロックには、例えば、ＨＴＭＬではＢｌｏｃｋＬｅｖｅｌＥｌｅｍｅｎｔに相当する見出し、本文、段落、列挙型、引用、表などの区別があり、この情報をブロック属性といい、図３のようにコード化している。
Ｔｙｐｅへのポインタは、このブロックに対するブロック属性を保持する構造化情報リスト７４へのポインタを表している。
この構造化情報リスト７４は、ブロックの中にブロックを含む、所謂ブロックのネストが可能なため、このネスト関係を線形リストで表している。
この構造化情報リスト７４への登録順序は、リストの最初が自ブロックのブロック属性、リストの２番目以降はネストの内側から外へ向って親ブロックのブロック属性を保持している。
また、ＢｌｏｃｋＩｎｆｏデータ７１に登録されるテキストｏｆｆｓｅｔの値が同じ場合には、ブロックのネストとして処理するため、外側のブロックは登録せずに、より内側のブロックのみを登録する。
したがって、構造化情報リスト７４のブロック属性値をたどることにより、あるブロックがどのようなブロックとネストしているかが分かる。
【００３０】
例えば、図４を用いてブロックに関する構造化情報の登録について説明する。図４において、文書の最外郭のブロックには、見出し１、段落１−１、段落１−２、段落１−３があり、この段落１−２はさらに見出し２、段落２−１、段落２−２、段落２−３を含んでいる。
また、このような構造化情報を削除したプレーンテキストを格納した抽出テキスト７２への各ブロックのテキストｏｆｆｓｅｔ値をそれぞれ０、オフセット１、オフセット２、オフセット３、オフセット４、オフセット５、オフセット６とする。
【００３１】
この図４の構造は、図５に示すようなＢｌｏｃｋＩｎｆｏデータ７１として構造化情報記憶部７０へ登録される。
段落１−２のテキストオフセットが見出し２のテキストオフセットと一致しており、見出し２の方が段落１−２よりネスト構造が深いので、段落１−２はＢｌｏｃｋＩｎｆｏデータ７１には登録されない。
図５にはＴｙｐｅへのポインタがすべて書き込まれていないが、見出し１や段落２−１に対応する構造化情報リストと同じようになる。
図５の構造化情報リストをみると、段落２−１の親ブロックは、「見出し属性」を持つことから、段落２−１は見出しブロックの中に含まれていることがわかる。
【００３２】
ＴｅｘｔＩｎｆｏデータ７３は、タグ情報のｉｎｄｅｘ、Ｔｙｐｅへのポインタ、テキストｏｆｆｓｅｔ、テキスト長の項目で、１つの文字列レベルの情報を管理する。
タグ情報のｉｎｄｅｘは、文字列を修飾するタグやパラメータの文字列が登録されているタグテーブル（このタグテーブルは、構造化タグのタグテーブルと一緒にしてもよい）へのインデックスである。例えば、ＨＴＭＬの場合、＜ｉ＞、＜ｂ＞、ｆｏｎｔ、ｓｉｚｅのような文字列をタグテーブルとして登録しておき、ＴｅｘｔＩｎｆｏデータ７３には構造化文書中に出現したタグの文字列の代わりに、タグテーブルへのインデックスを登録する。
テキストｏｆｆｓｅｔは、修飾された文字列の始まりが抽出テキスト７２の何文字目であるかを示し、テキスト長はこの文字列の長さ（文字数）を表す。
【００３３】
文字列の修飾には、例えば、文字サイズ（極小、小、中、大、極大）、強調、斜体、文字飾り（下線、取消し線）などの区別があり、これらはビットポジションにおいて重複のないようにコード化されている。
【００３４】
Ｔｙｐｅへのポインタは、文字列に対する文字修飾の属性値を保持する文字情報リスト７５へのポインタを表している。
この文字情報リスト７５は、文字修飾のネストが可能なため、このネスト関係を線形リストで表している。
この文字情報リスト７５への登録順序は、リストの最初が最内郭の文字修飾属性、リストの２番目以降は内郭から最外郭へ向かってネストされた文字修飾属性を保持している。
また、ＴｅｘｔＩｎｆｏデータ７３に登録されるテキストｏｆｆｓｅｔの値が同じ場合には、文字修飾のネストとして処理するため、外側の文字修飾属性は登録せずに、より内側の文字修飾属性を登録する。
【００３５】
したがって、文字情報リスト７５の文字修飾属性値をたどることにより、ある文字列がどのように重複した文字修飾属性をもっているかが分かる。
また、この１つの文字列に対してネストした文字修飾属性値の論理和を作成することによって、この文字列の文字修飾属性がビット操作で判断できる。
【００３６】
例えば、下記のようなＨＴＭＬ文書の場合は、図６に示したようなＴｅｘｔＩｎｆｏデータ７３と文字情報リスト７５が生成される。図６で構造化タグ情報のインデックスが空白なのは、構造化タグがない場合である。
【００３７】
＜ｂｏｄｙ＞
＜Ｉ＞Ｈｅ＜／Ｉ＞ｍｅｔ＜Ｂ＞ａｐｒｅｔｔｙ＜／Ｂ＞ｇｉｒｌ．
＜／ｂｏｄｙ＞
【００３８】
また、ＴｅｘｔＩｎｆｏデータ７３に登録されるテキストｏｆｆｓｅｔの値が同じ場合、例えば、下記のようなＨＴＭＬ文書の場合には、図７に示したようなＴｅｘｔＩｎｆｏデータ７３と文字情報リスト７５が生成される。このように、外側の文字修飾属性＜Ｂ＞は登録せずに、より内側の文字修飾属性＜Ｉ＞を登録する。
【００３９】
＜Ｂ＞＜Ｉ＞Ａｂｉｇｄｏｇ＜／Ｉ＞＜Ｂ＞ｒｕｎａｆｔｅｒｍｅ．
【００４０】
このようにして構造化情報記憶部７０に登録されたＢｌｏｃｋＩｎｆｏデータ７１および構造化情報リスト７４は、フィルタリング処理部２０、キーワード・重要文再評価部５０で使用される。
例えば、キーワード・重要文再評価部５０の場合には、見出し文の重要度を上げたり、単なるデータの羅列の場合は、重要度を下げるという操作を行うときにこの構造化情報を利用する。
また、ＴｅｘｔＩｎｆｏデータ７３および文字情報リスト７５は、キーワードの再評価時に使用される。
入力された文書構造の種類がプレーンテキストの場合には、上述のＢｌｏｃｋＩｎｆｏデータ７１およびＴｅｘｔＩｎｆｏデータ７３は作成されず、抽出テキスト７２だけが作成される。
【００４１】
上述のように、種々の構造化された文書がもつ文書構造化情報を文書構造の種類に依存しない統一的な形式に変換しておくことによって、要約文の処理が文書構造に依存せずに処理できる。
また、将来、新たな仕様の構造化文書が規定したとしても、その文書構造に対する構造化文書パーサーを用意するだけで、構造化文書の要約文を作成することができる。
【００４２】
（Ｂ）テキストのフィルタリング
Ｗｅｂ上でいろいろなＷｅｂページを閲覧すると、不要と思われる情報が本文の周りに散乱している。
構造化文書パーサー１０は、構造化文書を解析するときに、タグが持つ制御情報の大部分を削除してしまっている。しかし、ＨＥＡＤ、ＴＩＴＬＥ、リンク情報、スクリプト言語によるプログラムなどはこの時点でも残っている。
【００４３】
したがって、このように不要なものを削除するために、構造化文書の種類や文書内容ごとに不要情報テーブルを設けておき、フィルタリング処理部２０でこの不要情報テーブルを参照して、キーワードや重要文の抽出に不要と思われるテキストを除去して、構造化情報記憶部７０を更新する。
但し、リンク情報等が重要な意味を持つ場合があるので、必ずしも削除してよいとは限らない。これらの情報を削除するか否かは、構造化文書をとりまくｓｅｍａｎｔｉｃ情報に基づき決定し、不要情報テーブルを作成する。
【００４４】
フィルタリング処理部２０における不要な情報の判定は、不要情報テーブルに登録されているブロック属性と同じものが構造化情報記憶部７０のＢｌｏｃｋＩｎｆｏデータ７１および構造化情報リスト７４中にあるかを調べることによって行う。
不要情報テーブルに登録されているブロック属性と同じものが見つかったときには、ＢｌｏｃｋＩｎｆｏデータ７１からそのエントリを削除し、そのブロックに対応する抽出テキストを削除して、構造化情報記憶部７０を更新する。このとき、削除対象となったブロック中にあったＴｅｘｔＩｎｆｏデータ７３のエントリも削除する。
【００４５】
上述したように、構造化文書パーサー１０では削除できなかった、キーワードや重要文の抽出に不要なテキストを削除することができるので、キーワードや重要文の抽出精度がより高くなる。
【００４６】
（Ｃ）言語処理
言語処理部３０およびキーワード・重要文抽出部４０は、プレーンテキストを対象とした解析であるから、既存の技術で実現する。
言語処理部３０は、既存の形態素解析、句合成、構文解析技術により実現するものであって、構造化情報記憶部７０の抽出テキストに対して、形態素に分割し、個々の形態素に品詞を割り当て、形態素同志の接続規則に基づいて形態素を合成し、句レベル（名詞句、動詞句、副詞句、接続詞句．．．等）にまとめ上げる。この結果に対して、句文法に基づき、句レベルでの構文解析を行い、句間の修飾関係、主語、目的語を決定する。
【００４７】
（Ｄ）キーワード・重要文抽出
キーワード・重要文抽出部４０は、既存の技術、例えば、特開平９−３４９０５号公報に記載の方法によって、言語処理部３０で求められた句から名詞句を取り出し、その名詞句の出現頻度に基づく重みと名詞句に対する修飾度に基づく重みをもとに点数付けを行ってキーワードを抽出する。
次に、文内のキーワード間の重複度に基づき文間の関連度を評価し、他の文群との関連度の強さと関連の有無に基づいて文の重要度を求めて重要文を抽出する。
【００４８】
（Ｅ）キーワード・重要文の再評価
キーワード・重要文再評価部５０は、キーワード・重要文抽出部４０で求めた重要文に含まれるキーワードの重要度をこのキーワードの文字列にどのような文字修飾を施しているかを判断して再評価する。
このキーワードの文字列に文字修飾が施されているかは、構造化情報記憶部７０のＴｅｘｔＩｎｆｏデータのテキストｏｆｆｓｅｔとテキスト長を参照することによって判断できる。
次に、その文字列がどのような文字修飾属性を持っているのかは、文字情報リストのリンクにある属性値を辿ることによって判定することができる。
【００４９】
キーワードの重要度の再評価は、例えば、文字修飾の種類によって評価点を次のように変更する。
（１）強調文字ならば、評価点数を１０％上げる。
（２）文字サイズの大小により、評価点数を５％上げ下げする。
（３）文字修飾の種類に応じて、評価点数を５％上げ下げする。
【００５０】
次に、キーワード・重要文再評価部５０は、再評価したキーワードの重要度によって、重要文の重要度を再計算するとともに、この重要文がどのようなブロックに属しているかによって、重要文の重要度を再評価する。
この重要文がどのブロックに属しているかは、構造化情報記憶部７０のＢｌｏｃｋＩｎｆｏデータのテキストｏｆｆｓｅｔとテキスト長を参照することによって判断できる。次に、そのブロックがどのようなブロック属性を持っているのかは、構造化情報リストのリンクにあるブロック属性値を辿ることによって判定することができる。
【００５１】
再評価したキーワードによる重要文の重要度を再計算した後、さらに、重要文の重要度の再評価点を、例えば、ブロック属性の種類によって次のように変更する。
（１）見出しならば、評価点数を２０％上げる。
（２）列挙文ならば、評価点数は２０％下げる。
（３）図、表内の文ならば、評価点数は２５％下げる。
【００５２】
上述したように、プレーンテキストに対して、既存の技術で求められたキーワードや重要文に対して、文書構造化情報を利用して再評価することができるので、キーワードや重要文の抽出精度をより高くすることができる。
【００５３】
（Ｆ）要約文の生成と出力
要約出力部６０は、再評価された重要文の重要度の高いものから重要文を、文書サイズと抽出割合（例えば、文書サイズに対する要約文サイズの割合を指定しておく）と抽出単位（例えば、文字数、単語数または文数等）に応じて決定された要約文のサイズ以内に限定して選択し、その選択した重要文とこの重要文の属するブロックの見出しブロックとを組として文書の要約文とする。
この要約文を出力するときには、表示されたもとの構造化文書の中で要約文に選択した重要文および見出しに該当する文字列を他と区別して、例えば、色を変えて出力する（図１０参照）。
これにより、要約文がもとの文書のどのような文脈中に出現しているかを確かめることができる。
【００５４】
また、要約文をもとの構造化文書の表示とは別に出力するときには、（見出し、重要文）を１つの組として、先に述べた要約文のサイズに応じて、複数組を生成し、文書の要約文とする。
このとき、見出し（ブロック属性が見出しのもの）が存在しない場合、キーワードを組み合わせて見出しの代わりとする。見出しの代わりになるキーワードは、組になる重要文との共起率が高いものをキーワード・重要文抽出部４０で求めたキーワード集合から選択する。
重要文と共起率が高いキーワードとは、重要文に含まれるキーワード集合の中で、重要文中での出現頻度が高いものである。
【００５５】
これにより、抽出したキーワードや重要文を別個に扱うのではなく、組み合わせて要約文とすることで、より原文書の内容を的確に表現できるようになる。
【００５６】
（Ｇ）構造化文書要約装置の処理手順
次に、本発明の構造化文書要約装置をＷｅｂブラウザに組み込んだ場合（例えば、プラグインとして構造化文書要約装置を組み込む）の要約文処理手順を図８のフローチャートを用いて説明する。
先ず、ユーザはＷｅｂブラウザを使って構造化文書を読み込ませる（ステップＳ１）。
例えば、Ｗｅｂブラウザで読み込んだ構造化文書は、図９のように表示される。同図において、「重要文ボタン」は、表示された構造化文書から重要文を抽出して、要約文として表示させるのに使う。また、「キーワードボタン」は、表示された構造化文書から抽出した重要なキーワードを表示させるのに使う。
【００５７】
次に、ユーザが要約文を表示させるために「重要文ボタン」をクリックすると、読み込まれた構造化文書を解析するために、文書構造の種類に応じた構造化文書パーサー１０が起動して、読み込まれた構造化文書を構造化情報とプレーンテキストに分離し、この分離結果を図２に示したようなデータ構造に格納して構造化情報記憶部７０へ記憶する（ステップＳ２）。
【００５８】
例えば、読み込まれた文書の文書構造を判断して、文書がＨＴＭＬ文書であれば、ＨＴＭＬパーサー、文書がＳＧＭＬ文書であればＳＧＭＬパーサー、文書がワープロ文書であればワープロ文書パーサーがそれぞれ自動的に起動される。
図９に表示された文書の場合には、見出しの文字列がボールドで本文のフォントよりサイズが大きいことから、最も単純な構造化テキストの例になっている。この構造化情報記憶部７０に格納されたデータは、文書構造の種類に関係なく、統一されたデータ構造に出力される。
【００５９】
構造化情報記憶部７０に格納されたデータのうち、抽出されたプレーンテキストに対してフィルタリングする（ステップＳ３）。
構造化情報記憶部７０に格納されたデータの抽出されたプレーンテキストには、構造化文書に埋め込まれた制御情報がすでに取り除かれているので、ここでのフィルタリングとは、プレーンテキストから構造化情報（図２のＢｌｏｃｋＩｎｆｏデータ）に基づき、キーワードや重要文の抽出に意味のないテキストを削除することを意味している。
例えば、このフィルタリングは、ＨＴＭＬ文書であれば、リンク情報等を削除することである。
【００６０】
次に、フィルタリング処理されたプレーンテキストを入力して、形態素解析、句合成、構文解析等の基本的言語処理を施す（ステップＳ４）。
この合成された句（主に名詞句）を対象にして、特開平９−３４９０５号公報に基づいた方法でキーワードおよびそのキーワードに基づいた重要文を抽出する（ステップＳ５）。
【００６１】
得られたキーワードについて、キーワードの文字列に施された文字修飾に基づいて、キーワードの重要度を再評価し、この再評価されたキーワードに基づいて重要文の重要度を再計算するとともに、重要文が属するブロック属性（構造化情報）に基づき重要文の再評価を行う（ステップＳ６）。
【００６２】
再評価された重要文の重要度の高いものから重要文を、文書サイズと抽出割合（例えば、文書サイズに対する要約文サイズの割合を指定しておく）と抽出単位（例えば、文字数、単語数または文数等）に応じて決定された要約文のサイズ以内に限定して選択し、その選択した重要文および関連する見出しを組として文書の要約文とし、読み込んだ構造化文書の中で要約文に該当する文字列を他と区別して、例えば、色を変えてＷｅｂブラウザ上に表示する（ステップＳ７）
図９に表示された構造化文書の場合には、図１０に示すように要約文の選択された重要文をハイライティングして表示している。
【００６３】
また、要約文をもとの構造化文書の表示とは別のウインドウに表示するときには、（見出し、重要文）を１つの組として、要約文のサイズに応じて、複数組を生成し、文書の要約文として表示する。
このとき、見出し（ブロック属性が見出しのもの）が存在しない場合、キーワードを組み合わせて見出しの代わりとする。見出しの代わりになるキーワードは、組になる重要文に含まれるキーワード集合の中で、この重要文中での出現頻度が高いキーワードを選択する。
【００６４】
（Ｈ）プログラムおよび記録媒体
本発明は、上述した実施形態のみに限定されたものではない。上述した実施形態の構造化文書要約装置を構成する各機能をそれぞれプログラム化し、予めＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、コンピュータに搭載したＣＤ−ＲＯＭドライブのような媒体駆動装置にこのＣＤ−ＲＯＭ等を装着して、これらのプログラムをコンピュータのメモリあるいは記憶装置に格納し、それを実行することによって、本発明の目的が達成されることは言うまでもない。
この場合、記録媒体から読み出されたプログラム自体が上述した実施形態の機能を実現することになり、そのプログラムおよびそのプログラムを記録した記録媒体も本発明を構成することになる。
【００６５】
なお、プログラムを格納する記録媒体としては半導体媒体（例えば、ＲＯＭ、不揮発性メモリ等）、光媒体（例えば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ等）、磁気媒体（例えば、磁気テープ、フレキシブルディスク等）等のいずれであってもよい。
【００６６】
また、ロードしたプログラムを実行することにより上述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステムあるいは他のアプリケーションプログラム等と共同して処理することによって上述した実施形態の機能が実現される場合も含まれる。
【００６７】
また、市場にプログラムを流通させる場合には、プログラムを格納した可搬型の記録媒体を流通させたり、インターネット等の通信網を介して接続されたサーバコンピュータの記憶装置にプログラムを格納しておき、通信網を通じて他のコンピュータにそのプログラムを転送することによっても流通させることができる。この場合、このサーバコンピュータの記憶装置も本発明の記録媒体に含まれる。
【００６８】
この場合、コンピュータでは、可搬型の記録媒体上のプログラム、または転送されてくるプログラムを、コンピュータに接続した記憶装置にインストールし、そのインストールされたプログラムを実行することによって上述した実施形態の機能が実現される。
【００６９】
尚、本発明は上述した実施形態に限定されることなく、本発明の要旨を逸脱しない範囲内で各種の変形、修正が可能であるのは勿論である。
【００７０】
【発明の効果】
以上説明したように本発明によると、構造化文書の構造化情報を有効に使用して、精度の高い要約文を作成できる。
また、作成された要約文を原文書と一緒に表示するようにしたので、要約文が原文書のどのような文脈中に出現しているかを確かめることができる。
【００７１】
また、異なる構造を持った構造化文書を文書構造化情報とテキスト情報（プレーンテキスト）に分離して文書構造に依存しない統一した形式に変換するようにしたので、要約文を統一的に処理できるようになった。
この分離した文書構造化情報をもとに、さらに分離されたテキスト情報からキーワードや重要文の抽出には無意味なテキストを削除することによって、キーワードや重要文の抽出精度が高くなり、より精度の高い要約文を作成できる。
【００７２】
また、構造化文書からテキスト情報を分離してあるので、キーワードや重要文の抽出には分離したテキスト情報（プレーンテキスト）を対象とする既存のキーワード抽出技術や要約文抽出技術を使うことができ、ここで求められたキーワードや重要文に対して文書構造化情報をもとに再評価して抽出精度をより高くすることができる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る構造化文書要約装置の機能構成を示すブロック図である。
【図２】構造化情報記憶部のデータ構造を説明するための図である。
【図３】ブロック属性値の例である。
【図４】ネスト構造を持ったブロックを含む構造化文書の例である。
【図５】図４の構造化文書のブロック構造を図２のデータ構造に格納した場合の例である。
【図６】文字修飾を施された文字列に対する文字列修飾属性を図２のデータ構造に格納した場合の例である。
【図７】文字修飾を施された文字列に対する文字列修飾属性を図２のデータ構造に格納した場合の例である。
【図８】本発明の構造化文書要約装置をＷｅｂブラウザに組み込んだ場合の要約文処理手順を示すフローチャートである。
【図９】構造化文書をＷｅｂブラウザで表示した例である。
【図１０】図９にあげた文書に対して、要約文を抽出し、もとの構造化文書一緒に要約文も表示した例である。
【符号の説明】
１０…構造化文書パーサー、２０…フィルタリング処理部、３０…言語処理部、４０…キーワード・重要文抽出部、５０…キーワード・重要文再評価部、６０…要約出力部、７０…構造化情報記憶部、７１…ＢｌｏｃｋＩｎｆｏデータ、７２…抽出テキスト、７３…ＴｅｘｔＩｎｆｏデータ、７４…構造化情報リスト、７５…文字情報リスト。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a structured document summarizing apparatus, a program, and a recording medium, and more particularly to a technique that lets a user efficiently grasp the outline of a document by providing keywords or a summary sentence from a structured document.
[0002]
[Prior art]
With dramatic advances in hardware, the number of documents stored on computers has increased dramatically. In particular, it has become easy to store the full text of a document as data on a computer, and an enormous amount of text data has begun to accumulate.
In recent years especially, computer networks have developed, individuals have become able to publish information through home pages using WWW technology, and anyone can now be an information provider.
As a result, the document retrieval problem of quickly finding a document with the desired content in a vast amount of document data has drawn a great deal of attention.
[0003]
However, to determine which documents in the search results have the desired content, the user has to read each document in full. This is difficult when a single document is long, or when individual documents are not particularly long but a large number of documents are retrieved.
In such cases, a summary sentence of the document is created, or important sentences in the document are extracted, and these are checked instead.
[0004]
Conventionally, a summary sentence was created by extracting important words from statistical information (frequency of appearance, etc.) on the words appearing in the document, also treating the words in the search expression as important words, and then extracting and combining the sentences that contain these important words.
With this method, however, if the words used for the search appear throughout the entire text of the document to be summarized, the entire text remains as the summary, so no summary sentence can be created.
[0005]
In the "document summarizing method and apparatus, document summarizing program, and recording medium on which the program is recorded" of Patent Document 1, the input document is morphologically analyzed, and a set of words needed as clues for summarization is extracted from the document according to the summary type. The document is also divided into a plurality of semantic units, an important part in which words from the word set appear with high density is calculated for each semantic unit, and sentences are extracted from this important part according to the summarization rate.
Thus, a highly accurate summary can be generated according to the summary type, based on importance that takes the word appearance density into account.
[0006]
[Patent Document 1]
JP-A-2002-259371
[Patent Document 2]
JP-A-11-184865
[Patent Document 3]
JP-A-2001-52032
[0007]
[Problems to be solved by the invention]
On the other hand, recently created documents are structured documents regardless of whether they are word-processor documents or documents published on the Web.
For example, a document written in HTML (HyperText Markup Language) explicitly expresses its document structure with the following tags.
Headings are specified by <H1> to <H6>,
Paragraphs are specified by <P>,
Enumerations (data lists) are specified by <UL> and <LI>,
Tables are specified by <TABLE> and <TR>,
Figures are specified by <IMG> and <OBJECT>, and the description of a figure is specified by <ALT>.
[0008]
Likewise, a RichText format document created with word processing software such as Microsoft Word explicitly expresses its document structure with the following control codes.
Headings are specified by \par, \levelN, and font attributes,
or by \subdocument,
Tables are specified by \par, \row, and \cell,
Figures are specified by \pict.
[0009]
In general, the important words and important sentences needed to create a summary sentence tend to be found at particular places in a document, so they can be extracted effectively by using document structure information such as the above.
However, conventional summary creation, for example the technique of Patent Document 1 described above, treats a document as plain text and creates the summary using only sentence analysis and statistical information. It does not consider the document structure or character modification (emphasized characters, underlining, character size, typeface, etc.), in which the intent of the document's author appears.
[0010]
In contrast, the "document summarizing apparatus" of Patent Document 2 extracts important words by a conventional technique from the plain text obtained by deleting the tags from an HTML document, and calculates the amount of change in the importance of each important word according to the tag in which the word appeared. The tags covered by this calculation are those related to titles and character modification. Next, the sum of the importance of the words contained in a sentence is taken as the importance of that sentence, and the sentences with high importance are collected to form the summary sentence.
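The scoring scheme attributed to Patent Document 2 can be illustrated with a minimal sketch. The base importance values, the tag names, and the per-tag change amounts below are all hypothetical numbers chosen for illustration; the description does not give concrete values, and treating the change as an additive bonus is an assumption.

```python
# Hypothetical base importance of each important word, as produced by
# some conventional plain-text method (values are illustrative only).
BASE_IMPORTANCE = {"summary": 2.0, "document": 1.0}

# Hypothetical change in importance when a word occurs inside a tag.
TAG_CHANGE = {"title": 1.0, "b": 0.5}


def word_importance(word, tags_seen):
    """Base importance plus the change for every tag the word appeared in."""
    if word not in BASE_IMPORTANCE:
        return 0.0  # not an important word
    change = sum(TAG_CHANGE.get(tag, 0.0) for tag in tags_seen.get(word, ()))
    return BASE_IMPORTANCE[word] + change


def sentence_importance(sentence, tags_seen):
    """Sentence importance is the sum of the importance of its words."""
    return sum(word_importance(w, tags_seen) for w in sentence.lower().split())


# "summary" also appeared inside a title tag, so its importance is raised.
tags_seen = {"summary": ["title"]}
sentences = ["the document is long", "a summary helps"]
ranked = sorted(sentences, key=lambda s: sentence_importance(s, tags_seen), reverse=True)
print(ranked[0])
```

Sentences at the top of `ranked` would be the ones collected into the summary sentence.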
[0011]
In the "summary sentence creation method and apparatus, and storage medium storing a summary sentence creation program" of Patent Document 3, the structure of the input text (sentences including a title, a body, and tables) and of information input by the user is analyzed. Using an important word definition table that defines the importance of a word by its appearance position and dependency relations, the appearance location, importance, important words, and attributes of the important words are extracted. The extracted important words are weighted by appearance location and importance to select the summary words used in the summary, a sentence template is selected according to the attributes of the selected summary words, and the summary words are embedded in the selected template to generate the summary sentence.
This makes it possible to create a summary sentence using only useful words.
[0012]
However, although the techniques of Patent Document 2 and Patent Document 3 use the structured information of the document when creating a summary sentence, the created summary sentence is displayed separately from the original document, so it is impossible to confirm in what context the summary words used in the summary sentence appear.
In addition, the documents for which these techniques create summary sentences are limited to e-mail and HTML documents, so for structured documents having other structures, the program must be rewritten so that it can process those structures.
[0013]
The present invention has been made in view of the above circumstances. An object of the present invention is to provide a structured document summarizing apparatus, a program, and a recording medium that use the structured information of a document effectively when creating a summary sentence, can create a highly accurate summary sentence, and make it possible to confirm in what context of the original document the created summary sentence appears.
A further object of the present invention is to provide a structured document summarizing apparatus, a program, and a recording medium that convert structured documents having different structures into a unified format that does not depend on the document structure, so that summary sentence creation can be processed uniformly.
[0014]
[Means for Solving the Problems]
In order to solve the above problems, a first aspect of the present invention is characterized by comprising: an input unit that inputs a structured document; a structured document parser that analyzes the input structured document and separates it into structured information and text information suitable for keyword extraction and important sentence extraction; a keyword / important sentence extraction unit that extracts keywords and important sentences from the separated text information only; and a summary output unit that generates a summary sentence from the keywords and important sentences.
[0015]
According to a second aspect of the present invention, in the structured document summarizing apparatus according to the first aspect, a structured document parser is prepared for each different type of document structure, and the appropriate structured document parser is selected according to the type of the input structured document.
[0016]
According to a third aspect of the present invention, the structured document summarizing apparatus according to the first or second aspect has a filtering unit that refers to the structured information and deletes, from the text information, text that is meaningless for extracting keywords and important sentences.
[0017]
According to a fourth aspect of the present invention, in the structured document summarizing apparatus according to the first, second, or third aspect, the keyword / important sentence extraction unit re-evaluates the extracted keywords and important sentences by referring to the structured information and selects the re-evaluated keywords and important sentences.
[0018]
According to a fifth aspect of the present invention, in the structured document summarizing apparatus according to any one of the first to fourth aspects, the summary output unit outputs, as the summary sentence, the extracted important sentence and the heading of the structure to which the important sentence belongs.
[0019]
According to a sixth aspect of the present invention, in the structured document summarizing apparatus according to the fifth aspect, the summary output unit outputs the character string corresponding to the important sentence in the input structured document so that it is distinguished from the rest.
[0020]
According to a seventh aspect of the present invention, in the structured document summarizing apparatus according to the fifth aspect, when no heading exists, a keyword having a high co-occurrence rate with the extracted important sentence is selected.
[0021]
An eighth aspect of the present invention is a program for causing a computer to execute the function of the structured document summarizing apparatus according to any one of the first to seventh aspects.
According to a ninth aspect of the present invention, there is provided a computer-readable recording medium storing the program according to the eighth aspect.
[0022]
Therefore, a highly accurate summary sentence can be created by effectively using the structured information of the structured document.
In addition, since the created summary sentence is displayed together with the original document, it is possible to confirm in what context of the original document the summary sentence appears.
[0023]
In addition, since structured documents having different structures are separated into document structured information and text information (plain text) and converted into a unified format that does not depend on the document structure, summary sentences can be processed uniformly.
By deleting, on the basis of the separated document structured information, text that is meaningless for extracting keywords and important sentences from the separated text information, the extraction accuracy of keywords and important sentences increases, and a more accurate summary sentence can be created.
[0024]
In addition, since the text information is separated from the structured document, existing keyword extraction and summary sentence extraction techniques that target text information (plain text) can be used to extract keywords and important sentences, and the extraction accuracy can be raised further by re-evaluating the keywords and important sentences obtained in this way on the basis of the document structured information.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, a preferred embodiment of a structured document summarizing apparatus according to the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of a structured document summarizing apparatus according to an embodiment of the present invention. In FIG. 1, the structured document summarizing apparatus includes at least a structured document parser 10, a filtering processing unit 20, a language processing unit 30, a keyword / important sentence extraction unit 40, a keyword / important sentence re-evaluation unit 50, a summary output unit 60, and a structured information storage unit 70.
[0026]
(A) Structured document parser and structured information storage
A structured document parser 10 is prepared for each type of input document structure (XML document, HTML document, word processor document, plain text document, etc.), and the parser is switched automatically according to the type of the input document structure.
The structured document parser 10 corresponding to the type of document structure analyzes the structured information of the input document, separates it into the document structure, the character modification information, and the extracted text (plain text) from which this information has been deleted, and generates the structured information storage unit 70, which has a unified data structure that does not depend on the type of document.
The structured information storage unit 70 is referred to by the filtering processing unit 20 and the keyword / important sentence re-evaluation unit 50.
Hereinafter, the information on the document structure and the character string modification information are collectively referred to as structured information.
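The parser selection and the separation into extracted text and structure records might be sketched as follows. This is only an illustration under stated assumptions: the `Separated` record, the `BLOCK_TAGS` set, the `PARSERS` registry, and the type keys "html" and "text" are hypothetical names that do not appear in the description, and the HTML handling relies on Python's standard `html.parser` module rather than on any parser named in the source. Plain text is treated as a single paragraph, following paragraph [0027].

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical set of tags treated as block boundaries in this sketch.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td"}


@dataclass
class Separated:
    """Unified result: extracted text plus (tag, offset, length) records."""
    text: str = ""
    blocks: list = field(default_factory=list)


class HtmlSeparator(HTMLParser):
    """Separates HTML into plain text and block-structure records."""

    def __init__(self):
        super().__init__()
        self.result = Separated()
        self._open = []  # stack of (tag, start offset) for open block tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._open.append((tag, len(self.result.text)))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, start = self._open.pop()
            length = len(self.result.text) - start
            self.result.blocks.append((tag, start, length))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs between tags
            self.result.text += data


def parse_html(source):
    separator = HtmlSeparator()
    separator.feed(source)
    separator.close()
    return separator.result


def parse_plain(source):
    # Plain text is regarded as a document with a single paragraph.
    return Separated(text=source, blocks=[("p", 0, len(source))])


# One parser per document type; the keys are hypothetical type names.
PARSERS = {"html": parse_html, "text": parse_plain}


def separate(source, doc_type):
    """Select the parser for the document type and run it."""
    return PARSERS[doc_type](source)


result = separate("<h1>Title</h1><p>First paragraph.</p>", "html")
print(result.text, result.blocks)
```

Supporting a new document type in this sketch only requires registering another function in `PARSERS`; the code that consumes `Separated` stays unchanged, which is the point of the unified format.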
[0027]
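In the following description, the type of the input document structure is HTML. However, the present invention is not limited to this and can be applied to any document that has a document structure and from which the data structure of the structured information storage unit 70 can be generated.
Plain text can also be handled by the present invention if it is regarded as an HTML document with a single paragraph.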
[0028]
The structured information storage unit 70 has a data structure including BlockInfo data 71, structured information list 74, TextInfo data 73, character information list 75, and extracted text 72 as shown in FIG.
The extracted text 72 consists of plain text obtained by deleting the structured information from the input document, and is stored linearly in memory. The character string of the extracted text 72 is passed as is to the language processing unit 30 and the keyword / important sentence extraction unit 40.
[0029]
The BlockInfo data 71 manages one sentence or paragraph (hereinafter referred to as a block) using four items: an index of structured tag information, a pointer to Type, a text offset, and a text length.
The index of the structured tag information is an index into a tag table in which the character strings of structured tags are registered. For example, in the case of HTML, the character strings of tags such as <body>, <p>, <li>, <hr>, <td>, and <h1> are registered in the tag table, and the BlockInfo data 71 registers an index into the tag table instead of the character string of the tag that appears in the structured document.
The text offset indicates the character position in the extracted text 72 at which the block starts, and the text length indicates the length (number of characters) of the character string of the block.
Blocks include, for example, headings, body text, paragraphs, enumerations, citations, tables, and the like, corresponding to block-level elements in HTML. This information is called a block attribute and is coded as shown in FIG. 3.
The pointer to Type is a pointer to the structured information list 74, which holds the block attribute for this block.
The structured information list 74 allows nesting, in which a block contains other blocks, and this nesting relationship is represented by a linear list.
Entries are registered in the structured information list 74 so that the first entry holds the block attribute of the block itself, and the second and subsequent entries hold the block attributes of its parent blocks, from the inside of the nest outward.
If two blocks would be registered in the BlockInfo data 71 with the same text offset, the blocks are nested. In that case, only the inner block is registered and the outer block is not.
Therefore, by tracing the block attribute values in the structured information list 74, it is possible to know which blocks a given block is nested in.
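As an illustration only, the BlockInfo entry and its structured information list described above might be modeled as follows. This is a minimal sketch under assumptions: the names `BlockInfo`, `TAG_TABLE`, `block_text`, and `enclosing_attributes` are hypothetical, the attribute strings stand in for the coded values of FIG. 3, and a Python list stands in for the linear list reached through the pointer to Type.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tag table: BlockInfo stores an index into it, not the tag string.
TAG_TABLE = ["<body>", "<p>", "<li>", "<hr>", "<td>", "<h1>"]


@dataclass
class BlockInfo:
    tag_index: int        # index of structured tag information
    type_list: List[str]  # own block attribute first, then parents, inner to outer
    text_offset: int      # start position of the block in the extracted text
    text_length: int      # number of characters in the block


def block_text(extracted_text: str, block: BlockInfo) -> str:
    """Return the character string of a block from the extracted text."""
    return extracted_text[block.text_offset:block.text_offset + block.text_length]


def enclosing_attributes(block: BlockInfo) -> List[str]:
    """Return the attributes of the blocks that enclose this block, inner to outer."""
    return block.type_list[1:]


# Assumed sample data: a heading followed by one paragraph nested under it.
extracted_text = "OverviewThis is the first paragraph."
heading = BlockInfo(TAG_TABLE.index("<h1>"), ["heading"], 0, 8)
paragraph = BlockInfo(TAG_TABLE.index("<p>"), ["paragraph", "heading"], 8, 28)
```

Because each entry carries only offsets and attribute codes, the extracted text itself stays free of markup and can be handed to a plain-text analyzer unchanged.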
[0030]
For example, registration of structured information on a block will be described with reference to FIG. 4. In FIG. 4, the outermost block of the document includes a heading 1, a paragraph 1-1, a paragraph 1-2, and a paragraph 1-3, and the paragraph 1-2 further includes a heading 2, a paragraph 2-1, a paragraph 2-2, and a paragraph 2-3.
The text offsets of these blocks into the extracted text 72, which stores the plain text from which the structured information has been deleted, are 0, offset 1, offset 2, offset 3, offset 4, offset 5, and offset 6, respectively.
[0031]
The structured document of FIG. 4 is registered in the structured information storage unit 70 as BlockInfo data 71 as shown in FIG. 5.
Since the text offset of paragraph 1-2 matches the text offset of heading 2, and heading 2 has a deeper nest structure than paragraph 1-2, paragraph 1-2 is not registered in BlockInfo data 71.
FIG. 5 does not show every pointer to Type, but the omitted ones have the same structure as the structured information lists shown for heading 1 and paragraph 2-1.
Looking at the structured information list in FIG. 5, it can be seen that the parent block of paragraph 2-1 has a "header attribute", so that paragraph 2-1 is included in the heading block.
[0032]
The TextInfo data 73 manages character-string-level information using four items: an index of tag information, a pointer to Type, a text offset, and a text length.
The index of the tag information is an index into a tag table in which the character strings of tags and of the parameters that modify character strings are registered (this tag table may be combined with the tag table of the structured tags). For example, in the case of HTML, character strings such as <i>, <b>, font, and size are registered in the tag table, and the TextInfo data 73 registers an index into the tag table instead of the character string of the tag that appears in the structured document.
The text offset indicates which character of the extracted text 72 is the beginning of the modified character string, and the text length indicates the length (number of characters) of this character string.
[0033]
Character string modifications include, for example, character size (minimal, small, medium, large, maximal), emphasis, italics, and character decoration (underline, strikethrough), each of which is coded as an attribute value.
[0034]
The pointer to Type represents a pointer to the character information list 75 that holds the attribute value of the character modification for the character string.
Since character modifications can be nested, this nesting relationship is represented in the character information list 75 by a linear list.
Entries are registered in the character information list 75 so that the first entry holds the innermost character modification attribute, and the second and subsequent entries hold the enclosing character modification attributes, from the innermost to the outermost.
If two character modifications would be registered in the TextInfo data 73 with the same text offset, they are treated as nested, and only the inner character modification attribute is registered; the outer one is not.
[0035]
Therefore, by following the character modification attribute values in the character information list 75, it is possible to know which overlapping character modification attributes a given character string has.
Also, by taking the logical sum (bitwise OR) of the nested character modification attribute values for a character string, the character modification attributes of that string can be determined by bit manipulation.
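The logical-sum step can be sketched as follows, assuming each character modification attribute is coded as a single bit. The flag names and bit values in `CharAttr` are hypothetical and are not the coded values used in the figures.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class CharAttr(IntFlag):
    """Hypothetical bit coding of character modification attributes."""
    NONE = 0
    EMPHASIS = 1
    ITALIC = 2
    UNDERLINE = 4
    STRIKETHROUGH = 8


def combined_attributes(char_info_list):
    """OR together the nested attribute values (innermost first)."""
    return reduce(or_, char_info_list, CharAttr.NONE)


# A string that is italic inside bold: innermost attribute first.
attrs = combined_attributes([CharAttr.ITALIC, CharAttr.EMPHASIS])
is_emphasized = bool(attrs & CharAttr.EMPHASIS)
is_underlined = bool(attrs & CharAttr.UNDERLINE)
```

With this coding, a single AND against the combined value answers whether a given modification applies, without walking the list again.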
[0036]
For example, in the case of the following HTML document, the TextInfo data 73 and the character information list 75 as shown in FIG. 6 are generated. In FIG. 6, the index of the structured tag information is blank when there is no structured tag.
[0037]
<BODY>
<I>He</I> met <B>a pretty</B> girl.
</BODY>
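How such a fragment could be separated into extracted text and TextInfo entries is sketched below using Python's standard `html.parser` module. This is not the parser of the embodiment: the class name `TextInfoParser`, the set `INLINE_TAGS`, and the tuple layout (tag, text offset, text length) are assumptions made for illustration, and the input is a simplified one-line version of the fragment above.

```python
from html.parser import HTMLParser

INLINE_TAGS = {"i", "b"}  # assumed set of character modification tags


class TextInfoParser(HTMLParser):
    """Collects plain text plus (tag, text offset, text length) entries."""

    def __init__(self):
        super().__init__()
        self.parts = []      # pieces of the extracted plain text
        self.length = 0      # number of characters extracted so far
        self.open_tags = []  # stack of (tag, start offset)
        self.text_info = []  # completed TextInfo-like entries

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, start = self.open_tags.pop()
            self.text_info.append((tag, start, self.length - start))

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


parser = TextInfoParser()
parser.feed("<body><i>He</i> met <b>a pretty</b> girl.</body>")
parser.close()
extracted_text = "".join(parser.parts)
```

The tags never reach `extracted_text`; only their offsets and lengths survive, which is what allows the later re-evaluation step to look up the modification applied to a keyword.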
[0038]
When the value of the text offset registered in the TextInfo data 73 is the same, for example, in the case of the following HTML document, the TextInfo data 73 and the character information list 75 as shown in FIG. 7 are generated. As described above, the outermost character modification attribute <B> is not registered, but the innermost character modification attribute <I> is registered.
[0039]
<B><I>A big dog</I></B> run after me.
[0040]
The BlockInfo data 71 and the structured information list 74 thus registered in the structured information storage unit 70 are used by the filtering processing unit 20 and the keyword / important sentence reevaluation unit 50.
For example, the keyword / important sentence re-evaluation unit 50 uses this structured information to raise the importance of a heading sentence or to lower the importance of a sentence that is a simple listing of data.
Further, the TextInfo data 73 and the character information list 75 are used at the time of re-evaluation of a keyword.
When the type of the input document structure is plain text, the above-described BlockInfo data 71 and TextInfo data 73 are not created, and only the extracted text 72 is created.
[0041]
As described above, because the document structuring information of various structured documents is converted into a unified format that does not depend on the type of the document structure, the summary sentence processing can be carried out without depending on the document structure.
Further, even if a structured document having a new specification is defined in the future, a summary sentence of the structured document can be created only by preparing a structured document parser for the document structure.
[0042]
(B) Text filtering
In the various Web pages found on the Web, unnecessary information is scattered around the main text.
When analyzing the structured document, the structured document parser 10 has already deleted most of the tag control information. However, HEAD, TITLE, link information, programs in a script language, and the like still remain at this point.
[0043]
Therefore, in order to delete such unnecessary information, an unnecessary information table is provided for each type of structured document and each kind of document content. The filtering processing unit 20 refers to the unnecessary information table, removes the text deemed unnecessary for extracting keywords and important sentences, and updates the structured information storage unit 70.
However, since link information and the like may carry important meaning, it is not always deleted. Whether to delete such information is decided based on the semantic information surrounding the structured document, and the unnecessary information table is created accordingly.
[0044]
The filtering processing unit 20 determines unnecessary information by checking whether a block attribute registered in the unnecessary information table exists in the BlockInfo data 71 and the structured information list 74 of the structured information storage unit 70.
When a block attribute registered in the unnecessary information table is found, the corresponding entry is deleted from the BlockInfo data 71, the extracted text corresponding to the block is deleted, and the structured information storage unit 70 is updated. At this time, the entries of the TextInfo data 73 that fall within the deleted block are also deleted.
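The filtering step above can be sketched as follows. The sketch assumes that the blocks do not overlap and together cover the whole extracted text, and it omits the matching deletion of TextInfo entries. The names `Block` and `filter_blocks` and the attribute string "link" are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Block:
    attrs: List[str]  # own block attribute first, then parent attributes
    offset: int       # text offset into the extracted text
    length: int       # text length in characters


def filter_blocks(text: str, blocks: List[Block],
                  unnecessary: Set[str]) -> Tuple[str, List[Block]]:
    """Drop blocks whose attribute list contains an unnecessary attribute,
    and rebuild the extracted text and offsets for the remaining blocks."""
    kept_text, kept_blocks, new_offset = [], [], 0
    for block in sorted(blocks, key=lambda b: b.offset):
        if unnecessary.intersection(block.attrs):
            continue  # block matches the unnecessary information table
        kept_text.append(text[block.offset:block.offset + block.length])
        kept_blocks.append(Block(block.attrs, new_offset, block.length))
        new_offset += block.length
    return "".join(kept_text), kept_blocks


text = "HomeAboutSummary of the report."
blocks = [Block(["link"], 0, 4), Block(["link"], 4, 5),
          Block(["paragraph"], 9, 22)]
new_text, new_blocks = filter_blocks(text, blocks, {"link"})
```

Recomputing the offsets while rebuilding the text keeps the remaining entries consistent with the updated extracted text.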
[0045]
As described above, since text that is not deleted by the structured document parser 10 and is unnecessary for extracting keywords and important sentences can be deleted, the extraction accuracy of keywords and important sentences can be further improved.
[0046]
(C) Language processing
The language processing unit 30 and the keyword / important sentence extraction unit 40 can be implemented with existing techniques because they analyze plain text.
The language processing unit 30 is realized by existing morphological analysis, phrase synthesis, and syntax analysis techniques. The extracted text in the structured information storage unit 70 is divided into morphemes, and a part of speech is assigned to each morpheme. Morphemes are combined based on the connection rules between morphemes and assembled into phrases (noun phrases, verb phrases, adverb phrases, conjunction phrases, and so on). The result is subjected to syntactic analysis at the phrase level based on the phrase grammar, and the modification relations between the phrases, the subject, and the object are determined.
[0047]
(D) Keyword / important sentence extraction
The keyword / important sentence extraction unit 40 extracts noun phrases from the phrases obtained by the language processing unit 30 using an existing technique, for example, the method described in Japanese Patent Application Laid-Open No. 9-34905, and extracts keywords by scoring each noun phrase with a weight based on its appearance frequency and a weight based on its degree of modification.
Next, the relevance between sentences is evaluated based on the degree of overlap of the keywords in the sentences, and important sentences are extracted by calculating the importance of each sentence from the strength of its relevance to other sentence groups and the presence or absence of such relations.
[0048]
(E) Re-evaluation of keywords and important sentences
The keyword / important sentence re-evaluation unit 50 re-evaluates the importance of each keyword contained in the important sentences obtained by the keyword / important sentence extraction unit 40 by determining what character modification is applied to the character string of the keyword.
Whether the character string of the keyword has been subjected to character modification can be determined by referring to the text offset and the text length of the TextInfo data in the structured information storage unit 70.
Next, which character modification attributes the character string has can be determined by following the links of the character information list and reading the attribute values.
[0049]
In the re-evaluation of the importance of the keyword, the evaluation score is changed, for example, as follows depending on the type of character modification.
(1) If the character string is emphasized, the evaluation score is raised by 10%.
(2) The evaluation score is raised or lowered by 5% according to the character size.
(3) The evaluation score is raised or lowered by 5% according to the type of character decoration.
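The three rules above might be applied as in the following sketch. The text does not say which sizes or decorations raise the score and which lower it, nor whether the percentages compound, so the mapping below (larger than medium raises, smaller lowers, underline raises, strikethrough lowers, multiplicative adjustment) is purely an assumption. The function name `reevaluate_keyword` is hypothetical.

```python
def reevaluate_keyword(score, emphasized=False, size="medium", decoration=None):
    """Adjust a keyword score according to its character modification.

    The direction of each 5% adjustment is an assumption made for illustration.
    """
    if emphasized:
        score *= 1.10                      # rule (1): emphasis, +10%
    if size in ("large", "maximal"):
        score *= 1.05                      # rule (2): assumed +5% for large text
    elif size in ("small", "minimal"):
        score *= 0.95                      # rule (2): assumed -5% for small text
    if decoration == "underline":
        score *= 1.05                      # rule (3): assumed +5%
    elif decoration == "strikethrough":
        score *= 0.95                      # rule (3): assumed -5%
    return score


plain = reevaluate_keyword(100.0)
bold = reevaluate_keyword(100.0, emphasized=True)
small_struck = reevaluate_keyword(100.0, size="small", decoration="strikethrough")
```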
[0050]
Next, the keyword / important sentence re-evaluation unit 50 recalculates the importance of each important sentence on the basis of the re-evaluated keyword importance, and then re-evaluates the importance of the important sentence according to the type of block to which it belongs.
Which block an important sentence belongs to can be determined by referring to the text offset and the text length of the BlockInfo data in the structured information storage unit 70. Next, which block attributes the block has can be determined by following the links of the structured information list and reading the block attribute values.
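The offset-based lookup described here can be sketched as a simple range test. The names `BlockInfo` and `block_of` and the sample offsets are hypothetical. The sketch assumes a sentence is identified by the offset of its first character in the extracted text.

```python
from typing import List, NamedTuple, Optional


class BlockInfo(NamedTuple):
    attrs: List[str]  # own block attribute first, then parents, inner to outer
    text_offset: int
    text_length: int


def block_of(blocks: List[BlockInfo], sentence_offset: int) -> Optional[BlockInfo]:
    """Return the block whose text range contains the given offset, if any."""
    for block in blocks:
        start = block.text_offset
        if start <= sentence_offset < start + block.text_length:
            return block
    return None


blocks = [
    BlockInfo(["heading"], 0, 12),
    BlockInfo(["paragraph", "heading"], 12, 40),
]
found = block_of(blocks, 20)  # a sentence starting at character 20
```

Once the block is found, its attribute list gives both its own attribute and those of the enclosing blocks.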
[0051]
After the importance of the important sentence has been recalculated from the re-evaluated keywords, its evaluation score is changed, for example, as follows according to the type of the block attribute.
(1) If it is a headline, increase the evaluation score by 20%.
(2) If it is an enumerated statement, the evaluation score is reduced by 20%.
(3) If the sentence is in a figure or table, the evaluation score is reduced by 25%.
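A minimal sketch of these three rules follows. It assumes that the heading and enumeration rules look only at the block's own attribute, while the figure or table rule also applies when an enclosing block has that attribute. The attribute strings and the function name `reevaluate_sentence` are hypothetical.

```python
def reevaluate_sentence(score, block_attrs):
    """Adjust a sentence score from its block attribute list.

    block_attrs holds the block's own attribute first, then its parents.
    """
    own = block_attrs[0] if block_attrs else None
    if "figure" in block_attrs or "table" in block_attrs:
        return score * 0.75   # rule (3): inside a figure or table, -25%
    if own == "heading":
        return score * 1.20   # rule (1): heading sentence, +20%
    if own == "enumeration":
        return score * 0.80   # rule (2): enumerated sentence, -20%
    return score


heading_score = reevaluate_sentence(100.0, ["heading"])
table_score = reevaluate_sentence(100.0, ["paragraph", "table"])
body_score = reevaluate_sentence(100.0, ["paragraph", "heading"])
```

In this sketch a paragraph that merely sits under a heading block keeps its score, since only the heading sentence itself is boosted.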
[0052]
As described above, the keywords and important sentences obtained from the plain text by the existing technology can be re-evaluated using the document structuring information, so that the extraction accuracy can be made higher.
[0053]
(F) Generation and output of a summary sentence
The summary output unit 60 selects important sentences from the re-evaluated important sentences, in descending order of importance, within the summary size determined from the document size, the extraction ratio (for example, the specified ratio of the summary size to the document size), and the extraction unit (for example, the number of characters, words, or sentences). Each selected important sentence and the heading of the block to which it belongs are treated as one set, and these sets form the summary sentence.
When outputting the summary sentence, the important sentences selected for the summary sentence and the character strings corresponding to their headings are distinguished from the rest of the displayed original structured document, for example by being output in a different color (see FIG. 10).
As a result, it is possible to confirm in which context of the original document the summary sentence appears.
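The selection against a summary size can be sketched as follows, taking the number of characters as the extraction unit. The sketch assumes that selection stops at the first sentence that no longer fits and that the chosen sentences are output in document order. Neither detail is stated in the text, and the function name `select_summary` is hypothetical.

```python
def select_summary(sentences, ratio):
    """Pick important sentences within a size budget.

    sentences: list of (text, importance, heading) in document order.
    ratio: summary size divided by document size, in characters.
    Returns (heading, sentence) sets in document order.
    """
    document_size = sum(len(text) for text, _, _ in sentences)
    budget = document_size * ratio
    by_importance = sorted(range(len(sentences)),
                           key=lambda i: sentences[i][1], reverse=True)
    chosen, used = [], 0
    for index in by_importance:
        size = len(sentences[index][0])
        if used + size > budget:
            break  # assumed policy: stop once the budget is exhausted
        chosen.append(index)
        used += size
    return [(sentences[i][2], sentences[i][0]) for i in sorted(chosen)]


sentences = [
    ("aaaaaaaaaa", 0.2, "H1"),
    ("bbbbbbbbbb", 0.9, "H1"),
    ("cccccccccc", 0.5, "H2"),
    ("dddddddddd", 0.1, "H2"),
]
summary = select_summary(sentences, 0.5)
```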
[0054]
When the summary sentence is output separately from the display of the original structured document, a (heading, important sentence) pair is treated as one set, and as many sets as fit within the summary size described above are generated to form the summary of the document.
At this time, if there is no heading (whether a heading exists is known from the block attribute), a combination of keywords is used instead of the heading. As keywords that substitute for a heading, keywords having a high co-occurrence rate with the paired important sentence are selected from the keyword set obtained by the keyword / important sentence extraction unit 40.
A keyword having a high co-occurrence rate with an important sentence is a keyword that, among the keywords contained in the important sentence, appears in that sentence with high frequency.
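The choice of substitute keywords can be sketched as a frequency count restricted to the keyword set. Splitting the sentence on whitespace and taking the top two keywords are simplifications assumed for illustration, and the function name `substitute_heading` is hypothetical.

```python
from collections import Counter


def substitute_heading(sentence_words, keyword_set, top_n=2):
    """Return the keywords that appear most often in the important sentence."""
    counts = Counter(word for word in sentence_words if word in keyword_set)
    return [word for word, _ in counts.most_common(top_n)]


words = "the parser stores parser output in the storage unit".split()
keywords = {"parser", "storage", "summary"}
heading_substitute = substitute_heading(words, keywords)
```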
[0055]
As a result, by combining the extracted keywords and important sentences into a summary sentence instead of treating them separately, the contents of the original document can be expressed more accurately.
[0056]
(G) Processing procedure of structured document summarization device
Next, a summary sentence processing procedure when the structured document summarizing apparatus of the present invention is incorporated in a Web browser (for example, as a plug-in) will be described with reference to the flowchart of FIG. 8.
First, the user loads a structured document using a Web browser (step S1).
For example, a structured document read by a Web browser is displayed as shown in FIG. 9. In the figure, the "important sentence button" is used to extract important sentences from the displayed structured document and display them as a summary sentence. The "keyword button" is used to display important keywords extracted from the displayed structured document.
[0057]
Next, when the user clicks the "important sentence button" to display the summary sentence, the structured document parser 10 corresponding to the type of the document structure is activated and analyzes the read structured document. The read structured document is separated into structured information and plain text, and the separation result is stored in the structured information storage unit 70 in the data structure shown in FIG. 2 (step S2).
[0058]
For example, the document structure of the read document is determined, and the matching parser is activated automatically: an HTML parser for an HTML document, an SGML parser for an SGML document, and a word processing document parser for a word processing document.
In the case of the document displayed in FIG. 9, the character string of the heading is bold and has a larger size than the font of the main body, and thus is an example of the simplest structured text. The data stored in the structured information storage unit 70 is output to a unified data structure regardless of the type of the document structure.
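The selection of a parser per document type, with every parser writing to the same unified structure, might be organized as in the following sketch. The registry `PARSERS`, the type labels, and the dictionary layout of the result are assumptions. The HTML parser here only strips tags with Python's standard `html.parser` module and omits the block bookkeeping.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and discards tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse_html(document):
    collector = _TextOnly()
    collector.feed(document)
    collector.close()
    # Block and character entries are omitted in this sketch.
    return {"blocks": [], "text": "".join(collector.parts)}


def parse_plain(document):
    # For plain text only the extracted text is created.
    return {"blocks": [], "text": document}


# Hypothetical registry: one parser per type of document structure.
PARSERS = {"html": parse_html, "plain": parse_plain}


def run_parser(document, document_type):
    """Dispatch to the parser registered for the given document type."""
    try:
        parser = PARSERS[document_type]
    except KeyError:
        raise ValueError(f"no parser registered for {document_type!r}")
    return parser(document)


result = run_parser("<p>Hello</p>", "html")
```

Supporting a new document structure then amounts to registering one more function that returns the same dictionary layout.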
[0059]
Filtering is then performed on the extracted plain text in the data stored in the structured information storage unit 70 (step S3).
Since the control information embedded in the structured document has already been removed from the extracted plain text, the filtering here means deleting, based on the structured information (the BlockInfo data in FIG. 2), text in the plain text that has no meaning for extracting keywords or important sentences.
For example, in the case of an HTML document, this filtering is to delete link information and the like.
[0060]
Next, the filtered plain text is input and subjected to basic language processing such as morphological analysis, phrase synthesis, and syntax analysis (step S4).
For the synthesized phrase (mainly a noun phrase), a keyword and an important sentence based on the keyword are extracted by a method based on JP-A-9-34905 (step S5).
[0061]
For each obtained keyword, the importance of the keyword is re-evaluated based on the character modification applied to its character string, and the importance of each important sentence is recalculated based on the re-evaluated keywords. The important sentence is then re-evaluated based on the block attribute (structured information) of the block to which it belongs (step S6).
[0062]
Important sentences are then selected from the re-evaluated important sentences, in descending order of importance, within the summary size determined from the document size, the extraction ratio (for example, the specified ratio of the summary size to the document size), and the extraction unit (for example, the number of characters, words, or sentences). The selected important sentences and their related headings are treated as sets forming the document summary, and are displayed on the Web browser so as to be distinguished from the other text, for example in a different color (step S7).
In the case of the structured document displayed in FIG. 9, the important sentences selected for the summary sentence are highlighted as shown in FIG. 10.
[0063]
When the summary sentence is displayed in a window separate from the display of the original structured document, a (heading, important sentence) pair is treated as one set, and as many sets as fit within the summary size are generated and displayed as the summary sentence.
At this time, if there is no heading (whether a heading exists is known from the block attribute), a combination of keywords is used instead of the heading. As keywords that substitute for a heading, keywords having a high appearance frequency in the paired important sentence are selected from the keyword set contained in that important sentence.
[0064]
(H) Program and recording medium
The present invention is not limited to only the embodiments described above. Each function of the structured document summarizing apparatus according to the above-described embodiment may be programmed and written in advance on a recording medium such as a CD-ROM. It goes without saying that the object of the present invention is achieved by mounting the CD-ROM or the like in a medium drive, such as a CD-ROM drive, installed in a computer, installing the program in a memory or a storage device of the computer, and executing the program.
In this case, the program itself read from the recording medium implements the functions of the above-described embodiment, and the program and the recording medium on which the program is recorded also constitute the present invention.
[0065]
In addition, as a recording medium for storing the program, a semiconductor medium (for example, ROM, nonvolatile memory, etc.), an optical medium (for example, DVD, MO, MD, CD, etc.), a magnetic medium (for example, magnetic tape, flexible disk, etc.) And so on.
[0066]
Further, the present invention covers not only the case where the functions of the above-described embodiment are realized by executing the loaded program, but also the case where those functions are realized by processing performed in cooperation with an operating system or another application program based on instructions of the program.
[0067]
When distributing the program to the market, the program can be distributed on a portable recording medium storing the program, or it can be stored in a storage device of a server computer connected via a communication network such as the Internet and transferred to other computers via the communication network. In this case, the storage device of the server computer is also included in the recording medium of the present invention.
[0068]
In this case, the computer installs the program from the portable recording medium, or the transferred program, in a storage device connected to the computer and executes the installed program, whereby the functions of the above-described embodiment are achieved.
[0069]
It should be noted that the present invention is not limited to the above-described embodiment, and it is needless to say that various changes and modifications can be made without departing from the gist of the present invention.
[0070]
[Effects of the invention]
As described above, according to the present invention, a highly accurate summary sentence can be created by effectively using the structured information of a structured document.
In addition, since the created abstract is displayed together with the original document, it is possible to confirm in what context of the original document the abstract appears.
[0071]
In addition, since structured documents having different structures are separated into document structured information and text information (plain text) and converted into a unified format independent of the document structure, the summary sentence can be processed uniformly.
Further, by deleting, based on the separated document structured information, text that is meaningless for extracting keywords and important sentences from the separated text information, the accuracy of keyword and important sentence extraction is increased, and a more accurate summary sentence can be created.
[0072]
In addition, since text information is separated from structured documents, keywords and important sentences can be extracted using existing keyword extraction technology and abstract sentence extraction technology that targets the separated text information (plain text). The extraction accuracy can be further improved by re-evaluating the keyword or important sentence obtained here based on the document structuring information.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of a structured document summarizing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a data structure of a structured information storage unit.
FIG. 3 is an example of a block attribute value.
FIG. 4 is an example of a structured document including a block having a nest structure.
FIG. 5 is an example of a case where the block structure of the structured document of FIG. 4 is stored in the data structure of FIG. 2;
FIG. 6 is an example of a case where a character string modification attribute for a character string subjected to character modification is stored in the data structure of FIG. 2;
FIG. 7 is an example of a case where a character string modification attribute for a character string subjected to character modification is stored in the data structure of FIG. 2;
FIG. 8 is a flowchart showing a summary sentence processing procedure when the structured document summarization device of the present invention is incorporated in a Web browser.
FIG. 9 is an example in which a structured document is displayed on a Web browser.
FIG. 10 is an example in which a summary sentence is extracted from the document shown in FIG. 9 and the summary sentence is displayed together with the original structured document.
[Explanation of symbols]
10 ... Structured document parser, 20 ... Filtering processing unit, 30 ... Language processing unit, 40 ... Keyword / important sentence extraction unit, 50 ... Keyword / important sentence re-evaluation unit, 60 ... Summary output unit, 70 ... Structured information storage unit, 71 ... BlockInfo data, 72 ... Extracted text, 73 ... TextInfo data, 74 ... Structured information list, 75 ... Character information list.

Claims

1. A structured document summarizing apparatus comprising: an input unit for inputting a structured document; a structured document parser that analyzes the input structured document and separates it into structured information and text information suitable for keyword extraction and important sentence extraction; a keyword / important sentence extraction unit that extracts keywords and important sentences only from the separated text information; and a summary output unit that generates a summary sentence from the keywords and important sentences.

2. The structured document summarizing apparatus according to claim 1, wherein a structured document parser is prepared for each different type of document structure, and an appropriate structured document parser is selected and executed according to the type of the input structured document.

3. The structured document summarizing apparatus according to claim 1, further comprising a filtering unit that refers to the structured information and deletes, from the text information, text that is meaningless for extracting keywords or important sentences.

4. The structured document summarizing apparatus according to claim 1, wherein the keyword / important sentence extracting unit re-evaluates the extracted keywords and important sentences with reference to the structured information and selects keywords and important sentences based on the re-evaluation.

5. The structured document summarizing apparatus according to claim 1, wherein the summary output unit outputs, as the summary sentence, the extracted important sentence and the heading of the structure to which the important sentence belongs.

6. The structured document summarizing apparatus according to claim 5, wherein the summary output unit outputs the character string corresponding to the important sentence in the input structured document so that it is distinguished from other text.

7. The structured document summarizing apparatus according to claim 5, wherein, when the heading does not exist, a keyword having a high co-occurrence rate with the extracted important sentence is selected in place of the heading.

8. A program for causing a computer to execute the functions of the structured document summarizing apparatus according to any one of claims 1 to 7.

9. A computer-readable recording medium on which the program according to claim 8 is recorded.