JP3652086B2

JP3652086B2 - Speed reading support device

Info

Publication number: JP3652086B2
Application number: JP28930597A
Authority: JP
Inventors: 忠司野本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-10-22
Filing date: 1997-10-22
Publication date: 2005-05-25
Anticipated expiration: 2017-10-22
Also published as: JPH11126204A

Description

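[0001]
[Technical Field of the Invention]
The present invention relates to Japanese document processing in general, and is used for speed-reading support for electronic documents and for interfaces such as those of information retrieval systems.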
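[0002]
[Prior Art]
Conventionally, when a computer selects candidate summary sentences for a text S, it generally computes, for each sentence s in S, the likelihood that s is a summary sentence, using either a probability P or a measure called TFIDF, ranks the sentences by that value, and takes the top-ranked sentences as summary candidates. For example, Julian Kupiec, Jan Pedersen, and Francine Chen, 1995, "A Trainable Document Summarizer", in Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73, Seattle, USA, uses (Equation 1) to compute the probability P that a sentence s in the text S is selected as a summary sentence.
P(s ∈ X | F1, F2, F3, ..., Fk)   (Equation 1)
Here, X is the set of summary sentences, and F1, F2, ..., Fk are features such as sentence length, presence or absence of cue words, and position within the paragraph. The importance of each sentence as a summary sentence is determined from (Equation 1), and the sentences whose values fall in the top 25 percent are presented to the user as the summary of the text.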
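[0003]
On the other hand, Klaus Zechner, 1996, "Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences", in Proceedings of the 16th International Conference on Computational Linguistics, pages 986-989, Copenhagen, Denmark, determines the importance of a sentence s with the TFIDF measure of (Equation 2).
TFIDF(w, s) = TF(w, s) × log(N / n(w))   (Equation 2)
Here, w is a word that appears in a particular sentence s, TF(w, s) is the frequency of w in that sentence, N is the total number of sentences in the text S, and n(w) is the number of sentences in which w appears. The importance Q(s) of the sentence s is defined by (Equation 3).
Q(s) = Σ TFIDF(w, s)   (Equation 3)
That is, the TFIDF value is obtained for every word that appears in s, and the sum of these values is taken as the importance of s. The sentences with the highest Q(s) values are then presented to the user as summary candidates. Because Zechner's method cannot be tuned for a specific field, its summarization accuracy is generally poor.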
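[0004]
[Problems to Be Solved by the Invention]
Setting aside the question of how valid the selected sentences are as a summary, these conventional methods display the selected summary sentences on their own. Even when several top-ranked sentences are shown as the summary, the selected sentences are not inherently related to one another, so they connect poorly with one another and become very hard to read. Furthermore, because the selected sentences are merely listed, the gist of the original text is hard to grasp, which hinders the user in judging whether the text is important to them.
[0005]
An object of the present invention is to solve the above problems of the conventional methods.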
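[0006]
[Means for Solving the Problems]
Accordingly, in the present invention, each sentence in an input text is subjected to feature analysis according to predetermined rules to decide whether or not it is a summary sentence. A summary sentence is presented to the user in a highlight color and any other sentence in a background color, and the first sentence of each paragraph of the input text is presented in a highlight color different from that of the summary sentences.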
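[0007]
[Embodiments of the Invention]
First, the well-known decision tree construction method called C4.5 is used to select summary sentences. Under this method, each sentence is encoded on the basis of several features. In the present invention, each sentence s in a text S is first encoded on the basis of the following features: (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph.
"Text type" indicates which type the text belongs to, such as news report, editorial, or essay.
[0008]
"Position in the text" expresses as a ratio where the sentence appears within the whole text. For example, if the text S contains 10 sentences and the sentence in question is the first one, its position is expressed as 0/10 = 0.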
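[0009]
"Similarity to the headline" is determined by (Equation 4) below.
SIMM(t, s) = Σ NF(w, s) × IDF(w)   (Equation 4)
Here, t is the headline of the text and s is a sentence. For each noun w that appears in t, its NF value and IDF value are obtained, and the sum of their products is taken as the similarity to the headline. NF(w, s) is defined by (Equation 5).
NF(w, s) = F(w, s) / MAX_F(s)   (Equation 5)
Here, F(w, s) is the frequency of w in s, and MAX_F(s) is the frequency of the most frequent noun among the nouns appearing in s. IDF(w) is defined by (Equation 6).
IDF(w) = log(N / DF(w)) / log N   (Equation 6)
Here, DF(w) is the number of sentences in which w appears, and N is the total number of sentences in the text S.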
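[0010]
"In-text TFIDF" is the value determined by (Equation 7).
D(s) = Σ NF(w, s) × IDF(w)   (Equation 7)
Here, w is a noun that appears in the sentence s. NF(w, s) and IDF(w) follow the definitions above.
[0011]
"Presence or absence of an attitude expression" records whether the end of the sentence contains an expression that indicates the author's attitude. The expressions considered are "〜重要だ" ("... is important"), "〜必要だ" ("... is necessary"), and sentence-final particles such as "〜か", "〜よ", and "〜ね".
[0012]
"Number of characters in the sentence" is the number of characters in the sentence s.
[0013]
"Position in the paragraph" expresses the position of the sentence s within its paragraph in the same way as "position in the text" above: the number of sentences preceding s in the paragraph divided by the total number of sentences in the paragraph.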
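[0014]
The decision tree is constructed by characterizing each sentence of a text in terms of the above attributes and then training on field-specific data annotated with summary judgments. The construction method follows Quinlan's "C4.5". C4.5 is a well-known method, but it is outlined below.
[0015]
The problem C4.5 addresses is how to model (generalize) the classification of database entries. For example, suppose a company's hiring records are stored in a database like the following.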
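[0016]
Gender  Age  Marital status  Education    Car  Hired
Female  23   Married         High school  Yes  ○
Male    30   Single          University   No   ○
Female  45   Married         High school  Yes  ○
Male    60   Married         University   No   ×
Modeling the classification means finding, from this data, the pattern of conditions that separates hired from not hired, so that one can predict for any person whether this company would hire them. In the database above, the "Hired" column is called the class and the other columns are called attributes. Each entry is called a case. C4.5 examines the attribute information of the cases, groups together cases with similar attribute values, and classifies them.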
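[0017]
For the example above, a classification model such as the following is possible.
Gender?
  Female → hired
  Male → Driver's license?
    Yes → hired
    No → not hired
In other words, the generalizations "if female, always hired" and "if male, hired if the person holds a driver's license" are possible. In practice the question arises of which attribute to use as the branching condition; C4.5 selects attributes using a statistical measure called the gain ratio.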
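[0018]
Next, the resulting decision tree is used to perform the speed-reading support operation. The operation follows these steps.
(1) The text S to be speed-read is called up on the screen. (2) Using the decision tree, each displayed sentence s is classified as a summary sentence or not. (3) Sentences classified as summary sentences are displayed in a highlight color, and all other sentences in a color different from the highlight color (called the background color). (4) Finally, the first sentence of each paragraph is displayed in a highlight color.
[0019]
As described above, when a decision tree is constructed for each field, summary sentences specialized for a particular field can be produced. Presenting the summary sentences in the highlight color and the other sentences in the background color gives the displayed text visual contrast. Because the summary sentences are displayed in place within the body, the context before and after each summary sentence is preserved and can be consulted immediately when needed, which makes the summary easier to read. Furthermore, displaying the first sentence of each paragraph together with the summary makes the outline of the content understandable. More specific examples are described below with reference to the drawings.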
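[0020]
Example 1
FIG. 1 is a block diagram showing the concept of the data processing in the speed-reading support method according to the present invention.
[0021]
In FIG. 1, 1 is an input step, which takes in the document (article) to be speed-read, either directly or from a database in which it has been stored in advance. Here, the article shown in FIG. 2 is assumed to have been input. 2 is a genre information acquisition step, which determines and outputs the genre that matches the content of the input article. The genre is determined as follows. First, the text of the input article is searched for a keyword that indicates the genre. If no such keyword is found, the user is asked to enter the genre information. In the article of FIG. 2, however, the headline contains the keyword "(editorial)", so the genre of the article is determined to be editorial.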
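[0022]
Next, the acquired genre information and the article are sent to decision tree selection step 3. Here, based on the genre information, the decision tree database that corresponds to the genre of the article is selected from among those prepared in advance. The decision trees are constructed with C4.5 as described above. Since the genre of the article is editorial, the decision tree database for editorials is selected. In this description, the decision tree shown in FIG. 4 is assumed to be selected.
The process then proceeds to feature extraction step 4. Here, each sentence in the body of the article, excluding the headline, is subjected to morphological analysis and then to feature extraction. The morphological analysis follows, for example, the procedure of [Sakurai et al., "Design and Evaluation of the Morphological Analysis Program ANIMA", Proceedings of the 54th National Convention of the Information Processing Society of Japan, 1997]. Seven features are extracted: (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF value, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph.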
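[0023]
Extraction proceeds as follows. Let s be any sentence in the body. For newspaper articles provided electronically, classification information such as general article, essay, or editorial is usually attached to the text S. The text type of sentence s follows that classification information. If there is no classification information, the user distinguishes among general article, essay, and editorial to determine the type. The position in the text is the proportion of the whole text occupied by the sentences that appear from the beginning of the body up to just before s, that is, D(s)/N, where D(s) is the number of sentences from the beginning of the body up to just before s and N is the total number of sentences in the text S. For the similarity to the headline, letting T be the headline of the text, SIMM(T, s) is computed according to (Equation 4) above and its value is used as the similarity. The in-text TFIDF is obtained by computing NF(w, s) × IDF(w) for each noun w appearing in s and summing the results. (The nouns w are extracted by morphological analysis.) The presence or absence of an attitude expression is decided by whether one of the specific expressions "〜重要だ" ("... is important"), "〜必要だ" ("... is necessary"), "〜か", "〜よ", "〜ね", and so on appears in s (for an inflected form, its dictionary form is used). The value is 1 if there is no such expression, 2 for an attitude predicate such as "重要だ" or "必要だ", and 3 for a sentence-final particle such as "か", "よ", or "ね". The number of characters in the sentence is the number of characters in s. The position in the paragraph is given as PD(s)/N(P), where PD(s) is the number of sentences that appear from the beginning of the paragraph up to just before s and N(P) is the total number of sentences in the paragraph.
[0024]
FIG. 3 shows the result of applying the above feature extraction procedure to each sentence of the article in FIG. 2. In this example there are seven sentences excluding the headline: sentence 1 is the first sentence of the body, sentence 2 the second, sentence 3 the third, and so on up to sentence 7, the seventh. Since every sentence of the body is part of an editorial, the text type is "editorial" for all of them. The seven extracted features, (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF value, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph, are as shown in the figure.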
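[0025]
In the next step 5, each sentence is judged to be a summary sentence or not on the basis of the features extracted from it and the selected decision tree. The actual judgment is traced in detail below for sentences 1 to 7. The decision tree is the one shown in FIG. 4.
[0026]
For sentence 1, the similarity to the headline is 0.679, so it first proceeds to N10. The similarity is also 1.181 or less, so it passes through N12. Next, the TFIDF is 9.449, so it passes through N14. Next, the number of characters is 41, so it passes through N16. Next, the TFIDF is 9.449, so it passes through N18 and is finally judged to be a non-summary sentence.
[0027]
For sentence 2, the similarity to the headline is 0.263, so it proceeds to N1. However, its position in the text is 0, so it passes through N2 and is judged to be a non-summary sentence.
[0028]
For sentence 3, the similarity to the headline is 0.762, so it passes through N10 and then N12. The TFIDF value is 4.893, so it passes through N14. At the next step the number of characters is 70, so it passes through N15 and is judged to be a summary sentence.
[0029]
For sentence 4, the similarity to the headline is 0.263, so it passes through N1. However, its position in the text is 0.071, so it passes through N2 and is therefore judged to be a non-summary sentence.
[0030]
For sentence 5, the similarity to the headline is also 0.263, so it passes through N1. Its position in the text is 0.095, so, like sentence 4, it passes through N2 and is judged to be a non-summary sentence.
[0031]
For sentence 6, the similarity to the headline is 0, so it passes through N1. Its position in the text is 0.119, so, like sentences 4 and 5, it passes through N2 and is judged to be a non-summary sentence.
[0032]
For sentence 7, the similarity to the headline is also 0, so it passes through N1, and its position in the text is 0.143, so, like sentences 4 to 6, it passes through N2 and is judged to be a non-summary sentence.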
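[0033]
In the next step 6, according to the above judgment, sentences judged to be summary sentences are displayed in the highlight color and sentences judged to be non-summary sentences in the background color. In addition, to give the display visual contrast, the first sentence of each paragraph of the article is highlighted in a highlight color different from that of the summary sentences, and the speed-reading support processing ends.
[0034]
For texts other than editorials, such as essays and news reports, speed-reading support can likewise be provided by referring to the corresponding decision tree and applying the same processing as above.
[0035]
FIG. 5 shows a concrete processing flow for the processing described above. In the example of FIG. 5, the text type information is acquired only once for the text that is the target of speed-reading support. Feature extraction, on the other hand, is performed sentence by sentence: each sentence of the target text is registered as an unprocessed sentence, the sentences are processed one at a time, and the processing ends when no unprocessed sentence remains. Since the flow of FIG. 5 is easy to follow in light of the preceding description, a description keyed to reference numerals in the figure is omitted. The feature table shown in FIG. 3 collects, for the purpose of explanation, the features extracted by the processing flow of FIG. 5.
[0036]
FIG. 6 shows how the article is displayed once the above judgments of summary and non-summary sentences have been applied. Because the figure cannot show color, sentences displayed in the highlight color are marked with a solid underline, and the first sentence of each paragraph of the article is marked with a dotted underline.
[0037]
In the above example the article has a headline, which made genre acquisition and similarity evaluation very easy. When there is no headline, the headline similarity data in FIG. 3 is absent and the paths N11 and N12 in FIG. 4 disappear, but this causes no substantial problem.
[0038]
Naturally, when an article is too long to fit on one screen, its content is viewed by scrolling.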
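[0039]
Example 2
Next, an example of a newspaper speed-reading support device that combines article search support with the speed-reading support method described above is explained.
[0040]
FIG. 7 outlines the flow of processing for this purpose. 71 is a search condition input step, in which the user enters search conditions for the article they want to read. The search conditions can be set freely, but keywords and the like are common and easy to use. 72 is an article search step, which takes in article information from an arbitrary database and searches for articles that match the search conditions. 73 is a search result display step, which displays the matching articles together with, for example, their degree of match with the conditions. 74 is an article designation step, in which the user selects the article they want to read, referring for example to the degree of match. 75 is a speed-reading support instruction step, in which the user requests speed-reading support for the selected article. Steps 76 to 80 are, as is clear by comparison with FIG. 1, the speed-reading support steps, which process the article that the user selected in article designation step 74.
[0041]
FIG. 8 shows an example of a hardware configuration for executing this processing. In FIG. 8, 801 is an output means, here meaning a printer or the like. 802 is a CPU, which executes processing according to the programs described later. 803 is an input means, for example a keyboard and a mouse. 804 is a system bus. 812 is a display means, that is, a display such as a CRT. 809 is a program holding means, for which a hard disk is used, for example. The program holding means 809 stores a search and speed-reading support interface operating program 805, a morphological analysis program 806, a decision tree generation program 807, a feature extraction program 816, a decision tree operating program 808, a search program 809, a similarity calculation program 8091, a document ranking program 8092, an important sentence display program 810, and a synopsis display program 811. 813 is a work area in memory. 814 is a decision tree database. 815 is a document database, which stores the documents to be searched, for example newspaper articles. The means and databases are connected via the system bus.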
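[0042]
First, when the user wants to read newspaper articles of particular interest, the user starts the device, runs the search and speed-reading support interface operating program 805, and enters search keywords through the input screen of the search interface on the display means 812. This corresponds to step 71 in FIG. 7. The search program 809 is then executed on the entered keywords (FIG. 7, step 72).
[0043]
The search program calculates the similarity D between each text stored in the document database 815 and the entered keywords according to (Equation 8).
[0044]
D(q, d) = Σ TF(w, d) × IDF(w)   (Equation 8)
Here, q is the list of keywords, and d is a document, represented as the list of nouns that appear in it (with duplicates removed). w is an element (word) of the list q. TF(w, d) is the frequency of w in the document d, and IDF(w) is computed over all the articles stored in the document database 815 according to (Equation 9).
[0045]
IDF(w) = log(N / DF(w))   (Equation 9)
Here, N is the total number of articles in the database, and DF(w) is the number of articles in which the word w appears at least once. Nouns are extracted from the documents by running the morphological analysis program 806; specifically, the morphological analysis program of [Sakurai et al., "Design and Evaluation of the Morphological Analysis Program ANIMA", 1997] mentioned above is used. The similarity D is computed in this way for every article in the document database 815, the five articles with the highest values are selected, and the selection is presented to the user through the output screen of the search interface on the display means 812 (FIG. 7, step 73). FIG. 9 shows an example of the output screen used here. In FIG. 9, 91 is a headline and 92 is a switch for turning on the display of the article body; clicking it displays the full content of the corresponding article from the document database on the display means (FIG. 7, step 74).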
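[0046]
FIG. 10 shows an example of the screen when the user turns on the switch 92. In this example the content of the article is the same as that described for FIG. 2, but in FIG. 10 a selection switch 1001, which chooses whether to perform speed-reading support, is presented to the user along with the body (FIG. 7, step 75). When the user selects speed-reading support, the morphological analysis program 806 is executed and performs morphological analysis on each sentence of the displayed text; the genre information is acquired (FIG. 7, step 76), the decision tree is selected (FIG. 7, step 77), and the result is passed to the feature extraction program 816. The feature extraction program 816 extracts from the morphological analysis data the information needed to decide the important sentences (FIG. 7, step 78) and passes it to the decision tree operating program 808. The decision tree operating program 808 accesses the decision tree database 814 prepared in advance and, based on the information extracted by the feature extraction program 816, decides whether each sentence is a summary sentence (FIG. 7, step 79). If the sentence is a summary sentence, the important sentence display program 810 is executed and the sentence is shown on the display means 812 in the highlight color; otherwise it is shown in the background color, and processing moves to the next sentence. After the summary sentence judgment is finished, the synopsis display program 811 is executed and displays the first sentence of each paragraph of the displayed text in a highlight color different from that of the important sentences (FIG. 7, step 80).
[0047]
Thus, according to this example, searching for articles that match search conditions such as keywords and providing speed-reading support can be handled as a single flow.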
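[0048]
Example 3
FIG. 11 is a configuration diagram of an example in which the speed-reading support method described in Example 2 is realized as a network-based document search support service. In FIG. 11, the device that provides the service (the server) and the devices that receive it (client 1 and client 2) are connected via an information communication network. The server is therefore the configuration described for FIG. 8 with a communication means 1101 and an interface 1102 to the information communication network added to the system bus 804. To keep the figure simple, the other devices of the server are not shown. In client 1, 1121 is an output means, here meaning a printer or the like. 1122 is a CPU, which executes processing according to the programs described later. 1123 is an input means, for example a keyboard and a mouse. 1112 is a system bus. 1113 is a display means, that is, a display such as a CRT. 1114 is a holding means for the search and speed-reading support interface program, for which a hard disk is used, for example. 1115 is a work area in memory. 1116 is a communication means. 1111 is an interface that connects client 1 to the server. Client 2 has the same configuration in this example, so only client 1 is illustrated in detail, and the figure is simplified by showing only a bus 1132 and an interface 1131 for client 2.
[0049]
The user first enters, through the input means 1123, a command requesting to start using the document search service. The communication means 1116 then conveys the request command to the server over the communication network. On receiving the command, the server transmits the search and speed-reading support interface operating program 805, stored in its program holding means, to client 1 over the communication network. On receiving the program 805, client 1 holds it in the search and speed-reading support interface program holding means 1114 and runs it using its own computing resources (CPU 1122 and work area 1115). A screen requesting search keywords then appears, as described for FIG. 8. The user enters search keywords through the input means 1123. The entered keywords are transmitted to the server by the communication means 1116. The server then runs the search program 805 and starts a search based on the transmitted keywords. Next, it transmits the search results to client 1 over the communication network. Client 1 presents the results to the user using the interface operating program 805 that was transmitted and held. The display at this point is the same as that shown in FIG. 9. When the user selects the body display button here, a body display request is conveyed to the server over the network, the server sends the corresponding document to client 1 in response, and the interface operating program 805 running on client 1 displays the received document on the screen. The display at this point is the same as that shown in FIG. 10. A copy of the sent document is, however, kept in the work area on the server side. When the user further turns on the speed-reading support switch 1001, the request is sent to the server over the network, and the server, on receiving it, passes the sent document remaining in its work area to the speed-reading support program. Following the same procedure as in Example 2, the speed-reading support program performs, for each sentence in the document, morphological analysis, feature extraction, and important sentence judgment by the decision tree operating program, and finally extracts the synopsis. The server sends information on which sentence is to be displayed in which color to client 1 over the network. This enables the interface operating program on client 1 to adjust the display color of each sentence in the document.
[0050]
The operation is the same for client 2, so its description is omitted.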
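[0051]
Example 4
In Example 3, the server transmits the search and speed-reading support interface operating program 805 to client 1 over the communication network in response to a search request from the client. Instead, the program can be distributed to the clients in advance and operated in the same way as in Example 3. In this case too, as described for FIG. 11, when the user enters a document search request command with the input means 1123, the interface operating program stored in the program holding means 1114 on client 1 starts, and the subsequent operation proceeds as in Example 3, enabling search and speed-reading support from a remote location.
[0052]
[Effects of the Invention]
As is clear from the above description, according to the present invention the summary sentences are displayed in place within the body, so the context before and after each summary sentence is preserved and the summary is easier to read. Furthermore, displaying the first sentence of each paragraph together with the summary makes the outline of the content understandable.
[0053]
When decision trees for individual genres are stored in advance and referred to, summary sentence judgment specialized for a particular genre can be performed very effectively.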
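[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the concept of the data processing in the speed-reading support method according to the present invention.
[FIG. 2] A diagram showing an example of an article used as the target of speed-reading support.
[FIG. 3] A diagram showing the result of applying the feature extraction procedure of the example to each sentence of the article in FIG. 2.
[FIG. 4] A diagram showing an example of a decision tree.
[FIG. 5] A flowchart that makes the data processing concept of FIG. 1 concrete.
[FIG. 6] A diagram showing a display example of the result of speed-reading support for the article in FIG. 2.
[FIG. 7] A block diagram showing the concept of the data processing when the document speed-reading support method according to the present invention is applied to a document search support device.
[FIG. 8] A diagram showing an example of a device configuration that realizes the processing shown in FIG. 7.
[FIG. 9] A diagram showing a concrete example of how document search results are displayed.
[FIG. 10] A diagram showing a concrete example of the display when the body of a particular document is shown in response to the document search results.
[FIG. 11] A diagram showing a concrete example of an embodiment for receiving, from a remote location, a document search support service to which the speed-reading support method according to the present invention is applied.
[0001]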
BACKGROUND OF THE INVENTION
The present invention relates to Japanese document processing in general, and is used for speed-reading support for electronic documents and for interfaces such as those of information retrieval systems.
[0002]
[Prior art]
Conventionally, when a computer selects candidate summary sentences for a text S, it generally computes, for each sentence s in S, the likelihood that s is a summary sentence, using either a probability P or a measure called TFIDF, ranks the sentences by that value, and takes the top-ranked sentences as summary candidates. For example, Julian Kupiec, Jan Pedersen, and Francine Chen, 1995, "A Trainable Document Summarizer", in Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73, Seattle, USA, uses (Equation 1) to compute the probability P that a sentence s in the text S is selected as a summary sentence.
P (s ∈ X | F1, F2, F3, ..., Fk) (Equation 1)
Here, X is the set of summary sentences, and F1, F2, ..., Fk are features such as sentence length, presence or absence of cue words, and position within the paragraph. The importance of each sentence as a summary sentence is determined from (Equation 1), and the sentences whose values fall in the top 25 percent are presented to the user as the summary of the text.
[0003]
On the other hand, Klaus Zechner, 1996, "Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences", in Proceedings of the 16th International Conference on Computational Linguistics, pages 986-989, Copenhagen, Denmark, determines the importance of a sentence s with the TFIDF measure of (Equation 2).
TFIDF (w, s) = TF (w, s) × log (N / n (w)) (Equation 2)
Here, w is a word that appears in a particular sentence s, TF (w, s) is the frequency of w in that sentence, N is the total number of sentences in the text S, and n (w) is the number of sentences in which w appears. The importance Q (s) of the sentence s is defined by (Equation 3).
Q (s) = Σ TFIDF (w, s) (Equation 3)
That is, for all words appearing in the sentence s, the TFIDF values are obtained and the sum is taken as the importance of the sentence s. Then, a sentence having a higher Q (s) value is presented to the user as a summary candidate. The Zechner method described above generally has poor summarization accuracy because it cannot be tuned specifically for each field.
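As a rough illustration of (Equation 2) and (Equation 3), the following Python sketch scores each sentence of a small text. It assumes the sentences have already been split into word lists; the function name `sentence_importance` and the sample data are hypothetical and do not come from Zechner's paper or from this patent.

```python
import math
from collections import Counter

def sentence_importance(sentences):
    """Return Q(s) for each tokenized sentence: the sum of TFIDF(w, s) over its words."""
    n_total = len(sentences)  # N: number of sentences in the text
    # n(w): number of sentences that contain the word w at least once
    sent_freq = Counter(w for s in sentences for w in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)  # TF(w, s): frequency of w within this sentence
        scores.append(sum(tf[w] * math.log(n_total / sent_freq[w]) for w in tf))
    return scores

# Hypothetical three-sentence text, already tokenized.
sentences = [
    ["tax", "reform", "tax"],
    ["reform", "debate"],
    ["weather"],
]
scores = sentence_importance(sentences)
```

Sorting the sentences by these scores and keeping the top few gives the kind of candidate list described above.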
[0004]
[Problems to be solved by the invention]
Setting aside the question of how valid the selected sentences are as a summary, these conventional methods display the selected summary sentences on their own. Even when several top-ranked sentences are shown as the summary, the selected sentences are not inherently related to one another, so they connect poorly with one another and become very hard to read. Furthermore, because the selected sentences are merely listed, the gist of the original text is hard to grasp, which hinders the user in judging whether the text is important to them.
[0005]
An object of the present invention is to solve the problems of the conventional methods described above.
[0006]
[Means for Solving the Problems]
Accordingly, in the present invention, each sentence in an input text is subjected to feature analysis according to predetermined rules to decide whether or not it is a summary sentence. A summary sentence is presented to the user in a highlight color and any other sentence in a background color, and the first sentence of each paragraph of the input text is presented in a highlight color different from that of the summary sentences.
[0007]
DETAILED DESCRIPTION OF THE INVENTION
First, the well-known decision tree construction method called C4.5 is used to select summary sentences. Under this method, each sentence is encoded on the basis of several features. In the present invention, each sentence s in a text S is first encoded on the basis of the following features: (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph.
“Text type” indicates which type the text belongs to, such as news report, editorial, or essay.
[0008]
“Position in the text” expresses as a ratio where the sentence appears within the whole text. For example, if the text S contains 10 sentences and the sentence in question is the first one, its position is expressed as 0/10 = 0.
[0009]
The “similarity with the headline” is determined by the following (Equation 4).
SIMM (t, s) = Σ NF (w, s) × IDF (w) (Equation 4)
Here, t is the headline of the text and s is a sentence. For each noun w that appears in t, its NF value and IDF value are obtained, and the sum of their products is taken as the similarity to the headline. NF (w, s) is defined by (Equation 5).
NF (w, s) = F (w, s) / MAX_F (s) (Equation 5)
Here, F (w, s) is the frequency of w in s, and MAX_F (s) is the frequency of the noun having the highest frequency among the nouns appearing in the sentence s. IDF (w) is defined as (Equation 6).
IDF (w) = log (N / DF (w)) / logN (Equation 6)
Here, DF (w) is the total number of sentences in which w appears. N is the total number of sentences s in the sentence S.
[0010]
“In-text TFIDF” is a value determined by (Equation 7).
D (s) = Σ NF (w, s) × IDF (w) (Equation 7)
Here, w is a noun that appears in the sentence s. NF (w, s) and IDF (w) follow the above definition.
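The definitions in (Equation 4) to (Equation 7) can be sketched in Python as follows. This is a minimal sketch that assumes nouns have already been extracted for the headline and for each sentence; the function names and sample data are hypothetical, and returning 0 for a noun that appears in no sentence (or for a one-sentence text) is an assumption made here to avoid division by zero, not something the text specifies.

```python
import math
from collections import Counter

def idf(noun, sentences):
    """IDF(w) = log(N / DF(w)) / log N   (Equation 6)."""
    n = len(sentences)
    df = sum(1 for s in sentences if noun in s)
    if df == 0 or n < 2:
        return 0.0  # assumption: undefined cases score 0
    return math.log(n / df) / math.log(n)

def nf(noun, sentence):
    """NF(w, s) = F(w, s) / MAX_F(s)   (Equation 5)."""
    counts = Counter(sentence)
    return counts[noun] / max(counts.values()) if counts else 0.0

def headline_similarity(headline, sentence, sentences):
    """SIMM(t, s): sum of NF(w, s) * IDF(w) over the nouns w of the headline t (Equation 4)."""
    return sum(nf(w, sentence) * idf(w, sentences) for w in set(headline))

def in_text_tfidf(sentence, sentences):
    """D(s): sum of NF(w, s) * IDF(w) over the nouns w of the sentence s (Equation 7)."""
    return sum(nf(w, sentence) * idf(w, sentences) for w in set(sentence))

# Hypothetical noun lists for a headline and a three-sentence body.
headline = ["tax", "reform"]
body = [["budget", "tax"], ["tax", "tax", "reform"], ["weather"]]
similarities = [headline_similarity(headline, s, body) for s in body]
```

The two measures differ only in which nouns are summed over: the headline's nouns for the similarity, the sentence's own nouns for the in-text TFIDF.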
[0011]
“Presence or absence of an attitude expression” records whether the end of the sentence contains an expression that indicates the author's attitude. The expressions considered are “... is important” (重要だ), “... is necessary” (必要だ), and sentence-final particles such as “ka” (か), “yo” (よ), and “ne” (ね).
[0012]
The “number of characters in the sentence” indicates the number of characters in the sentence s.
[0013]
“Position within a paragraph” indicates the position within a paragraph of a sentence s by the number of sentences preceding the sentence s / the total number of sentences in the paragraph, as in the above-described “position within a sentence”.
[0014]
The decision tree is constructed by characterizing each sentence of a text in terms of the above attributes and then training on field-specific data annotated with summary judgments. The construction method follows Quinlan's “C4.5”. C4.5 is a well-known method, but it is outlined below.
[0015]
In C4.5, the issue is how to model (generalize) the classification of database entries. For example, suppose a company's recruitment database is as follows.
[0016]
Gender  Age  Marital status  Education    Car  Hired
Female  23   Married         High school  Yes  ○
Male    30   Single          University   No   ○
Female  45   Married         High school  Yes  ○
Male    60   Married         University   No   ×
Modeling the classification means finding, from this data, the pattern of conditions that separates hired from not hired, so that one can predict for any person whether this company would hire them. In the database above, the “Hired” column is called the class and the other columns are called attributes. Each entry is called a case. C4.5 examines the attribute information of the cases, groups together cases with similar attribute values, and classifies them.
[0017]
For example, in the above example, the following classification model is possible.
Gender?
  Female → hired
  Male → Driver's license?
    Yes → hired
    No → not hired
In other words, the generalizations “if female, always hired” and “if male, hired if the person holds a driver's license” are possible. In practice the question arises of which attribute to use as the branching condition; C4.5 selects attributes using a statistical measure called the gain ratio.
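To make the gain ratio concrete, the sketch below computes it for the categorical attributes of the four-case hiring table in paragraph [0016]. It uses the standard definition (information gain divided by split information); the helper names are hypothetical, and the Age column is left out because continuous attributes need a threshold search that this sketch does not attempt.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by the split information."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - remainder
    split_info = entropy([r[attribute] for r in rows])
    return gain / split_info if split_info > 0 else 0.0

# The four cases of paragraph [0016], without the continuous Age column.
rows = [
    {"gender": "female", "marital": "married", "education": "high school", "car": "yes", "hired": "yes"},
    {"gender": "male", "marital": "single", "education": "university", "car": "no", "hired": "yes"},
    {"gender": "female", "marital": "married", "education": "high school", "car": "yes", "hired": "yes"},
    {"gender": "male", "marital": "married", "education": "university", "car": "no", "hired": "no"},
]
ratios = {a: gain_ratio(rows, a, "hired") for a in ("gender", "marital", "education", "car")}
```

On this tiny table, gender, education, and car happen to partition the cases identically and therefore tie, while marital status scores lower; a full C4.5 implementation adds tie-breaking, pruning, and handling of continuous attributes.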
[0018]
Next, a speed reading support operation is performed using the obtained decision tree. The operation includes the following procedures.
(1) The text S to be speed-read is called up on the screen. (2) Using the decision tree, each displayed sentence s is classified as a summary sentence or not. (3) Sentences classified as summary sentences are displayed in a highlight color, and all other sentences in a color different from the highlight color (called the background color). (4) Finally, the first sentence of each paragraph is displayed in a highlight color.
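Steps (3) and (4) amount to assigning a display style to every sentence. The sketch below shows one way this could look, with a caller-supplied classifier standing in for the decision tree. The style names are hypothetical, and letting the first-sentence style override the summary style when both apply is an assumption based only on step (4) being performed last.

```python
def assign_styles(paragraphs, is_summary):
    """Return a style label for every sentence, grouped by paragraph."""
    styles = []
    for paragraph in paragraphs:
        # Step (3): summary sentences get the highlight color, the rest the background color.
        row = ["summary" if is_summary(s) else "background" for s in paragraph]
        # Step (4): applied last, so the first sentence of the paragraph is restyled.
        if row:
            row[0] = "lead"
        styles.append(row)
    return styles

# Hypothetical text: two paragraphs; the stand-in classifier marks sentences ending in "!".
paragraphs = [["A1.", "A2!", "A3."], ["B1.", "B2."]]
styles = assign_styles(paragraphs, lambda s: s.endswith("!"))
```

Keeping the styles separate from the sentences means the original text is never reordered or cut, which is the point of showing summary sentences in place.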
[0019]
As described above, when a decision tree is constructed for each field, summary sentences specialized for a particular field can be produced. Presenting the summary sentences in the highlight color and the other sentences in the background color gives the displayed text visual contrast. Because the summary sentences are displayed in place within the body, the context before and after each summary sentence is preserved and can be consulted immediately when needed, which makes the summary easier to read. Furthermore, displaying the first sentence of each paragraph together with the summary makes the outline of the content understandable. More specific examples are described below with reference to the drawings.
[0020]
Example 1
FIG. 1 is a block diagram showing the concept of data processing of the speed reading support method according to the present invention.
[0021]
In FIG. 1, reference numeral 1 denotes an input step that takes in a document (article) to be read at high speed or takes in a document (article) input in advance in a database. Here, it is assumed that the article shown in FIG. 2 is input. 2 is a genre information acquisition step, in which a genre corresponding to the content of the input article is determined and output. The genre determination procedure is as follows. First, a keyword indicating a genre is searched for in the input article. If a keyword representing a genre cannot be found, the user is requested to input genre information. However, in the article of FIG. 2, since the heading portion has the keyword “(editorial)”, the genre of the article is determined to be editorial.
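The genre acquisition step can be pictured as a keyword lookup with a fallback to the user. In the sketch below, only the "(editorial)" keyword comes from the description; the other keywords, the function name, and the `ask_user` callback are hypothetical.

```python
# Genre keywords; only "(editorial)" is taken from the description, the rest are assumed.
GENRE_KEYWORDS = {
    "(editorial)": "editorial",
    "(essay)": "essay",
    "(news)": "news report",
}

def detect_genre(headline, body, ask_user):
    """Look for a genre keyword in the article; fall back to asking the user."""
    for text in (headline, body):
        lowered = text.lower()
        for keyword, genre in GENRE_KEYWORDS.items():
            if keyword in lowered:
                return genre
    return ask_user()  # no keyword found: request the genre from the user

genre = detect_genre("(Editorial) Rethinking tax reform", "", lambda: "unknown")
```

In an interactive system `ask_user` would open a prompt; here it is a plain callable so the lookup can be exercised without a user interface.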
[0022]
Next, the acquired genre information and the article are sent to decision tree selection step 3. Here, based on the genre information, the decision tree database that corresponds to the genre of the article is selected from among those prepared in advance. The decision trees are constructed with C4.5 as described above. Since the genre of the article is editorial, the decision tree database for editorials is selected. In this description, the decision tree shown in FIG. 4 is assumed to be selected.
The process then proceeds to feature extraction step 4. Here, each sentence in the body of the article, excluding the headline, is subjected to morphological analysis and then to feature extraction. The morphological analysis follows, for example, the procedure of [Sakurai et al., Design and Evaluation of Morphological Analysis Program ANIMA, Information Processing Society of Japan 54th National Convention, 1997]. Seven features are extracted: (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF value, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph.
[0023]
Extraction proceeds as follows. Let s be any sentence in the body. For newspaper articles provided electronically, classification information such as general article, essay, or editorial is usually attached to the text S. The text type of sentence s follows that classification information. If there is no classification information, the user distinguishes among general article, essay, and editorial to determine the type. The position in the text is the proportion of the whole text occupied by the sentences that appear from the beginning of the body up to just before s, that is, D (s) / N, where D (s) is the number of sentences from the beginning of the body up to just before s and N is the total number of sentences in the text S. For the similarity to the headline, letting T be the headline of the text, SIMM (T, s) is computed according to (Equation 4) above and its value is used as the similarity. The in-text TFIDF is obtained by computing NF (w, s) × IDF (w) for each noun w appearing in s and summing the results. (The nouns w are extracted by morphological analysis.) The presence or absence of an attitude expression is decided by whether one of the specific expressions “... is important” (重要だ), “... is necessary” (必要だ), “ka” (か), “yo” (よ), “ne” (ね), and so on appears in s (for an inflected form, its dictionary form is used). The value is 1 if there is no such expression, 2 for an attitude predicate such as “important” or “necessary”, and 3 for a sentence-final particle such as “ka”, “yo”, or “ne”. The number of characters in the sentence is the number of characters in s. The position in the paragraph is given as PD (s) / N (P), where PD (s) is the number of sentences that appear from the beginning of the paragraph up to just before s and N (P) is the total number of sentences in the paragraph.
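The positional features, the character count, and the 1/2/3 attitude code described above can be sketched as follows. This is a simplification under stated assumptions: the attitude code is found by matching the end of the surface string rather than by morphological analysis, the character count includes punctuation, and all names and sample sentences are hypothetical.

```python
ATTITUDE_PREDICATES = ("重要だ", "必要だ")  # coded 2
FINAL_PARTICLES = ("か", "よ", "ね")        # coded 3

def attitude_code(sentence):
    """Return 1 (none), 2 (attitude predicate) or 3 (sentence-final particle)."""
    core = sentence.rstrip("。．.!?！？ 　")  # drop trailing punctuation and spaces
    if core.endswith(ATTITUDE_PREDICATES):
        return 2
    if core.endswith(FINAL_PARTICLES):
        return 3
    return 1

def extract_features(paragraphs):
    """Compute position, length and attitude features for every sentence."""
    total = sum(len(p) for p in paragraphs)  # N: sentences in the whole text
    features = []
    preceding = 0                            # sentences before s in the text
    for paragraph in paragraphs:
        for index, sentence in enumerate(paragraph):
            features.append({
                "position_in_text": preceding / total,
                "position_in_paragraph": index / len(paragraph),  # PD(s) / N(P)
                "length": len(sentence),
                "attitude": attitude_code(sentence),
            })
            preceding += 1
    return features

# Hypothetical two-paragraph text.
paragraphs = [["改革が必要だ。", "本当にそうか。"], ["雨が降った。"]]
features = extract_features(paragraphs)
```

A real implementation would normalize inflected forms with a morphological analyzer before matching, as the paragraph above notes.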
[0024]
FIG. 3 shows the result of applying the above feature extraction procedure to each sentence of the article in FIG. 2. In this example there are seven sentences excluding the headline: sentence 1 is the first sentence of the body, sentence 2 the second, sentence 3 the third, and so on up to sentence 7, the seventh. Since every sentence of the body is part of an editorial, the text type is “editorial” for all of them. The seven extracted features, (1) text type, (2) position in the text, (3) similarity to the headline, (4) in-text TFIDF value, (5) presence or absence of an attitude expression, (6) number of characters in the sentence, and (7) position in the paragraph, are as shown in the figure.
[0025]
In the next step 5, each sentence is judged to be a summary sentence or not on the basis of the features extracted from it and the selected decision tree. The actual judgment is traced in detail below for sentences 1 to 7. The decision tree is the one shown in FIG. 4.
[0026]
Since sentence 1 has a similarity to the headline of 0.679, the process proceeds to N10. Furthermore, since the degree of similarity is 1.181 or less, N12 is passed. Next, since TFIDF is 9.449, it passes through N14. Next, since the number of characters is 41, N16 is passed. Next, since TFIDF is 9.449, it passes through N18 and is finally determined as a non-summary sentence.
[0027]
Since sentence 2 has a similarity to the headline of 0.263, it proceeds to N1. However, since the position in the sentence is 0, it is determined as a non-summary sentence through N2.
[0028]
Sentence 3 passes N10 and N12 because the similarity to the headline is 0.762. Furthermore, since the TFIDF value is 4.893, N14 is passed. In the next step, since the number of characters is 70, it passes through N15 and is determined as a summary sentence.
[0029]
Sentence 4 passes N1 because the similarity to the headline is 0.263. However, since the position in the sentence is 0.071, it passes through N2 and is determined as a non-summary sentence.
[0030]
Sentence 5 also passes N1 because the similarity to the headline is 0.263. Since the position in the sentence is 0.095, similarly to sentence 4, it passes through N2 and is determined to be a non-summary sentence.
[0031]
Sentence 6 passes N1 because the similarity to the headline is 0. Since the position in the sentence is 0.119, it is determined as a non-summary sentence through N2 as in the case of sentences 4 and 5.
[0032]
Sentence 7 also passes through N1 because the similarity to the headline is 0, and since its position in the text is 0.143, it passes through N2 and is determined to be a non-summary sentence, as with sentences 4 to 6.
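The walk through the tree described for sentences 1 to 7 can be mimicked with a small generic classifier. The tree below is hypothetical: FIG. 4 is not reproduced in the text, so the structure and the thresholds 0.5 and 50 are invented for illustration and are only chosen to agree with the outcomes stated above for sentences 1 to 3.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None    # set on leaves: "summary" or "non-summary"
    feature: Optional[str] = None  # feature tested at an internal node
    threshold: float = 0.0
    low: Optional["Node"] = None   # branch taken when value <= threshold
    high: Optional["Node"] = None  # branch taken when value > threshold

def classify(node, features):
    """Follow threshold tests from the root until a leaf is reached."""
    while node.label is None:
        node = node.low if features[node.feature] <= node.threshold else node.high
    return node.label

# Hypothetical tree; NOT the tree of FIG. 4. Thresholds 0.5 and 50 are assumptions.
TREE = Node(
    feature="headline_similarity", threshold=0.5,
    low=Node(label="non-summary"),
    high=Node(
        feature="length", threshold=50,
        low=Node(label="non-summary"),
        high=Node(label="summary"),
    ),
)

# Feature values quoted in the description for sentences 1 to 3.
sentence_1 = {"headline_similarity": 0.679, "length": 41}
sentence_2 = {"headline_similarity": 0.263, "length": 0}  # length not stated; unused on this path
sentence_3 = {"headline_similarity": 0.762, "length": 70}
results = [classify(TREE, f) for f in (sentence_1, sentence_2, sentence_3)]
```

A tree learned by C4.5 would have more nodes and would also test position and TFIDF, as the N-numbered paths above indicate.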
[0033]
In the next step 6, according to the determination result of the summary sentence, the sentence determined to be a summary sentence is displayed in an emphasis color, and the sentence determined to be a non-summary sentence is displayed in a background color. Further, in order to add an accent to the display, the first sentence of each paragraph of the article is highlighted with an emphasis color different from that of the summary sentence, and the speed reading support process is terminated.
[0034]
For texts other than editorials, such as essays and news reports, speed-reading support can likewise be provided by referring to the corresponding decision tree and applying the same processing as above.
[0035]
FIG. 5 is a diagram showing a concrete processing flow for the processing described above. In the example of FIG. 5, the text type information is acquired only once for the text that is the target of speed-reading support. Feature extraction, on the other hand, is performed sentence by sentence: each sentence of the target text is registered as an unprocessed sentence, the sentences are processed one at a time, and the processing ends when no unprocessed sentence remains. Since the flow of FIG. 5 is easy to follow in light of the preceding description, a description keyed to reference numerals in the figure is omitted. The feature table shown in FIG. 3 collects, for the purpose of explanation, the features extracted by the processing flow of FIG. 5.
[0036]
FIG. 6 shows how the article is displayed once the above judgments of summary and non-summary sentences have been applied. Because the figure cannot show color, sentences displayed in the highlight color are marked with a solid underline, and the first sentence of each paragraph of the article is marked with a dotted underline.
[0037]
In the above example the article has a headline, which made genre acquisition and similarity evaluation very easy. When there is no headline, the headline similarity data in FIG. 3 is absent and the paths N11 and N12 in FIG. 4 disappear, but this causes no substantial problem.
[0038]
Of course, if an article is too long to fit on one screen, you can scroll to see the content.
[0039]
Example 2
Next, an embodiment of a newspaper speed reading support device that combines article search support and the above-described speed reading support method will be described.
[0040]
FIG. 7 is a diagram showing a summary of the flow of signal processing for this purpose. Reference numeral 71 denotes a search condition input step for inputting a search condition for an article that the user wants to read. Search conditions can be set arbitrarily, but keywords are common and easy to use. 72 is an article search step, in which article information is fetched from an arbitrary database, and an article meeting the above-described search conditions is searched. Reference numeral 73 denotes a search result display step, which displays articles that meet the search conditions together with the degree of coincidence with the conditions, for example. 74 is an article designation step, in which the user selects an article that he / she wants to read with reference to the degree of coincidence with the condition, for example. 75 is a speed reading support instruction step, which is a step for requesting speed reading support for an article selected by the user for reading. Steps 76 to 80 are speed reading support steps as apparent from FIG. 1, and the speed reading support processing is performed for the article selected by the user in the article specifying step 74.
[0041]
FIG. 8 is a diagram illustrating an example of a hardware configuration for executing this processing. In FIG. 8, reference numeral 801 denotes an output means, here meaning a printer or the like. Reference numeral 802 denotes a CPU, which executes processing according to the programs described later. Reference numeral 803 denotes an input means such as a keyboard and a mouse. Reference numeral 804 denotes a system bus. Reference numeral 812 denotes a display means, that is, a display such as a CRT. Reference numeral 809 denotes a program holding means, for which a hard disk is used, for example. The program holding means 809 stores a search and speed-reading support interface operating program 805, a morphological analysis program 806, a decision tree generation program 807, a feature extraction program 816, a decision tree operating program 808, a search program 809, a similarity calculation program 8091, a document ranking program 8092, an important sentence display program 810, and a synopsis display program 811. Reference numeral 813 denotes a work area in memory. Reference numeral 814 denotes a decision tree database. Reference numeral 815 denotes a document database, which stores the documents to be searched, for example newspaper articles. The means and databases are connected via the system bus.
[0042]
First, when the user wants to read a particular article of interest among the newspaper articles, the user starts the device, launches the search / speed reading support interface operation program 805, and enters a search keyword on the search interface input screen of the display means 812. This corresponds to step 71 in FIG. 7. Next, the search program 809 is executed for the input keyword (FIG. 7, step 72).
[0043]
The search program calculates the similarity D between each article stored in the document database 815 and the input keywords according to (Equation 8).
[0044]
D (q, d) = Σ TF (w, d) × IDF (w) (Equation 8)
Here, q is a list of keywords, and d is a document expressed as a list of the noun words (excluding duplicates) appearing in it. w represents an element (word) of the list q, so the sum in (Equation 8) is taken over the words of q. TF (w, d) is the frequency of w in the document d, and IDF (w) is calculated over all articles stored in the document database 815 according to (Equation 9).
[0045]
IDF (w) = log (N / DF (w)) (Equation 9)
Here, N is the total number of articles in the database, and DF (w) is the number of articles in which the word w appears at least once. Nouns are extracted from each document by executing the morphological analysis program 806; specifically, the morphological analysis program described in the above-mentioned [Asai et al., Design and Evaluation of Morpheme Analysis Program ANIMA, 1997] is used. In this way, the similarity D is obtained for all articles in the document database 815, the five articles with the highest values are selected, and the selection result is presented to the user via the output screen of the search interface on the display means 812 (FIG. 7, step 73). In the example of the output screen shown in FIG. 9, 91 indicates a headline, and 92 is a switch for turning on the display of the article text. When this switch is clicked, the content of the corresponding article is read from the document database and displayed on the display means (FIG. 7, step 74).
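The ranking by (Equation 8) and (Equation 9) can be sketched as follows. This is a minimal illustration and not the search program 809 itself. It assumes that each article has already been reduced to a list of nouns (the morphological analysis step is not reproduced), that repeated nouns are kept so that TF is a frequency, and that a keyword appearing in no article contributes zero. The function names idf, similarity, and top_articles are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def idf(word: str, docs: Sequence[Sequence[str]]) -> float:
    """Equation 9: IDF(w) = log(N / DF(w))."""
    df = sum(1 for doc in docs if word in doc)
    # Assumption: a word found in no article contributes nothing,
    # since Equation 9 is undefined for DF(w) = 0.
    return math.log(len(docs) / df) if df else 0.0


def similarity(query: Sequence[str], doc: Sequence[str],
               docs: Sequence[Sequence[str]]) -> float:
    """Equation 8: D(q, d) = sum over w in q of TF(w, d) * IDF(w)."""
    tf = Counter(doc)  # frequency of each noun in the article
    return sum(tf[w] * idf(w, docs) for w in query)


def top_articles(query: Sequence[str], docs: Sequence[Sequence[str]],
                 k: int = 5) -> List[int]:
    """Return the indices of the k articles with the highest similarity D."""
    scores = [similarity(query, doc, docs) for doc in docs]
    # Ties are broken by article index (an assumption; the source is silent).
    return sorted(range(len(docs)), key=lambda i: (-scores[i], i))[:k]
```

With k left at its default of 5, top_articles returns the five highest-scoring articles, matching the selection described above.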
[0046]
FIG. 10 shows an example of the screen when the user turns on the switch 92. In this example, the content of the article is the same as that described with reference to FIG. 2, but in FIG. 10 the user is presented, along with the text, with a selection switch 1001 for choosing whether or not to use speed reading support (FIG. 7, step 75). When the user selects speed reading support, the morphological analysis program 806 is executed and morphological analysis is performed on each sentence of the displayed text. Genre information is acquired (FIG. 7, step 76), a decision tree is selected (FIG. 7, step 77), and the processing result is passed to the feature extraction program 816. The feature extraction program 816 extracts the information needed to determine important sentences from the morphological analysis data (FIG. 7, step 78) and passes it to the decision tree operation program 808. The decision tree operation program 808 accesses the decision tree database 814 prepared in advance and determines whether or not each sentence is a summary sentence based on the information extracted by the feature extraction program 816 (FIG. 7, step 79). If the sentence is a summary sentence, the important sentence display program 810 is executed and the sentence is displayed on the display means 812 in a highlight color; otherwise it is displayed in the background color, and processing proceeds to the next sentence. After the summary sentence determination has been completed, the synopsis display program 811 is executed to display the first sentence of each paragraph of the displayed text in an emphasis color different from that of the important sentences (FIG. 7, step 80).
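The two display passes described above (summary sentence determination, then synopsis display) can be sketched as follows. This is an illustration under stated assumptions, not the display programs 810 and 811 themselves: the color role names are hypothetical, the is_summary callback stands in for the decision tree operation, and the source does not say which color applies when the first sentence of a paragraph is also a summary sentence, so the sketch assumes the summary color is kept.

```python
from typing import Callable, List, Tuple

# Hypothetical color roles; the source names no concrete colors.
HIGHLIGHT = "highlight"    # summary (important) sentence
SYNOPSIS = "synopsis"      # first sentence of a paragraph
BACKGROUND = "background"  # every other sentence


def assign_colors(
    paragraphs: List[List[str]],
    is_summary: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Return (sentence, color role) pairs in document order."""
    colored = []
    # Pass 1: summary sentence determination, sentence by sentence.
    for paragraph in paragraphs:
        for sentence in paragraph:
            role = HIGHLIGHT if is_summary(sentence) else BACKGROUND
            colored.append([sentence, role])
    # Pass 2: synopsis display for the first sentence of each paragraph.
    index = 0
    for paragraph in paragraphs:
        if paragraph and colored[index][1] != HIGHLIGHT:
            # Assumption: a first sentence that is also a summary
            # sentence keeps the summary color.
            colored[index][1] = SYNOPSIS
        index += len(paragraph)
    return [(sentence, role) for sentence, role in colored]
```

Because every sentence of the text is returned with a role, the summary sentences stay in their original context rather than being shown in isolation.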
[0047]
As described above, according to the present embodiment, article search under a search condition such as a keyword and speed reading support can be processed as a single flow.
[0048]
Example 3
FIG. 11 is a configuration diagram of an example in which the speed reading support method described in the second embodiment is realized as a network-type document search support service. In FIG. 11, a service providing apparatus (server) and service receiving apparatuses (client 1 and client 2) are connected via an information communication network. The server has the configuration described for the second embodiment, with a communication means 1101 and an interface 1102 to the information communication network added to the system bus 804; for simplicity, the other devices of the server are not shown. The client 1 has an output means, here a printer or the like. 1122 is a CPU, which executes processing according to a program described later. 1123 is an input means, for example a keyboard and a mouse. 1112 is a system bus. 1113 is a display means, that is, a display such as a CRT. 1114 is a search / speed reading support interface program holding means, for which a hard disk, for example, is used. 1115 is a work area of a memory, 1116 is a communication means, and 1111 is an interface. The client 2 has the same configuration as the client 1; in this example only the client 1 is illustrated in detail, and the client 2 is simplified to show only the bus 1132 and the interface 1131.
[0049]
First, the user inputs a command requesting the start of the document search service through the input means 1123. The request command is transmitted to the server by the communication means 1116 through the communication network. On receiving the command, the server transmits the search / speed reading support interface operation program 805 stored in its program holding means to the client 1 via the communication network. The client 1 holds the received program 805 in the search / speed reading support interface program holding means 1114 and runs it using its own computing resources (CPU 1122, work area 1115). A screen requesting a search keyword then appears, as described in the second embodiment. The user inputs a search keyword through the input means 1123, and the keyword is transmitted to the server by the communication means 1116. The server then runs the search program 809 and searches based on the transmitted keyword, and the search result obtained is transmitted to the client 1 via the communication network. The client 1 presents the result to the user using the interface operation program 805 that it received and holds. The display contents at this time are the same as the search result display of the second embodiment. When the user selects a text display button here, a text display request is transmitted to the server via the network, the server sends the corresponding document to the client 1 in response, and the interface operation program 805 running on the computer of the client 1 displays the received document on the screen. The display contents at this time are the same as the article text display of the second embodiment. At this time, however, a copy of the sent document is kept in the work area on the server side.
When the user further turns on the speed reading support switch 1001, the request is sent to the server via the network. On receiving it, the server passes the copy of the sent document kept in the work area to the speed reading support program. Following the same procedure as in the second embodiment, the speed reading support program performs, for each sentence in the document, morphological analysis, feature extraction, important sentence determination by the decision tree operation program, and finally synopsis extraction. The server then sends information on which sentence is to be displayed in which color to the client 1 via the network, and the interface operation program of the client 1 adjusts the display color of each sentence in the document accordingly.
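The source does not specify how the color information is encoded on the network. As one possible sketch, the server could send a small JSON message listing one color role per sentence, which the client pairs with the sentences of the document it already displays. The message layout, the field names doc_id and colors, and both function names are assumptions made for illustration only.

```python
import json
from typing import List, Tuple


def encode_color_message(doc_id: str, colors: List[str]) -> str:
    """Server side: one color role per sentence, in document order."""
    return json.dumps({"doc_id": doc_id, "colors": colors})


def apply_color_message(message: str,
                        sentences: List[str]) -> List[Tuple[str, str]]:
    """Client side: pair each displayed sentence with its color role."""
    payload = json.loads(message)
    if len(payload["colors"]) != len(sentences):
        # The message must describe the document currently on screen.
        raise ValueError("color list does not match the displayed document")
    return list(zip(sentences, payload["colors"]))
```

Sending only the color roles keeps the message small, since the client already holds the document text that the server sent earlier.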
[0050]
Since this operation is the same for the client 2, the description thereof is omitted.
[0051]
Example 4
In the third embodiment, the search / speed reading support interface operation program 805 is transmitted from the server to the client 1 via the communication network in response to the search request from the client. However, the same operation as in the third embodiment can also be performed when this program is held in advance in the program holding means 1114 on the client side. In this case as well, as described with reference to FIG. 11, when the user inputs a document search request command using the input means 1123, the interface operation program stored in the program holding means 1114 on the client 1 is started, and the subsequent operation is processed in the same manner as in the third embodiment, enabling search and speed reading support from a remote place.
[0052]
[Effects of the invention]
As is clear from the above description, according to the present invention, since the summary sentence is displayed as it is in the text, the context before and after the summary sentence is preserved and the summary is easy to read. Furthermore, by displaying the first sentence of each paragraph together with the summary, the outline of the contents can be understood.
[0053]
When a decision tree for each genre is accumulated in advance and referred to, a summary sentence determination specialized for a specific genre can be performed very effectively.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the concept of data processing of a speed reading support method according to the present invention.
FIG. 2 is a diagram showing an example of an article adopted as a target of speed reading support.
FIG. 3 is a diagram illustrating a result of performing feature extraction on each sentence of the article targeted for speed reading support.
FIG. 4 is a diagram showing an example of a decision tree.
FIG. 5 is a flowchart illustrating the concept of data processing in FIG. 1.
FIG. 6 is a view showing a display example of a result of speed reading support for the article of FIG. 2.
FIG. 7 is a block diagram showing the concept of data processing when the document speed reading support method according to the present invention is applied to a document search support apparatus.
FIG. 8 is a diagram showing an example of a device configuration that realizes the processing shown in FIG. 7.
FIG. 9 is a diagram showing a specific example of a display form of a document search result.
FIG. 10 is a diagram showing a specific example when a specific document body is displayed according to a document search result.
FIG. 11 is a diagram showing a specific example of an embodiment for receiving, from a remote place, a document search support service to which a speed reading support method according to the present invention is applied.

Claims

A speed reading support device comprising:
means for inputting a text consisting of a headline and a body;
means for holding information about the genre of the text;
means for acquiring information about the genre of the input text based on a predetermined rule, performing feature analysis on each sentence in the text according to a predetermined rule while referring to a decision tree corresponding to the genre, and determining whether or not each sentence is a summary sentence; and
means for displaying each sentence, according to the result of the determination, in a highlight color if it is a summary sentence and otherwise in a color different from the highlight color, and for displaying the first sentence of each paragraph of the input text in a color different from these two colors,
wherein the feature analysis is performed by coding, for each sentence, its position in the text, its similarity to the headline, its TFIDF, its number of characters, and its position in the paragraph, and
wherein the means for determining whether or not a sentence is a summary sentence:
(1) determines the similarity between each sentence and the headline,
(2-1) if the similarity is less than or equal to a first similarity, examines the position in the text,
(3-1) if the position in the text is less than or equal to a first predetermined value, determines the sentence to be a "non-summary sentence",
(3-2) if the position in the text is greater than the first predetermined value, examines the position in the paragraph,
(4-1) if the position in the paragraph is less than or equal to a second predetermined value, determines the sentence to be a "summary sentence",
(4-2) if the position in the paragraph is greater than the second predetermined value, examines the position in the text again,
(5-1) if the position in the text is less than or equal to a third predetermined value, determines the sentence to be a "non-summary sentence",
(5-2) if the position in the text is greater than the third predetermined value, examines the number of characters in the sentence,
(6-1) if the number of characters in the sentence is less than or equal to a fourth predetermined value, determines the sentence to be a "non-summary sentence",
(6-2) if the number of characters in the sentence is greater than the fourth predetermined value, determines the sentence to be a "summary sentence",
(2-2) if the similarity is greater than the first similarity, examines the similarity again,
(7-1) if the similarity is greater than a second similarity, determines the sentence to be a "summary sentence",
(7-2) if the similarity is less than or equal to the second similarity, examines the TFIDF,
(8-1) if the TFIDF is greater than a fifth predetermined value, determines the sentence to be a "non-summary sentence",
(8-2) if the TFIDF is less than or equal to the fifth predetermined value, examines the number of characters in the sentence,
(9-1) if the number of characters in the sentence is greater than a sixth predetermined value, determines the sentence to be a "summary sentence",
(9-2) if the number of characters in the sentence is less than or equal to the sixth predetermined value, examines the TFIDF again,
(10-1) if the TFIDF is less than or equal to a seventh predetermined value, determines the sentence to be a "summary sentence", and
(10-2) if the TFIDF is greater than the seventh predetermined value, determines the sentence to be a "non-summary sentence".
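
The branching in steps (1) through (10-2) can be written as a single function, sketched below for illustration. The claim leaves the two similarity thresholds and the seven predetermined values unspecified, so the Thresholds fields and the EXAMPLE values are hypothetical placeholders, as are the names Features and is_summary_sentence. The claim also speaks of coded features; the sketch assumes plain numeric values.

```python
from dataclasses import dataclass


@dataclass
class Features:
    similarity: float      # similarity to the headline
    pos_in_text: int       # position of the sentence within the whole text
    pos_in_paragraph: int  # position of the sentence within its paragraph
    num_chars: int         # number of characters in the sentence
    tfidf: float           # TFIDF score of the sentence


@dataclass
class Thresholds:
    sim1: float  # first similarity
    sim2: float  # second similarity
    v1: int      # first predetermined value (position in text)
    v2: int      # second predetermined value (position in paragraph)
    v3: int      # third predetermined value (position in text)
    v4: int      # fourth predetermined value (characters)
    v5: float    # fifth predetermined value (TFIDF)
    v6: int      # sixth predetermined value (characters)
    v7: float    # seventh predetermined value (TFIDF)


# Hypothetical values for illustration only; the claim gives none.
EXAMPLE = Thresholds(sim1=0.1, sim2=0.5, v1=2, v2=1, v3=5,
                     v4=40, v5=3.0, v6=60, v7=1.0)


def is_summary_sentence(f: Features, t: Thresholds) -> bool:
    if f.similarity <= t.sim1:              # (2-1)
        if f.pos_in_text <= t.v1:           # (3-1)
            return False
        if f.pos_in_paragraph <= t.v2:      # (3-2) then (4-1)
            return True
        if f.pos_in_text <= t.v3:           # (4-2) then (5-1)
            return False
        return f.num_chars > t.v4           # (5-2) then (6-1) / (6-2)
    if f.similarity > t.sim2:               # (2-2) then (7-1)
        return True
    if f.tfidf > t.v5:                      # (7-2) then (8-1)
        return False
    if f.num_chars > t.v6:                  # (8-2) then (9-1)
        return True
    return f.tfidf <= t.v7                  # (9-2) then (10-1) / (10-2)
```

In the device, one such tree is held per genre, so a genre-specific determination would amount to choosing a different Thresholds instance (or a different tree) before calling the function.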