JP4186321B2

JP4186321B2 - Document processing method and apparatus, and recording medium

Info

Publication number: JP4186321B2
Application number: JP21265299A
Authority: JP
Inventors: 確長尾
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1999-07-27
Filing date: 1999-07-27
Publication date: 2008-11-26
Anticipated expiration: 2019-07-27
Also published as: JP2001043220A

Description

【０００１】
【発明の属する技術分野】
本発明は、電子文書を処理する文書処理方法及び装置並びに電子文書を処理する文書処理プログラムが記録された記録媒体に関する。
【０００２】
【従来の技術】
従来、インターネットにおいて、ウィンドウ形式でハイパーテキスト型情報を提供するアプリケーションサービスとしてＷＷＷ（World Wide Web）が知られている。
【０００３】
ＷＷＷは、文書の作成、公開又は共有化の文書処理を実行し、新しいスタイルの文書の在り方を示したシステムである。しかし、文書の実際上の利用の観点からは、文書の内容に基づいた文書の分類や要約といった、ＷＷＷを越える高度な文書処理が求められている。このような高度な文書処理には、文書の内容の機械的な処理が不可欠である。
【０００４】
しかしながら、文書の内容の機械的な処理は、以下のような理由から依然として困難である。すなわち、第１に、ハイパーテキストを記述する言語であるＨＴＭＬ（Hyper Text Markup Language）は、文書の表現については規定するが、文書の内容についてはほとんど規定しない点、第２に、文書間に構成されたハイパーテキストのネットワークは、文書の読者にとって文書の内容を理解するために必ずしも利用しやすいものではない点、第３に、一般に文章の著作者は読者の便宜を念頭に置かずに著作するが、文書の読者の便宜が著作者の便宜と調整されることはない点が、文書の機械的処理を困難とする理由である。
【０００５】
このように、ＷＷＷは新しい文書の在り方を示したシステムであるが、文書を機械的に処理しないので、高度な文書処理を行うことができなかった。換言すると、高度な文書処理を実行するためには、文書を機械的に処理することが必要となる。
【０００６】
そこで、文書の機械的な処理を目標として、文書の機械的な処理を支援するシステムが自然言語研究の成果に基づいて開発されている。自然言語研究による文書処理として、文書の著作者等による文書の内部構造についての属性情報、いわゆるタグの付与を前提とした、文書に付与されたタグを利用する機械的な文書処理が提案されている。
【０００７】
ところで、近年のコンピュータの普及や、ネットワーク化の進展に伴い、文章処理や、文書の内容に依存した索引などで、テキスト文書の作成、ラベル付け、変更などを行う文書処理の高機能化が求められている。例えば、ユーザの要望に応じた文書の要約や、文書の分類等が望まれる。
【０００８】
すなわち、ユーザは、例えばいわゆるサーチエンジンのような情報検索システムを利用し、インターネットを介して提供される膨大な情報の中から所望の情報を探し出すようにしている。この情報検索システムは、指定されたキーワードに基づいて情報を検索し、検索した情報をユーザに提供するシステムである。ユーザは、提供された情報の中から所望の情報を選択する。
【０００９】
情報検索システムにおいては、このように容易に情報を検索することができるが、ユーザは、検索されて提供された情報を一読して概略を理解し、それが希望する情報であるか否かを判断する必要がある。この作業は、特に、提供された情報の量が多い場合には、ユーザにとって大きな負担となる。そこで、最近、テキスト情報、すなわち文書の内容を自動的に要約するシステムであるいわゆる自動要約文作成システムが注目されている。
【００１０】
自動要約文作成システムは、元の情報、すなわち文書の大意を保持したままテキストの情報の長さや複雑さを減らすことによって、要約文を作成するシステムである。ユーザは、この自動要約文作成システムにより作成された要約文を一読することで、文書の概略を理解することができる。
【００１１】
通常、自動要約文作成システムは、テキスト中の文や単語を１つの単位とし、それに何らかの情報に基づいた重要度を付与して順序付けする。そして、自動要約文作成システムは、上位に順序付けした文や単語を寄せ集め、要約文を作成する。
【００１２】
【発明が解決しようとする課題】
ところで、上述した自動要約文作成システムにおいては、文書から要約文を作成することが可能であるが、作成される要約文の情報量は、文書の情報量等により決定されていた。そのため、自動要約文作成システムにおいては、例えば、作成された要約文が簡略すぎてユーザが文書の概略を把握できない場合、ユーザは、より詳細な要約文を参照することができなかった。
【００１３】
また、元の文中の主語の省略されている部分を要約文に取り入れる場合に、省略されている部分が要約文中に含まれていないと、要約文から正確な内容把握が行えないことにもなる。
【００１４】
本発明は、上述の実情に鑑みて提案されたものであり、入力された文書に対して、ユーザの理解が容易で正確な内容の要約文を自動生成し得るような文書処理方法及び装置、並びに文書処理プログラムが記録されてなる記録媒体を提供することを目的とする。
【００１５】
【課題を解決するための手段】
上述の課題を解決するために、本発明は、電子文書の形態の文書を処理する文書処理方法及び装置において、文書の要約文を作成し、作成される要約文中における省略又は置き換えられた語句が存在する場合、これを補完し、文書は、複数のエレメントが階層化された内部構造を有し、内部構造を示すタグ情報が予め付与されており、タグ情報により示される内部構造に基づいたエレメントの重要度を算出してエレメントに付与し、タグ情報に基づいて補完を行うことを特徴としている。
上記補完は、タグ情報に基づいて作成される要約文中における省略された主語又は目的語が該要約文中に含まれていないとき、元の文書中の対応する主語又は目的語を要約文中に追加するゼロ照応処理を行うことが好ましい。
また、上記補完は、タグ情報に基づいて作成される要約文中に含まれる被参照エレメントに対応する参照エレメントが要約文中に含まれていないときに元の文書中の対応する参照エレメントを要約文中の被参照エレメントに置き換えを行うことが好ましい。
【００１６】
ここで、上記省略された主語又は目的語はゼロ照応エレメントと称され、上記要約文中にこのゼロ照応エレメントが含まれているか否かを判別し、含まれていないときに、当該ゼロ照応エレメントを要約文中に括弧でくくって追加することが好ましい。
また、上記被参照エレメントは例えば代名詞又は限定節であり、上記参照エレメントは例えば先行詞である。上記置き換えの際には、要約文中の被参照エレメントに対応する参照エレメントが要約文中に含まれているか否かを判別し、対応する参照エレメントが要約文中に含まれていないとき要約文中の被参照エレメントを対応する参照エレメントに置き換え、上記対応する参照エレメントが要約文中に含まれているとき上記要約文中の上記被参照エレメントを置き換えないことが好ましい。
【００１７】
これによって、要約文中に省略された主語や目的語が全く含まれないことが回避される。また、要約文中に先行詞なしの代名詞や限定節が含まれることが回避される。
【００１８】
【発明の実施の形態】
以下、図面を参照して、本発明に係る文書処理方法及び装置並びに文書処理プログラムが記録された記録媒体の実施の形態について説明する。
【００１９】
本発明の実施の形態としての文書処理装置は、図１に示すように、制御部１１及びインターフェース１２を備える本体１０と、ユーザからの入力を受けて本体１０に送る入力部２０と、外部からの信号を受信して本体１０に送る通信部２２と、本体１０からの出力を表示する表示部３０と、記録媒体３２に対して情報を記録／再生する記録／再生部３１とを有している。
【００２０】
本体１０は、制御部１１及びインターフェース１２を有し、この文書処理装置の主要な部分を構成している。制御部１１は、この文書処理装置における処理を実行するＣＰＵ１３と、揮発性のメモリであるＲＡＭ１４と、不揮発性のメモリであるＲＯＭ１５とを有している。ＣＰＵ１３は、例えばＲＯＭ１５に記録されたプログラムの手順にしたがって各処理を実行するための制御を行う。ＲＡＭ１４には、ＣＰＵ１３が各種の処理を実行する上で必要なプログラムやデータが一時的に格納される。インターフェース１２は、制御部１１、入力部２０、通信部２２、表示部３０及び記録／再生部３１に接続されている。このインターフェース１２は、制御部１１の制御の下に、入力部２０及び通信部２２からのデータの入力、表示部３０へのデータの送信、記録／再生部３１に対するデータの送受信について、データを送信するタイミングを調整したり、データの形式を変換したりする。
【００２１】
入力部２０は、この文書処理装置に対するユーザの入力を受ける部分であり、例えばキーボードやマウスにより構成される。ユーザは、この入力部２０を用い、キーボードによりキーワードを入力したり、マウスにより表示部３０に表示されている電子文書のエレメントを選択して入力したりすることができる。なお、以下では電子文書を単に文書と称することにする。ここで、エレメントとは文書を構成する要素であって、例えば文書、文、句及び語が含まれる。
【００２２】
通信部２２は、この文書処理装置に外部から通信路、例えば電話回線を介して送信される信号を受信する部分である。具体的には、通信部２２は、例えば、モデム、ターミナルアダプタ等により構成され、電話回線を介してインターネット２３に接続され、インターネットに接続されているサーバ２４にアクセスし、そこから文書等のデータを受信することができるようにされている。このような通信部２２は、外部から送信された複数の文書等のデータを受信し、受信したデータを本体１０に送る。
【００２３】
表示部３０は、この文書処理装置からの文字や画像情報の出力を表示する。表示部３０は、例えば陰極線管（cathode ray tube;CRT）や液晶表示装置（liquid crystal display;LCD）から構成され、例えば単数又は複数のウィンドウを表示したり、文字、図形、又は画像等を表示したりする。
【００２４】
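The recording/reproducing unit 31 records data on and/or reproduces data from a removable recording medium 32 such as a floppy disk, an optical disc, or a magneto-optical disc. The recording medium 32 stores the document processing program for processing documents, as well as the documents to be processed.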
【００２５】
ハードディスクドライブ３３は、大容量の磁気記録媒体であるハードディスクに対してデータの記録及び／又は再生を行う。
【００２６】
Such a document processing apparatus receives a desired document and displays it on the display unit 30 as follows.
【００２７】
文書処理装置においては、まずユーザが入力部２０を操作してインターネット２３を介して通信を行うためのプログラムを起動し、サーバ２４（サーチエンジン）のＵＲＬ（Uniform Resource Locator）を入力すると、制御部１１は、通信部２２を制御し、サーバ２４にアクセスする。
【００２８】
これに応じて、サーバ２４は、インターネット２３を介して、文書処理装置の通信部２２に検索画面のデータを出力する。文書処理装置においてＣＰＵ１３は、このデータをインターフェース１２を介して表示部３０に出力し、表示させる。
【００２９】
文書処理装置においては、ユーザが入力部２０を用いてこの検索画面上でキーワード等を入力して検索を指令すると、通信部２２からインターネット２３を介して、サーチエンジンとしてのサーバ２４に対して検索命令が送信される。
【００３０】
サーバ２４は、検索命令を受信すると、この検索命令を実行し、得られた検索結果をインターネット２３を介して通信部２２に送信する。文書処理装置において制御部１１は、通信部２２を制御し、サーバ２４から送信される検索結果を受信させ、その一部を表示部３０に表示させる。
【００３１】
具体的には、ユーザが入力部２０を用いて例えば「ＴＣＰ」というキーワードを入力して検索を指令した場合には、文書処理装置には、サーバ２４から「ＴＣＰ」のキーワードを含む各種情報が送信され、表示部３０に表示される。
【００３２】
続いて、本実施の形態における文書について説明する。本実施の形態においては、文書処理は、文書に付与された属性情報であるタグを参照して行われる。本実施の形態で用いられるタグには、文書の構造を示す統語論的（syntactic）タグと、多言語間で文書の機械的な内容理解を可能にするような意味的（semantic）・語用論的タグとがある。
【００３３】
統語論的なタグとしては、文書の内部構造を記述するものがある。タグ付けによる内部構造は、図２に示すように、文書、文、語彙エレメント等の各エレメントが、通常リンク、参照・被参照リンクにより関連付けられて構成されている。図中において、白丸“○”はエレメントを示し、最下位の白丸は文書における最小レベルの語に対応する語彙エレメントである。また、実線は文書、文、語彙エレメント等のエレメント間のつながり示す通常リンク（normal link）である。破線は参照・被参照による係り受け関係を示す参照リンク（reference link）である。文書の内部構造は、上位から下位への順序で、文書（document）、サブディビジョン（subdivision）、段落（paragraph）、文（sentence）、サブセンテンシャルセグメント（subsentential segment）、・・・、語彙エレメントから構成される。これらのうち、サブディビジョンと段落とは、例えばオプションとして用いられるものである。
【００３４】
一方、意味論・語用論的なタグ付けとしては、係り受け、例えば代名詞の指示対象等を示す統語構造（syntactic structure）に関するタグ付けや多義語の意味のように意味（semantic）の情報を記述するものがある。本実施の形態におけるタグ付けは、ＨＴＭＬ（Hyper Text Markup Language）と同様なＸＭＬ（eXtensible Markup Language）の形式によるものである。
【００３５】
以下にタグ付けされた文や文書の例を示すが、文書へのタグ付けはこの方法に限定されるものではない。また、以下では英語と日本語の文書の例を示すが、タグ付けによる内部構造の記述は、他の言語にも同様に適用することができることは勿論である。
【００３６】
例えば、“Time flies like an arrow.”という文については、下記のようなタグ付けをすることができる。
【００３７】
＜文＞＜名詞句語義＝“time０”＞time＜／名詞句＞
＜動詞句＞＜動詞語義＝“fly１”＞flies＜／動詞＞
＜形容動詞句＞＜形容動詞語義＝“like０”＞like＜／形容動詞＞
＜名詞句＞an＜名詞語義＝“arrow０”＞arrow＜／名詞＞＜／名詞句＞
＜／形容動詞句＞＜／動詞句＞.＜／文＞
ここで＜文＞、＜名詞＞、＜名詞句＞、＜動詞＞、＜動詞句＞、＜形容動詞＞、＜形容動詞句＞は、それぞれ文、名詞、名詞句、動詞、動詞句、形容詞を含む前置詞句又は後置詞句／形容詞句、形容詞句／形容動詞句のような文の統語構造（syntactic structure）を表している。タグは、エレメントの先端の直前及び終端の直後に対応して配置される。エレメントの終端の直後に配置されるタグは、記号“／”によりエレメントの終端であることを示している。エレメントは統語的構成素、すなわち句、節、及び文を示す。なお、語義（word sense）＝“time０”は、語“time”の有する複数の意味、すなわち複数の語義のうちの第０番目の意味を指している。具体的には、語“time”には少なくとも名詞、形容詞、動詞の意味があるが、ここでは語“time”が名詞であることを示している。同様に、語“オレンジ”は少なくとも植物の名前、色、果物の意味があるが、これらも語義によって区別することができる。
【００３８】
本実施の形態に用いられる文書は、図３に示すように、上記図１の表示部３０のウィンドウ１０１に統語構造を表示することができる。このウィンドウ１０１においては、右半面１０３に語彙エレメントが、左半面１０２に文の内部構造がそれぞれ表示されている。このウィンドウ１０１においては、日本語で記述された文書のみならず、英語等の任意の言語で記述された文書についても、統語構造を表示することができる。
【００３９】
このウィンドウ１０１には、一例として、タグ付けにより内部構造を記述された次に示すような文書「Ａ氏のＢ会が終わったＣ市で、一部の大衆紙と一般紙がその写真報道を自主規制する方針を紙面で明らかにした。」の一部が表示されている。この文書のタグ付けの例を次に示す。
【００４０】
＜文書＞＜文＞＜形容動詞句関係＝“位置”＞＜名詞句＞＜形容動詞句場所＝“Ｃ市”＞
＜形容動詞句関係＝“主語”＞＜名詞句識別子＝“Ｂ会”＞＜形容動詞句関係＝“所属”＞＜人名識別子＝“Ａ氏”＞Ａ氏＜／人名＞の＜／形容動詞句＞＜組織名識別子＝“Ｂ会”＞Ｂ会＜／組織名＞＜／名詞句＞が＜／形容動詞句＞
終わった＜／形容動詞句＞＜地名識別子＝“Ｃ市”＞Ｃ市＜／地名＞＜／名詞句＞で、＜／形容動詞句＞＜形容動詞句関係＝“主語”＞＜名詞句識別子＝“press” 統語＝“並列”＞＜名詞句＞＜形容動詞句＞一部の＜／形容動詞句＞大衆紙＜／名詞句＞と＜名詞＞一般紙＜／名詞＞＜／名詞句＞が＜／形容動詞句＞
＜形容動詞句関係＝“目的語”＞＜形容動詞句関係＝“内容” 主語＝“press”＞＜形容動詞句関係＝“目的語”＞＜名詞句＞＜形容動詞句＞＜名詞共参照＝“Ｂ会”＞そ＜／名詞＞の＜／形容動詞句＞写真報道＜／名詞句＞を＜／形容動詞句＞
自主規制する＜／形容動詞句＞方針を＜／形容動詞句＞
＜形容動詞句関係＝“位置”＞紙面で＜／形容動詞句＞
明らかにした。＜／文＞＜／文書＞
この文書においては、「一部の大衆紙と一般紙」は、統語＝“並列”というタグにより並列であることが表されている。並列の定義は、係り受け関係を共有すると言うことである。特に何も指定がない場合は、例えば、＜名詞句関係＝ｘ＞＜名詞＞Ａ＜／名詞＞＜名詞＞Ｂ＜／名詞＞＜／名詞句＞はＡがＢに依存関係のあることを表す。関係＝ｘは関係属性を表す。
【００４１】
関係属性は、統語、意味、修辞についての相互関係を記述する。主語、目的語、間接目的語のような文法機能、動作主、被動作者、受益者などのような主題役割、及び理由、結果などのような修辞関係はこの関係属性により記述される。本実施の形態では、主語、目的語、間接目的語のような比較的容易な文法機能について関係属性を記述する。
【００４２】
また、この文書においては、“Ａ氏”、“Ｂ会”、“Ｃ市”のような固有名詞について、地名、人名、組織名等のタグにより属性が記述されている。これら地名、人名、組織名等のタグが付与される語は固有名詞である。
【００４３】
また、このようなタグ付けされた文書においては、代名詞や限定節についての参照、被参照関係がタグにより表される。例えば、上記文書においては、「その写真報道を」のエレメントの「その」の部分が、「共参照＝“Ｂ会”」という属性を持つことにより、その部分が「識別子＝“Ｂ会”」という属性を持つエレメント（この場合は名詞句）「Ａ氏のＢ会」であることが示されている。従って、上記「その写真報道を」の「その」の部分を置き換えると、「Ａ氏のＢ会の写真報道を」となる。
【００４４】
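Furthermore, in a document tagged in this way, an omitted subject, object, or the like can be supplied from another part of the document. In the example above, the element 「自主規制する」 ("self-regulate") carries the attribute 主語＝“press”, which shows that its semantic subject is the element (here a noun phrase) 「一部の大衆紙と一般紙」 ("some tabloids and general newspapers") carrying the attribute 識別子＝“press”. Supplying the subject therefore yields 「（一部の大衆紙と一般紙が）自主規制する」. Filling in an omission from another part of the text in this way is called zero anaphora.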
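The way the 識別子 (identifier), 共参照 (coreference), and 主語 (subject) attributes point from one element to another can be illustrated with a short sketch. It assumes the tagged text has been normalized to well-formed XML with ASCII names; the tag names `doc`, `np`, `n`, `v` and the attribute names `id`, `coref`, `subj` are hypothetical stand-ins for the Japanese tag names in the example, and the sample sentence is abbreviated.

```python
import xml.etree.ElementTree as ET

# Hypothetical ASCII rendering of part of the tagged example sentence.
SAMPLE = (
    '<doc><np id="b-kai">Ａ氏のＢ会</np>が終わったＣ市で、'
    '<np id="press">一部の大衆紙と一般紙</np>が'
    '<np><n coref="b-kai">そ</n>の写真報道</np>を'
    '<v subj="press">自主規制する</v>方針を明らかにした。</doc>'
)

def resolve_links(xml_text):
    """Return (surface text, attribute, text of the element it points to)."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an identifier by that identifier.
    by_id = {el.get("id"): "".join(el.itertext())
             for el in root.iter() if el.get("id")}
    links = []
    for el in root.iter():  # document order
        for attr in ("coref", "subj"):
            target = el.get(attr)
            if target in by_id:
                links.append(("".join(el.itertext()), attr, by_id[target]))
    return links
```

Calling `resolve_links(SAMPLE)` pairs 「そ」 with 「Ａ氏のＢ会」 and 「自主規制する」 with 「一部の大衆紙と一般紙」, which is the information the replacement and zero-anaphora steps rely on.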
【００４５】
以下、本発明に係る実施の形態の文書処理装置の具体的な動作について説明する。本実施の形態の文書処理装置は、上述したようなタグ付けされた文書に対して、自動要約処理を行わせるものであり、この要約文作成の際に、代名詞や限定節の置き換え処理や、省略された主語を補うようなゼロ照応処理を行う。
【００４６】
文書処理装置において文書の要約文を作成する場合には、その文書が図１の表示部３０に文書が表示されている状態で、ユーザが入力部２０を操作し、自動要約モードに切り換える。制御部１１は、この自動要約モードに切り換えられたとき、図４に示すような自動要約文作成プログラムの初期画面を表示して、ユーザによる自動要約文作成の開始操作を待つ。
【００４７】
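That is, when the user switches to the automatic summary creation mode, the control unit 11 of FIG. 1 starts the automatic summary creation program stored on the hard disk drive 33 and controls the display unit 30 to display the initial screen of the program as shown in FIG. 4. In this example, the window 190 displayed on the display unit 30 is divided into three areas: a display area 200, which contains a document name display section 191 showing the name of the document, a keyword input section 192 into which keywords are entered, a summary creation execution button 193 for creating a summary of the document, and so on; a display area 210 in which the document is displayed; and a display area 220 in which the summary of the document is displayed.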
【００４８】
表示領域２００の文書名表示部１９１には、表示領域２１０に表示される文書の文書名等が表示される。また、キーワード入力部１９２には、例えば入力部２０のキーボード等を用いて文書の要約文を作成するためのキーワードが入力される。要約文作成実行ボタン１９３は、例えば入力部２０のマウス等を用いて押されることによって、表示領域２１０に表示されている文書の要約文作成処理を実行開始するための実行ボタンである。
【００４９】
表示領域２１０には、文書が表示される。表示領域２１０の右端には、スクロールバー２１１と、このスクロールバー２１１を上下に動かすためのボタン２１２，２１３が設けられており、ユーザが例えば入力部２０のマウス等を用いて、スクロールバー２１１を上下に直接動かしたり、ボタン２１２，２１３を押してスクロールバー２１１を上下に動かすことによって、表示領域２１０に表示される表示内容を縦方向にスクロールすることができる。ユーザは、入力部２０を操作することによって、表示領域２１０に表示されている文書の一部を選択して要約させることもでき、文書全体を要約させることもできる。
【００５０】
表示領域２２０には、要約文が表示される。図４においては、要約文がまだ作成されていない状態であるため、この表示領域２２０には、何も表示されていない。ユーザは、入力部２０を操作することによって、要約文の表示領域２２０の表示範囲（大きさ）を変更することができる。具体的には、ユーザは、同図に示す表示領域２２０の表示範囲（大きさ）を、例えば図５に示すように拡大することができる。
【００５１】
文書処理装置は、ユーザが例えば入力部２０のマウス等を用いて、要約文作成実行ボタン１９３を押してオン状態とすると、ＣＰＵ１３の制御のもとに、図６に示す処理を実行して要約文の作成を開始する。
【００５２】
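The process of creating a summary from a document is executed on the basis of the tagging that describes the document's internal structure. As shown earlier in FIG. 5, the size of the display area 220 of the window 190 can be changed. When the summary creation execution button 193 is operated after the window 190 has been newly drawn on the display unit 30, or after the size of the display area 220 has been changed, the document processing apparatus, under the control of the CPU 13, creates a summary that fits the display area 220 from the document of which at least a part is displayed in the display area 210 of the window 190.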
【００５３】
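First, as shown in FIG. 6, in step S21 the document processing apparatus performs, under the control of the CPU 13, a process called spreading activation. In this embodiment, the document is summarized by adopting the central activation value obtained by spreading activation as the degree of importance. That is, for a document tagged with its internal structure, spreading activation assigns to each element a central activation value that reflects that tagging.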
【００５４】
ここで、活性拡散は、中心活性値の高いエレメントと関わりのあるエレメントにも高い中心活性値を与えるような処理である。すなわち、活性拡散は、照応（anaphora；共参照（coreference））表現されたエレメントとその先行詞との間で中心活性値が等しくなり、それ以外では各中心活性値が同じ値に収束していく。この中心活性値は、文書の内部構造に関するタグ付けに応じて決定されるため、内部構造を考慮した文書の分析に利用することができる。
【００５５】
文書処理装置は、図７に示す一連の工程を経ることによって、活性拡散を実行する。
【００５６】
まず、文書処理装置は、図７に示すように、ステップＳ４１において、ＣＰＵ１３の制御のもとに、各エレメントの初期化を行う。文書処理装置は、語彙エレメントを除いた全てのエレメントと語彙エレメントとに対して中心活性値の初期値を割り当てる。例えば、文書処理装置は、中心活性値の初期値として、語彙エレメントを除いた全てのエレメントに対しては“１”を、語彙エレメントに対しては“０”を割り当てる。また、文書処理装置は、各エレメントの中心活性値の初期値に均一ではない値を予め割り当てることによって、活性拡散の結果得られた中心活性値に、初期値の偏りを反映させることができる。例えば、文書処理装置は、ユーザが関心を有するエレメントに対しては、中心活性値の初期値を高く設定することによって、ユーザの関心を反映した中心活性値を得ることができる。
【００５７】
エレメント間で参照・被参照による係り受けの関係にあるリンクである参照・被参照リンクと、それ以外のリンクである通常リンクとに関しては、エレメントを連結するリンクの端点の端点活性値を“０”に設定する。文書処理装置は、このようにして付与した端点活性値の初期値を例えばＲＡＭ１４に記憶させる。
【００５８】
ここで、エレメントとエレメントの連結構造の一例を図８に示す。同図においては、文書を構成するエレメントとリンクの構造の一部として、エレメントＥ_i及びエレメントＥ_jが示されている。エレメントＥ_iとエレメントＥ_jとは、それぞれ、中心活性値ｅ_i，ｅ_jを有し、リンクＬ_ijにて接続されている。リンクＬ_ijのエレメントＥ_iに接続する端点は、Ｔ_ijであり、エレメントＥ_jに接続する端点は、Ｔ_jiである。エレメントＥ_iは、リンクＬ_ijにより接続されるエレメントＥ_jの他に、リンクＬ_ik，Ｌ_il及びＬ_imにより図示しないエレメントＥ_k，Ｅ_l及びＥ_mにそれぞれ接続している。エレメントＥ_jは、リンクＬ_jiにより接続されるエレメントＥ_iの他に、リンクＬ_jp，Ｌ_jq及びＬ_jrにより図示しないエレメントＥ_p，Ｅ_q及びＥ_rにそれぞれ接続している。
【００５９】
続いて、文書処理装置は、図７中のステップＳ４２において、ＣＰＵ１３の制御のもとに、文書を構成するエレメントＥ_iを計数するカウンタの初期化を行う。すなわち、文書処理装置は、エレメントを計数するカウンタのカウンタ値ｉを“１”に設定する。このことにより、カウンタは、第１番目のエレメントＥ₁を参照していることになる。
【００６０】
続いて、文書処理装置は、ステップＳ４３において、ＣＰＵ１３の制御のもとに、カウンタが参照するエレメントについて、新たな中心活性値を計算するリンク処理を実行する。このリンク処理については、さらに後述する。
【００６１】
続いて、文書処理装置は、ステップＳ４４において、ＣＰＵ１３の制御のもとに、文書中の全てのエレメントについて新たな中心活性値の計算が完了したか否かを判断する。
【００６２】
ここで、文書処理装置は、文書中の全てのエレメントについて新たな中心活性値の計算が完了したことを判断した場合には、ステップＳ４５へと処理を移行し、一方、文書中の全てのエレメントについて新たな中心活性値の計算が完了していないことを判断した場合には、ステップＳ４７へと処理を移行する。
【００６３】
具体的には、文書処理装置は、ＣＰＵ１３の制御のもとに、カウンタのカウンタ値ｉが、文書が含むエレメントの総数に達したか否かを判断する。そして、文書処理装置は、カウンタのカウンタ値ｉが、文書が含むエレメントの総数に達したことを判断した場合には、全てのエレメントが計算済みであるものとして、ステップＳ４５へと処理を移行する。一方、文書処理装置は、カウンタのカウンタ値ｉが、文書が含むエレメントの総数に達していないことを判断した場合には、全てのエレメントについて計算が終了していないものとしてステップＳ４７へと処理を移行する。
【００６４】
文書処理装置は、カウンタのカウンタ値ｉが、文書が含むエレメントの総数に達していないことを判断した場合には、ステップＳ４７において、ＣＰＵ１３の制御のもとに、カウンタのカウント値ｉを“１”だけインクリメントさせ、カウンタのカウント値を“ｉ＋１”とする。このことにより、カウンタは、ｉ＋１番目のエレメント、すなわち次のエレメントを参照する。そして、文書処理装置は、ステップＳ４３へと処理を移行し、端点活性値の計算及びこれに続く一連の行程が、次のｉ＋１番目のエレメントについて実行される。
【００６５】
また、文書処理装置は、カウンタのカウンタ値ｉが、文書が含むエレメントの総数に達したことを判断した場合には、ステップＳ４５において、ＣＰＵ１３の制御のもとに、文書に含まれる全てのエレメントの中心活性値の変化分、すなわち新たに計算された中心活性値の元の中心活性値に対する変化分について平均値を計算する。
【００６６】
文書処理装置は、ＣＰＵ１３の制御のもとに、例えばＲＡＭ１４に記憶された元の中心活性値と新たに計算した中心活性値を、文書に含まれる全てのエレメントについて読み出す。文書処理装置は、新たに計算した中心活性値の元の中心活性値に対するそれぞれの変化分の総和を文書に含まれるエレメントの総数で除することにより、全てのエレメントの中心活性値の変化分の平均値を計算する。文書処理装置は、このように計算した全てのエレメントの中心活性値の変化分の平均値を、例えばＲＡＭ１４に記憶させる。
【００６７】
そして、文書処理装置は、ステップＳ４６において、ＣＰＵ１３の制御のもとに、ステップＳ４５で計算した全てのエレメントの中心活性値の変化分の平均値が、予め設定された閾値以内であるか否かを判断する。そして、文書処理装置は、この変化分が閾値以内であると判断した場合には、この一連の行程を終了する。一方、文書処理装置は、変化分が閾値以内でないと判断した場合には、ステップＳ４２へと処理を移行し、カウンタのカウント値ｉを“１”に設定して文書のエレメントの中心活性値を計算する一連の行程を再び実行する。文書処理装置においては、これらのステップＳ４２乃至ステップＳ４６のループが繰り返される毎に、変化分は、徐々に減少する。
【００６８】
文書処理装置は、このようにして活性拡散を行うことができる。つぎに、この活性拡散を行うためにステップＳ４３において実行されるリンク処理について図９を参照して説明する。なお、同図に示すフローチャートは、１つのエレメントＥ_iに対する処理を示したものであるが、この処理は、全てのエレメントに対して行われるものである。
【００６９】
まず、文書処理装置は、図９に示すように、ステップＳ５１において、ＣＰＵ１３の制御のもとに、文書を構成する１つのエレメントＥ_iと一端が接続されたリンクを計数するカウンタの初期化を行う。すなわち、文書処理装置は、リンクを計数するカウンタのカウント値ｊを“１”に設定する。このカウンタは、エレメントＥ_iと接続された第１番目のリンクＬ_ijを参照することになる。
【００７０】
続いて、文書処理装置は、ステップＳ５２において、ＣＰＵ１３の制御のもとに、エレメントＥ_iとＥ_jを接続するリンクＬ_ijについて、関係属性のタグを参照することによって、そのリンクＬ_ijが通常リンクであるか否かを判断する。文書処理装置は、リンクＬ_ijが、語に対応する語彙エレメント、文に対応する文エレメント、段落に対応する段落エレメント等の間の関係を示す通常リンクと、参照・被参照による係り受けの関係を示す参照リンクのいずれであるかを判断する。文書処理装置は、リンクＬ_ijが通常リンクであると判断した場合には、ステップＳ５３へと処理を移行し、リンクＬ_ijが参照リンクであると判断した場合には、ステップＳ５４へと処理を移行する。
【００７１】
文書処理装置は、リンクＬ_ijが通常リンクであると判断した場合には、ステップＳ５３において、エレメントＥ_iの通常リンクＬ_ijに接続された端点Ｔ_ijの新たな端点活性値を計算する処理を行う。
【００７２】
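In step S53, the determination in step S52 has established that link L_ij is a normal link. The new endpoint activation value t_ij of the endpoint T_ij, at which element E_i connects to the normal link L_ij, is obtained as follows: the endpoint activation values t_jp, t_jq, t_jr of all endpoints T_jp, T_jq, T_jr of element E_j that connect to links other than L_ij are added to the central activation value e_j of element E_j (the element to which E_i is connected by L_ij), and the sum is divided by the total number of elements in the document.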
【００７３】
文書処理装置は、ＣＰＵ１３の制御のもとに、例えばＲＡＭ１４から必要な端点活性値及び中心活性値を読み出す。文書処理装置は、読み出された端点活性値及び中心活性値について、上述のようにその通常リンクと接続された端点の新たな端点活性値を計算する。そして、文書処理装置は、このように計算した新たな端点活性値を、例えばＲＡＭ１４に記憶させる。
【００７４】
一方、文書処理装置は、リンクＬ_ijが通常リンクでないと判断した場合には、ステップＳ５４において、エレメントＥ_iの参照リンクに接続された端点Ｔ_ijの端点活性値を計算する処理を行う。
【００７５】
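In step S54, the determination in step S52 has established that link L_ij is a reference link. The endpoint activation value t_ij of the endpoint T_ij, at which element E_i connects to the reference link L_ij, is obtained by adding the endpoint activation values t_jp, t_jq, t_jr of all endpoints T_jp, T_jq, T_jr of element E_j that connect to links other than L_ij to the central activation value e_j of element E_j (the element to which E_i is connected by L_ij). Unlike the normal-link case, the sum is not divided by the number of elements.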
【００７６】
文書処理装置は、ＣＰＵ１３の制御のもとに、例えばＲＡＭ１４に記憶された端点活性値及び中心活性値から、必要な端点活性値及び中心活性値を読み出す。文書処理装置は、読み出された端点活性値及び中心活性値を用いて、上述のように参照リンクと接続された新たな端点活性値を計算する。そして、文書処理装置は、このように計算した端点活性値を、例えばＲＡＭ１４に記憶させる。
【００７７】
これらのステップＳ５３における通常リンクの処理及びステップＳ５４における参照リンクの処理は、ステップＳ５２からステップＳ５５に至り、ステップＳ５７を介してステップＳ５２に戻るループに示すように、カウント値ｉにより参照されているエレメントＥ_iに接続される全てのリンクＬ_ijに対して実行される。なお、ステップＳ５７では、エレメントＥ_iに接続されるリンクを計数するカウント値ｊをインクリメントしている。
【００７８】
文書処理装置は、これらのステップＳ５３又はステップＳ５４の処理を行った後、ステップＳ５５において、ＣＰＵ１３の制御のもとに、エレメントＥ_iに接続される全てのリンクについて端点活性値が計算されたか否かを判別する。そして、文書処理装置は、全てのリンクについて端点活性値が計算されていると判断した場合には、ステップＳ５６の処理へと移行し、全てのリンクについて端点活性値が計算されていないと判断した場合には、ステップＳ５７へと処理を移行する。
【００７９】
ここで、文書処理装置は、全てのリンクについて端点活性値が計算されていると判断した場合には、ステップＳ５６において、ＣＰＵ１３の制御のもとに、エレメントＥ_iの中心活性値ｅ_iの更新を実行する。
【００８０】
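The new value of the central activation value e_i of element E_i, that is, its updated value, is the sum of the current central activation value e_i and the new endpoint activation values of all endpoints of E_i: e_i’ = e_i + Σ_j t_ij’. Here the prime “’” denotes a new value. In other words, the new central activation value is obtained by adding the sum of the new endpoint activation values of the element's endpoints to the element's previous central activation value.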
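The update rules of steps S53, S54, and S56 can be restated compactly. This is only a restatement under assumed notation: N is the total number of elements in the document and \mathcal{N}(j) is the set of elements linked to E_j; neither symbol appears in the original text, and the endpoint sum in the last line is written over the links of E_i (t'_{ij}).

```latex
\begin{align}
t'_{ij} &= \frac{1}{N}\Bigl(e_j + \sum_{p \in \mathcal{N}(j)\setminus\{i\}} t_{jp}\Bigr)
  && \text{(normal link, step S53)} \\
t'_{ij} &= e_j + \sum_{p \in \mathcal{N}(j)\setminus\{i\}} t_{jp}
  && \text{(reference link, step S54)} \\
e'_i &= e_i + \sum_{j \in \mathcal{N}(i)} t'_{ij}
  && \text{(step S56)}
\end{align}
```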
【００８１】
文書処理装置は、ＣＰＵ１３の制御のもとに、例えばＲＡＭ１４に記憶された端点活性値及び中心活性値から必要な端点活性値を読み出す。文書処理装置は、上述したような計算を実行し、そのエレメントＥ_iの中心活性値ｅ_iを算出する。そして、文書処理装置は、計算した新たな中心活性値ｅ_iを例えばＲＡＭ１４に記憶させる。
【００８２】
このようにして、文書処理装置は、文書中の各エレメントについて、新たな中心活性値を計算する。そして、文書処理装置は、このようにして図６中のステップＳ２１における活性拡散を実行する。
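The loop of FIG. 7 and the link processing of FIG. 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Graph` class and the function names are hypothetical, elements are updated in place in index order, and the sweep limit `max_sweeps` is an added safety cap that the source does not mention.

```python
from dataclasses import dataclass, field

NORMAL, REFERENCE = "normal", "reference"

@dataclass
class Graph:
    center: list                                    # e_i for each element
    kind: dict = field(default_factory=dict)        # (i, j) -> link kind
    endpoint: dict = field(default_factory=dict)    # (i, j) -> t_ij
    neighbors: dict = field(default_factory=dict)   # i -> linked elements

    def link(self, i, j, kind=NORMAL):
        for a, b in ((i, j), (j, i)):
            self.kind[(a, b)] = kind
            self.endpoint[(a, b)] = 0.0             # S41: endpoints start at 0
            self.neighbors.setdefault(a, []).append(b)

def update_element(g, i):
    """Link processing for one element E_i (S51 to S57)."""
    n = len(g.center)
    for j in g.neighbors.get(i, []):
        # e_j plus the endpoints of E_j on every link except L_ij
        total = g.center[j] + sum(
            g.endpoint[(j, p)] for p in g.neighbors[j] if p != i)
        # S53 divides by the element count; S54 (reference link) does not
        g.endpoint[(i, j)] = total / n if g.kind[(i, j)] == NORMAL else total
    # S56: e_i' = e_i + sum of the new endpoint values of E_i
    g.center[i] += sum(g.endpoint[(i, j)] for j in g.neighbors.get(i, []))

def sweep(g):
    """One pass over all elements (S42 to S45); returns the mean change."""
    before = list(g.center)
    for i in range(len(g.center)):
        update_element(g, i)
    return sum(abs(a - b) for a, b in zip(g.center, before)) / len(g.center)

def spread(g, threshold=1e-3, max_sweeps=50):
    """Repeat sweeps until the mean change is within the threshold (S46)."""
    for _ in range(max_sweeps):
        if sweep(g) <= threshold:
            break
    return g.center
```

For example, a graph built with `Graph(center=[1.0, 0.0, 0.0])`, normal links 0-1 and 0-2, and a reference link 1-2 can be passed to `sweep` to observe how activation moves from the sentence-level element to the two lexical elements.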
【００８３】
続いて、文書処理装置は、図６中のステップＳ２２において、ＣＰＵ１３の制御のもとに、先に図４に示した表示部３０に表示されているウィンドウ１９０の表示領域２２０の大きさ、すなわちこの表示領域２２０に表示可能な最大文字数をＷ_sと設定する。また、文書処理装置は、ＣＰＵ１３の制御のもとに、要約文Ｓを初期化して初期値Ｓ₀＝””と設定する。これは、要約文に何も文字列が存在していないことを示す。文書処理装置は、このように設定した、表示領域２２０に表示可能な最大文字数Ｗ_s及び要約文Ｓの初期値Ｓ₀を、例えばＲＡＭ１４に記憶させる。
【００８４】
続いて、文書処理装置は、ステップＳ２３において、ＣＰＵ１３の制御のもとに、要約文の骨格の順次での作成をカウントするカウンタのカウント値ｉを“１”に設定する。すなわち、文書処理装置は、カウント値について、ｉ＝１と設定する。文書処理装置は、このように設定したカウント値ｉを例えばＲＡＭ１４に記憶させる。
【００８５】
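Next, in step S24, under the control of the CPU 13 and for the counter value i, the document processing apparatus extracts the skeleton of the sentence with the i-th highest average central activation value from the text to be summarized. Here, the average central activation value is the mean of the central activation values of the elements that make up a sentence. The apparatus reads the summary S_i-1 stored, for example, in the RAM 14, appends the character string of the extracted skeleton to it to form the summary S_i, and stores S_i, for example, in the RAM 14. At the same time, the apparatus creates a list l_i of the elements not included in the skeleton, ordered by central activation value, and stores this list, for example, in the RAM 14.
【００８６】
That is, in step S24 the document processing apparatus, under the control of the CPU 13, uses the result of spreading activation to select sentences in descending order of average central activation value and extracts the skeleton of each selected sentence. A sentence skeleton consists of the essential elements extracted from the sentence. An element can be essential if it is the head of an element; if it has the relation attribute subject, object, indirect object, possessor, cause, condition, or comparison; or if it is directly contained in a coordinate structure whose associated element is itself essential. The apparatus joins the essential elements of the sentence to generate the skeleton and adds it to the summary.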
【００８７】
続いて、文書処理装置は、ステップＳ２５において、ＣＰＵ１３の制御のもとに、要約文Ｓ_iの長さ、すなわち文字数が、ウィンドウ１９０の表示領域２２０の最大文字数Ｗ_sよりも多いか否かを判断する。
【００８８】
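If the apparatus determines that the number of characters in the summary S_i exceeds the maximum number of characters W_s, then in step S30, under the control of the CPU 13, it sets the summary S_i-1 as the final summary and ends the series of processes. Note that when this happens at i = 1, the summary output is S_i-1 = S₀ = “”, so nothing is displayed in the display area 220.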
【００８９】
一方、文書処理装置は、要約文Ｓ_iの文字数が最大文字数Ｗ_sよりも多くないと判断した場合には、ステップＳ２６の処理へと移行し、ＣＰＵ１３の制御のもとに、ｉ＋１番目に平均中心活性値が高い文の平均中心活性値と、ステップＳ２４で作成したリストｌ_iのエレメントの中で最も中心活性値が高いエレメントの中心活性値とを比較する。そして、文書処理装置は、ｉ＋１番目に平均中心活性値が高い文の平均中心活性値が、リストｌ_iのエレメントの中で最も中心活性値が高いエレメントの中心活性値よりも高いと判断した場合には、ステップＳ２７へと処理を移行する。一方、文書処理装置は、ｉ＋１番目に平均中心活性値が高い文の平均中心活性値が、リストｌ_iのエレメントの中で最も中心活性値が高いエレメントの中心活性値よりも高くないと判断した場合には、ステップＳ２８へと処理を移行する。
【００９０】
文書処理装置は、ｉ＋１番目に平均中心活性値が高い文の平均中心活性値が、リストｌ_iのエレメントの中で最も中心活性値が高いエレメントの中心活性値よりも高いと判断した場合には、ステップＳ２７において、ＣＰＵ１３の制御のもとに、カウンタのカウント値ｉを“１”だけインクリメントさせ、ステップＳ２４へと処理を戻す。
【００９１】
また、文書処理装置は、ｉ＋１番目に平均中心活性値が高い文の平均中心活性値が、リストｌ_iのエレメントの中で最も中心活性値が高いエレメントの中心活性値よりも高くないと判断した場合には、ステップＳ２８において、ＣＰＵ１３の制御のもとに、リストｌ_iのエレメントの中で最も中心活性値の高いエレメントｅを要約文Ｓ_iに加えてＳＳ_iを生成し、さらに、エレメントｅをリストｌ_iから削除する。そして、文書処理装置は、このようにして生成した要約文ＳＳ_iを例えばＲＡＭ１４に記憶させる。
【００９２】
続いて、文書処理装置は、ステップＳ２９において、ＣＰＵ１３の制御のもとに、要約文ＳＳ_iの文字数がウィンドウ１９０の表示領域２２０の最大文字数Ｗ_sよりも多いか否かを判別する。文書処理装置は、要約文ＳＳ_iの文字数が最大文字数Ｗ_sよりも多くないと判別した場合には、ステップＳ２６からの処理を繰り返す。一方、文書処理装置は、要約文ＳＳ_iの文字数が最大文字数Ｗ_sよりも多いと判別した場合には、ステップＳ３１において、ＣＰＵ１３の制御のもとに、要約文Ｓ_iを最終的な要約文として設定し、表示領域２２０に表示して一連の処理を終了する。このようにして、文書処理装置は、最大文字数Ｗ_sよりも多くならないように要約文を生成する。
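The selection loop of steps S22 to S31 can be sketched as below. Several simplifications are assumptions of this sketch, not statements from the source: each sentence is a dictionary with hypothetical keys `avg`, `skeleton`, and `extras`; an accepted non-skeleton element is appended to the end of the summary string rather than reinserted at its original position; and the list of leftover elements is carried over from one sentence to the next.

```python
def build_summary(sentences, max_chars):
    """sentences: dicts with 'avg' (average central activation value),
    'skeleton' (str) and 'extras' (list of (activation, text) pairs)."""
    ranked = sorted(sentences, key=lambda s: s["avg"], reverse=True)
    summary, extras, i = "", [], 0                   # S22, S23
    while i < len(ranked):
        candidate = summary + ranked[i]["skeleton"]  # S24
        if len(candidate) > max_chars:               # S25 -> S30
            return summary
        summary = candidate
        extras = sorted(extras + ranked[i]["extras"], reverse=True)
        next_avg = (ranked[i + 1]["avg"] if i + 1 < len(ranked)
                    else float("-inf"))
        # S26: keep adding elements while the best one is at least as
        # important as the next sentence
        while extras and next_avg <= extras[0][0]:
            _, text = extras.pop(0)                  # S28
            if len(summary) + len(text) > max_chars: # S29 -> S31
                return summary
            summary += text
        i += 1                                       # S27
    return summary
```

Because the budget `max_chars` plays the role of W_s, calling the function again with a larger value yields a longer summary from the same ranking, which mirrors the behaviour described for an enlarged display area.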
【００９３】
文書処理装置は、このような一連の処理を行うことによって、タグ付けされた文書を要約して要約文を作成することができる。文書処理装置は、例えば図４に示した文書を要約した場合には、図１０に示すような要約文を作成し、表示範囲の表示領域２２０に表示する。
【００９４】
すなわち、文書処理装置は、「TCP/IPの歴史はARPANETを抜きにして語ることはできない。ARPANETは1969年北米西海岸の４個所の大学、研究機関のホストコンピュータを50kbpsの回線で結んだ小規模なネットワークからARPANETは出発した。当時は1964年にメインフレームの汎用コンピュータシリーズが開発された。この時代背景を考えると、将来のコンピュータ通信の最盛を見越したこのようなプロジェクトは、まさに米国ならではのものであったといえるだろう。」という要約文を作成し、表示領域２２０に表示する。
【００９５】
文書処理装置においては、ユーザは、文書の全文章を一読する代わりに、この要約文を読むことで、文章の概要を理解し、この文章が所望する情報であるか否かを判定することができる。
【００９６】
なお、文書処理装置においては、文書中のエレメントに対して重要度を付与する方法としては、必ずしも上述したような活性拡散を用いる必要はなく、例えば、文書中に出現する単語の重みの総和を文書の重要度とする方法でもよい。また、重要度の付与方法は、これらの方法以外のものを利用することもできる。さらに、表示領域２００のキーワード入力部１９２にキーワードを入力することによって、そのキーワードに基づいた重要度の設定を行うこともできる。
【００９７】
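As shown earlier in FIG. 5, the document processing apparatus can enlarge the display range of the display area 220 of the window 190 displayed on the display unit 30. If the display range of the display area 220 is changed while a created summary is displayed in it, the amount of information in the summary can be changed to match the new display range.
【００９８】
In this case, under the control of the CPU 13, the apparatus waits until the display range of the display area 220 of the window 190 displayed on the display unit 30 is changed in response to the user operating the input unit 20. When the display range is changed, the apparatus performs, under the control of the CPU 13, the same series of processes as shown in FIG. 6 and creates a summary that matches the display range of the display area 220.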
【００９９】
文書処理装置は、このようにして、表示領域２２０の表示範囲に応じた要約文を新たに作成することができる。例えば、文書処理装置は、ユーザが入力部２０のマウスをドラッグ操作することにより表示領域２２０の表示範囲を拡大すると、より詳細な要約文を新たに作成し、図１１に示すように、新たな要約文をウィンドウ１９０の表示領域２２０に表示する。
【０１００】
すなわち、文書処理装置は、「TCP/IPの歴史はARPANETを抜きにして語ることはできない。ARPANETはアメリカ国防省DODの国防高等研究計画局がスポンサーとなって構築されてきた、実験および研究用のパケット交換ネットワークである。1969年北米西海岸の４個所の大学、研究機関のホストコンピュータを50kbpsの回線で結んだきわめて小規模なネットワークからARPANETは出発した。当時は1945年に世界初のコンピュータであるENIACがペンシルバニア大学で開発され、1964年にはじめてICを理論素子として実装したメインフレームの汎用コンピュータシリーズが開発され、やっとコンピュータが産声をあげたばかりあった。この時代背景を考えると、将来のコンピュータ通信の最盛を見越したこのようなプロジェクトは、まさに米国ならではのものであったといえるだろう。」という要約文を作成し、表示領域２２０に表示する。
【０１０１】
このように、文書処理装置においては、表示された要約文が簡略すぎて文書の概略を把握することができない場合、ユーザは、表示領域２２０の表示範囲を拡大することで、より多くの情報量を有するより詳細な要約文を参照することができる。
【０１０２】
ここで、文書処理装置は、このようにして文書の要約文を作成する際に、その要約文中に、代名詞や限定節が要約文に含まれる場合には置き換えを行い、また主語や目的語等が省略されている場合には対応する主語や目的語等を補うような上述したゼロ照応の処理を行っている。
【０１０３】
先ず、この代名詞等の置き換えやゼロ照応の具体例について、次のような文書を参照しながら説明する。
【０１０４】
「仕事について。
【０１０５】
わたしは今の仕事があまり好きではない。しかし、それをやらなければならない。」
この文書のタグ付けの例は、次のようになる。
【０１０６】
＜文書＞
＜タイトル＞＜形容動詞句関係＝“目的語”＞＜名詞句＞仕事＜／名詞句＞に＜／形容動詞句＞ついて＜／タイトル＞
＜段落＞
＜文＞＜形容動詞句関係＝“主語”＞＜名詞句識別子＝“識別子１”＞わたし＜／名詞句＞は＜／形容動詞句＞＜形容動詞句関係＝“目的語”＞＜名詞句
識別子＝“識別子０”＞＜形容動詞句＞今の＜／形容動詞句＞仕事＜／名詞句＞が＜／形容動詞句＞＜動詞＞＜形容動詞句関係＝“程度”＞あまり＜／形容動詞句＞＜動詞＞好きではない＜／動詞＞＜／動詞＞。＜／文＞＜文＞＜形容動詞句＞しかし、＜／形容動詞句＞＜動詞＞＜形容動詞句関係＝“目的語”＞＜名詞句参照＝“識別子０”＞それ＜／名詞句＞を＜／形容動詞句＞＜動詞主語＝“識別子１”＞やらなければならない＜／動詞＞＜／動詞＞。＜／文＞
＜／段落＞
＜／文書＞
【０１０７】
この文書において、「それ」という名詞句は、「参照＝“識別子０”」という属性を有し、「識別子＝“識別子０”」を含むエレメントである「今の仕事」という名詞句を参照している。すなわち、被参照エレメントである代名詞「それ」に対応する先行詞となる参照エレメントが「今の仕事」である。従って、要約文中に名詞句「それ」が含まれるにも拘わらず名詞句「今の仕事」が含まれていない場合には、要約文中の「それ」を「今の仕事」に置き換えるものである。
【０１０８】
また、上記文書において、「やらなければならない」という動詞は、「主語＝“識別子１”」という属性を有することから、その意味上の主語は、「識別子＝“識別子１”」という属性を有するエレメントである「わたし」という名詞句であることが分かる。すなわち、エレメント「やらなければならない」のゼロ照応エレメントが「わたし」である。従って、要約文中に「やらなければならない」が含まれているにも拘わらず意味上の主語「わたし」が含まれていない場合には、要約文中で、「（わたしが）やらなければならない」のようにゼロ照応エレメントを補うものである。
【０１０９】
このような代名詞等の置き換え処理やゼロ照応処理は、上述した要約文作成に続いて、あるいは要約文作成と同時に行われるものであり、これらの処理内容の具体例について、図１２及び図１３のフローチャートを参照しながら説明する。
【０１１０】
すなわち、図１２は、代名詞や限定節が要約文に含まれる場合の処理を説明するためのフローチャートであり、この図１２に示す処理は、例えば上記図６のステップＳ３０，Ｓ３１に続いて行われる。この図１２に示す処理において、文書中の参照・被参照関係における代名詞などの被参照エレメントをリストアップするために被参照リストRBListを用いており、要約用の語彙エレメントの配列のｉ番目の要素をｔ_i とし、この語彙エレメントｔ_i の参照エレメントをｒ_i としている。
【０１１１】
図１２の最初のステップＳ７１において、文書処理装置は、図１のＣＰＵ１３の制御のもとに、上記被参照リストRBListを空にする。次のステップＳ７２で、文書処理装置は、要約用の語彙エレメントを配列順にカウントするカウンタのカウント値ｉを１に設定する（ｉ＝１）。
【０１１２】
次のステップＳ７３で、文書処理装置は、要約用の語彙エレメントの配列のｉ番目のエレメントｔ_i に関して、該語彙エレメントｔ_i の被参照エレメント集合を被参照リストRBListに加える。また、語彙エレメントｔ_i の参照エレメントをｒ_i とする。このステップＳ７３での処理は、当該エレメントｔ_i を他の代名詞等が参照している場合には他の代名詞等を被参照リストRBListに加え、当該エレメントｔ_i が代名詞等であって他のエレメント（先行詞）を参照している場合には参照しているエレメント（先行詞）を参照エレメントｒ_i とするものである。
【０１１３】
次のステップＳ７４で、文書処理装置は、語彙エレメントｔ_i の参照エレメントｒ_i が存在するか否かを判別する。文書処理装置は、このステップＳ７４でＹES、すなわちｒ_i が存在する、と判別されたときはステップＳ７５に進み、ＮＯのときはステップＳ７６に進む。すなわち、ステップＳ７４での判別により、当該エレメントｔ_i が代名詞等であって参照エレメントｒ_i が存在しているときのみ、ステップＳ７５に進む。
【０１１４】
ステップＳ７５で、文書処理装置は、語彙エレメントｔ_i が上記被参照リストRBListの要素であるか否かを判別し、ＹESのときはステップＳ７６に進み、ＮＯのときはステップＳ７７に進む。ステップＳ７６で、文書処理装置は、語彙エレメントｔ_i を要約文に追加し、ステップＳ７９に進む。ステップＳ７７で、文書処理装置は、ｔ_i の参照エレメントｒ_i の語彙列を要約文に追加して、ステップＳ７８に進み、ｒ_i の被参照エレメント集合を被参照リストRBListに加えた後、ステップＳ７９に進む。
【０１１５】
これらのステップＳ７５〜Ｓ７８での処理は、エレメントｔ_i が代名詞等であって他のエレメントを参照している場合に、当該エレメントｔ_i が被参照リストRBListの要素であれば、すなわち既に先行詞が要約文中に含まれていれば、エレメントｔ_i を先行詞で置き換えることなくそのまま要約文に加え、エレメントｔ_i が被参照リストRBListの要素でなければ、先行詞がまだ要約文中にふくまれていないことから、当該エレメントｔ_i を先行詞である参照エレメントｒ_i で置き換えて要約文に加えるものである。
【０１１６】
ステップＳ７９で、文書処理装置は、要約用の語彙エレメントの配列の全てについてステップＳ７３以降の処理が終了したか否かを判別し、ＮＯのときはステップＳ８０にて上記カウント値ｉを１だけインクリメント（ｉ＝ｉ＋１）した後、ステップＳ７３に戻り、ＹESのときは処理を終了する。
【０１１７】
以上のような置き換え処理により、要約文中に代名詞や限定節等の被参照エレメントが存在するにも拘わらず、対応する先行詞としての参照エレメントが要約文中に含まれていない場合に、最初に現れた被参照エレメントｔ_i が参照エレメントｒ_i で置き換えられると共に、この参照エレメントｒ_i の被参照エレメント集合が被参照リストRBListに加えられるから、その後の同じ参照エレメントｒ_i に対応する被参照エレメントについては、置き換えされずにそのまま要約文に加えられることになる。
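The replacement procedure of FIG. 12 can be sketched as follows. The data layout is an assumption made for illustration: every element is a dictionary with a hypothetical `id`, its surface `text`, and an optional `ref` holding the identifier of its antecedent. `rb_list` plays the role of RBList.

```python
def replace_pronouns(summary, document):
    """summary: the elements chosen for the summary, in order.
    document: all elements of the original document."""
    by_id = {e["id"]: e for e in document}
    # For each antecedent id, the set of elements that refer to it.
    referrers = {}
    for e in document:
        if e.get("ref") is not None:
            referrers.setdefault(e["ref"], set()).add(e["id"])
    rb_list, out = set(), []                          # S71
    for t in summary:                                 # S72, S79, S80
        rb_list |= referrers.get(t["id"], set())      # S73
        r = by_id.get(t.get("ref"))                   # S73: antecedent r_i
        if r is None or t["id"] in rb_list:           # S74, S75
            out.append(t["text"])                     # S76: keep as is
        else:
            out.append(r["text"])                     # S77: substitute
            rb_list |= referrers.get(r["id"], set())  # S78
    return "".join(out)
```

With the two-sentence example about 「今の仕事」, summarizing only the second sentence turns 「それ」 into 「今の仕事」, while a summary that also contains the first sentence leaves 「それ」 untouched because its antecedent has already been emitted.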
【０１１８】
なお、図１２に示す具体的な置き換え処理の例については、上記図６に示す要約文作成の処理に続いて行うものとして説明しているが、要約文作成と同時に行わせてもよい。
【０１１９】
次に、図１３は、要約文に省略された主語や目的語等を含む文が存在する場合の前述したようなゼロ照応処理を説明するためのフローチャートであり、この図１３に示す処理は、例えば上記図６のステップＳ３０，Ｓ３１に続いて、上記図１２の処理の前、後、あるいは同時に行われる。この図１３に示す処理において、文書中の省略された主語や目的語等のゼロ照応エレメントをリストアップするためにゼロ照応リストZAListを用いており、要約用の語彙エレメントの配列のｉ番目の要素をｔ_i とし、この語彙エレメントｔ_i のゼロ照応エレメントをｚ_i としている。
【０１２０】
図１３の最初のステップＳ８１において、文書処理装置は、図１のＣＰＵ１３の制御のもとに、上記ゼロ照応リストZAListを空にする。次のステップＳ８２で、文書処理装置は、要約用の語彙エレメントを配列順にカウントするカウンタのカウント値ｉを１に設定する（ｉ＝１）。
【０１２１】
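In the next step S83, for the i-th element t_i of the array of lexical elements for the summary, the document processing apparatus sets z_i to the zero-anaphora element of t_i if one exists. In the next step S84, the apparatus determines whether a zero-anaphora element z_i exists for the lexical element t_i. If the result of step S84 is YES (z_i exists), processing proceeds to step S85; if NO, it proceeds to step S86.
【０１２２】
In step S85, the apparatus determines whether the lexical element t_i is a member of the zero-anaphora list ZAList; if YES it proceeds to step S86, and if NO to step S87. In step S86, it adds the lexical element t_i to the summary and proceeds to step S91. In step S87, it determines whether the zero-anaphora element z_i is already contained in the summary; if YES, it adds z_i to ZAList and proceeds to step S86, and if NO it proceeds to step S89. In step S89, the apparatus encloses the word string of the zero-anaphora element z_i of t_i in parentheses and adds it to the summary together with t_i; it then proceeds to step S90, adds t_i and z_i to ZAList, and proceeds to step S91. In steps S89 and S90, if z_i is a subject the apparatus appends 「が」 to its word string before enclosing it in parentheses, and if z_i is an object it appends 「を」; the parenthesized string is placed before or after t_i. In Japanese, the zero-anaphora element is placed before the lexical element.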
【０１２３】
文書処理装置は、ステップＳ９１で、要約用の語彙エレメントの配列の全てについてステップＳ８３以降の処理が終了したか否かを判別し、ＮＯのときはステップＳ９２にて上記カウント値ｉを１だけインクリメント（ｉ＝ｉ＋１）した後、ステップＳ８３に戻り、ＹESのときは処理を終了する。
【０１２４】
以上のようなゼロ照応処理により、要約文中に主語や目的語等が省略されたエレメントｔ_i が存在するにも拘わらず、対応するゼロ照応エレメントｚ_i が要約文中に含まれていない場合に、当該エレメントｔ_i にゼロ照応エレメントｚ_i が括弧でくくられて付加されると共に、このエレメントｔ_i とゼロ照応エレメントｚ_i とがゼロ照応リストZAListに加えられるから、その後の同じゼロ照応エレメントｚ_i に対応するエレメントについては、ゼロ照応エレメントｚ_i が付加されずにそのまま要約文に加えられることになる。
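The zero-anaphora procedure of FIG. 13 can be sketched in the same style. As before, the element dictionaries and the `zero` field (a pair of role and antecedent identifier) are hypothetical; the sketch treats "already contained in the summary" as "the antecedent is one of the summary elements or has already been inserted", which is one possible reading of step S87. `za_list` plays the role of ZAList.

```python
PARTICLE = {"subj": "が", "obj": "を"}  # particle appended by role

def fill_zero_anaphora(summary, document):
    """summary: the elements chosen for the summary, in order.
    document: all elements of the original document."""
    by_id = {e["id"]: e for e in document}
    present = {e["id"] for e in summary}
    za_list, out = set(), []                         # S81
    for t in summary:                                # S82, S91, S92
        zero = t.get("zero")                         # S83: (role, id) or None
        if zero is None or t["id"] in za_list:       # S84, S85
            out.append(t["text"])                    # S86
            continue
        role, zid = zero
        if zid in present or zid in za_list:         # S87: already there
            za_list.add(zid)
            out.append(t["text"])                    # S86
            continue
        # S89: parenthesized antecedent placed before the element (Japanese)
        out.append("（" + by_id[zid]["text"] + PARTICLE[role] + "）" + t["text"])
        za_list.update({t["id"], zid})               # S90
    return "".join(out)
```

When only the second sentence of the 「仕事」 example is summarized, the verb 「やらなければならない」 is emitted as 「（わたしが）やらなければならない」; when 「わたし」 is itself part of the summary, nothing is inserted.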
【０１２５】
ところで、これらの図１２に示す代名詞等の置き換え処理や、図１３に示すゼロ照応処理を、上述した図６に示した要約文作成に続いて行う場合には、要約文中の文字数が変化し、上述した要約文の最大文字数、すなわち上記要約文表示領域の大きさに応じて決まる表示可能な最大文字数を超えてしまったり、最大文字数よりも少なくなってしまうことがある。そこで、要約文中の文字数を上記最大文字数以内の最大の文字数にするために、最終的な文字数の調整作業が必要である。これは、上記代名詞等の置き換え処理やゼロ照応処理を行うことにより要約文中の文字数が最大文字数を超えた場合には、要約文中の重要度の低いエレメント、すなわち上記中心活性値の低いエレメントから順次削除して、要約文中の文字数が上記最大文字数以内に収まるようにする。また、文字数が最大文字数よりも少なくなった場合には、元の文書中の要約文に含まれないエレメントの内の最も中心活性値が高いエレメントから順に要約文中に付加して行き、上記最大文字数を超える直前でエレメントの付加を停止することで、上記最大文字数に最も近く、最大文字数以内の文字数の要約文を得ることができる。
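The final length adjustment described above can be sketched as follows. The representation is an assumption: each element is an (activation, text) pair, `selected` holds the elements currently in the summary, and `unused` holds elements of the original document that are not in it. For simplicity the sketch appends added elements at the end instead of restoring their original positions.

```python
def fit_to_budget(selected, unused, max_chars):
    """Trim or extend a summary so it stays within max_chars."""
    selected = list(selected)

    def length(items):
        return sum(len(text) for _, text in items)

    if length(selected) > max_chars:
        # Over budget: delete the least important elements first.
        while selected and length(selected) > max_chars:
            selected.remove(min(selected, key=lambda p: p[0]))
    else:
        # Under budget: add the most important unused elements and stop
        # just before the limit would be exceeded.
        for item in sorted(unused, key=lambda p: p[0], reverse=True):
            if length(selected) + len(item[1]) > max_chars:
                break
            selected.append(item)
    return selected
```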
【０１２６】
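In this embodiment, one example of a tagging method for documents has been shown, but the present invention is of course not limited to this tagging method. Also, in this embodiment the document is transmitted from outside to the communication unit 22 of the document processing apparatus over a telephone line, but the present invention is not limited to this. For example, the invention is also applicable when documents are transmitted via satellite or the like; documents may also be read from the recording medium 32 by the recording/reproducing unit 31, or be written in advance in the ROM 15 of the document processing apparatus.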
【０１２７】
また、本発明の実施の形態においては、上記図１の記録媒体３２として、上述した文書処理プログラムが書き込まれたディスク状記録媒体やテープ状記録媒体等を提供することも容易に実現できる。さらに、上述した文書処理プログラムについては、通信回線等の伝送媒体を介して供給することも容易に実現できる。
【０１２８】
また、上述の実施の形態においては、文書処理装置の表示部３０に表示された文書から所望のエレメントを選択するデバイスとしてマウスを例示したが、本発明がこれに限定されないことはいうまでもない。文書処理装置におけるエレメントの入力には、タブレット、ライトペン等の他のデバイスを利用することができる。
【０１２９】
さらに、上述の実施の形態においては日本語の文章を例示したが、本発明は、日本語に限定されず、英語、ドイツ語、フランス語、ロシア語、イタリア語、スペイン語、中国語、韓国語等の種々の言語に適用できることはいうまでもない。
【０１３０】
【発明の効果】
以上の説明からも明らかなように、本発明によれば、文書の要約文を作成し、作成される要約文中における省略された主語又は目的語が該要約文中に含まれていないとき、元の文書中の対応する主語又は目的語を要約文中に追加することにより、ゼロ照応エレメントが要約文中に必ず１回は現れることになり、ユーザの理解が容易で正確な内容の要約文を自動生成することができる。
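[Brief description of the drawings]
[FIG. 1] A block diagram showing the schematic configuration of a document processing apparatus to which this embodiment is applied.
[FIG. 2] A diagram showing an example of the internal structure of a document produced by tagging.
[FIG. 3] A diagram showing a window displaying the internal structure of a document produced by tagging.
[FIG. 4] A diagram showing a window displaying a document.
[FIG. 5] A diagram showing a window displaying a document, in which the display area for the summary is enlarged compared with the display area shown in FIG. 4.
[FIG. 6] A flowchart explaining the series of processes for creating a summary.
[FIG. 7] A flowchart explaining the series of processes for performing spreading activation.
[FIG. 8] A diagram showing the connection structure of elements, used to explain spreading activation.
[FIG. 9] A flowchart explaining the series of processes for the link processing of spreading activation.
[FIG. 10] A diagram showing a window displaying a document and its summary.
[FIG. 11] A diagram showing a window displaying a document and its summary, in which the summary is displayed in the window shown in FIG. 5.
[FIG. 12] A flowchart explaining the process of replacing a referenced element contained in the summary with its reference element.
[FIG. 13] A flowchart explaining zero-anaphora processing in the summary.
[Explanation of reference signs]
10: main body of the document processing apparatus; 11: control unit; 12: interface; 13: CPU; 20: input unit; 22: communication unit; 30: display unit; 31: recording/reproducing unit; 32: recording medium; 33: hard disk drive
[0001]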
[Technical field of the invention]
The present invention relates to a document processing method and apparatus for processing an electronic document, and a recording medium on which a document processing program for processing the electronic document is recorded.
[0002]
[Prior art]
Conventionally, the WWW (World Wide Web) has been known as an application service that provides hypertext information in a window format on the Internet.
[0003]
The WWW is a system that performs document processing for creating, publishing, or sharing a document, and shows a new style of document. However, from the viewpoint of practical use of documents, advanced document processing exceeding WWW, such as document classification and summarization based on document contents, is required. For such advanced document processing, mechanical processing of document contents is indispensable.
[0004]
However, mechanical processing of document contents remains difficult for the following reasons. First, HTML (Hyper Text Markup Language), the language used to describe hypertext, specifies how a document is presented but says almost nothing about its content. Second, the hypertext network formed between documents is not necessarily easy for a reader to use in understanding the content of a document. Third, authors generally write without the reader's convenience in mind, and the reader's convenience is never reconciled with the author's.
[0005]
As described above, the WWW is a system that showed what a new kind of document could be, but because it does not process documents mechanically, it cannot perform advanced document processing. In other words, advanced document processing requires that documents be processed mechanically.
[0006]
Therefore, systems that support the mechanical processing of documents have been developed on the basis of natural language research. As one such approach, mechanical document processing has been proposed that presupposes attribute information about the internal structure of a document, so-called tags, attached by the author or others, and that makes use of those tags.
[0007]
Meanwhile, with the recent spread of computers and the progress of networking, there is demand for more sophisticated document processing that creates, labels, and modifies text documents, for example in text processing and content-dependent indexing. For instance, summarization and classification of documents according to the user's needs are desired.
[0008]
That is, the user uses an information search system such as a so-called search engine to search for desired information from a vast amount of information provided via the Internet. This information retrieval system is a system that retrieves information based on a specified keyword and provides the retrieved information to a user. The user selects desired information from the provided information.
[0009]
An information retrieval system makes it easy to search for information in this way, but the user still has to read through each retrieved item, grasp its outline, and judge whether it is the desired information. This work is a heavy burden on the user, particularly when a large amount of information is returned. For this reason, so-called automatic summarization systems, which automatically summarize text information, that is, the contents of a document, have recently attracted attention.
[0010]
The automatic summary sentence creation system is a system that creates a summary sentence by reducing the length and complexity of text information while retaining the original information, that is, the meaning of the document. The user can understand the outline of the document by reading the summary sentence created by the automatic summary sentence creation system.
[0011]
Usually, an automatic summary sentence creation system uses sentences and words in text as one unit, and assigns importance based on some information to order them. Then, the automatic summary sentence creation system creates a summary sentence by gathering together sentences and words ordered in higher rank.
[0012]
[Problems to be solved by the invention]
The above-described automatic summary sentence creation system can create a summary sentence from a document. However, the amount of information in the summary sentence to be created is determined by the amount of information in the document or the like. Therefore, when, for example, the created summary sentence is too brief for the user to grasp the outline of the document, the user cannot refer to a more detailed summary sentence.
[0013]
In addition, when a part of the original text in which the subject or the like is omitted is included in the summary sentence, accurate contents cannot be grasped from the summary sentence unless the omitted part is also included in the summary sentence.
[0014]
The present invention has been proposed in view of the above-described circumstances, and an object of the present invention is to provide a document processing method and apparatus capable of automatically generating, for an input document, a summary sentence with accurate content that is easy for the user to understand, and a recording medium on which a document processing program is recorded.
[0015]
[Means for Solving the Problems]
  In order to solve the above-described problems, the present invention provides a document processing method and apparatus for processing a document in the form of an electronic document, characterized in that a summary sentence of the document is created and, if the created summary sentence contains an omitted or replaced phrase, that phrase is complemented. The document has an internal structure in which a plurality of elements are hierarchized, and tag information indicating the internal structure is assigned in advance. The importance of each element is calculated based on the internal structure indicated by the tag information and assigned to the element, and the complementing is performed based on the tag information.
  Preferably, the complementing includes zero anaphora processing that, based on the tag information, adds the corresponding subject or object in the original document to the summary sentence when a subject or object omitted in the created summary sentence is not included in the summary sentence.
  Preferably, the complementing also includes, based on the tag information, replacing a referenced element included in the created summary sentence with the corresponding reference element in the original document when that reference element is not included in the summary sentence.
[0016]
  Here, the omitted subject or object is called a zero anaphoric element. It is determined whether or not the element corresponding to this zero anaphoric element is included in the summary sentence, and if it is not, the corresponding element is preferably added to the summary sentence in parentheses.
  The referenced element is, for example, a pronoun or a limiting clause, and the reference element is, for example, an antecedent. At the time of the above replacement, it is determined whether or not the reference element corresponding to the referenced element in the summary sentence is included in the summary sentence. Preferably, the referenced element in the summary sentence is replaced with the corresponding reference element when that reference element is not included in the summary sentence, and is left unreplaced when that reference element is included in the summary sentence.
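The two decisions described above (add an omitted subject in parentheses, and replace a pronoun with its antecedent, each only when the antecedent is absent from the summary) can be sketched as below. The dictionary keys `identifier`, `co_reference`, and `subject`, the function name `complement`, and the sample data are assumptions made for this illustration and are not the patent's data format.

```python
def complement(summary, antecedents):
    """Return the summary texts with missing antecedents filled in."""
    # Identifiers of elements that already appear in the summary.
    present = {e["identifier"] for e in summary if "identifier" in e}
    result = []
    for element in summary:
        text = element["text"]
        ref = element.get("co_reference")
        if ref is not None and ref not in present:
            # Antecedent is missing: replace the referenced element with it.
            text = antecedents[ref]
        subject = element.get("subject")
        if subject is not None and subject not in present:
            # Omitted subject is missing: add it in parentheses.
            text = "(" + antecedents[subject] + ") " + text
        result.append(text)
    return result


antecedents = {
    "press": "some popular papers and general papers",
    "B society": "Mr. A's B society",
}
summary = [
    {"text": "self-regulating", "subject": "press"},
    {"text": "that", "co_reference": "B society"},
]
print(complement(summary, antecedents))
```

When the antecedent itself is part of the summary (an element carrying the matching `identifier`), the sketch leaves the pronoun or the omitted subject untouched, mirroring the condition stated in the text.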
[0017]
  This prevents a subject or object omitted in the summary sentence from being left missing. It also prevents the summary sentence from containing pronouns or clauses without antecedents.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of a document processing method and apparatus and a recording medium on which a document processing program is recorded will be described with reference to the drawings.
[0019]
As shown in FIG. 1, a document processing apparatus according to an embodiment of the present invention includes a main body 10 having a control unit 11 and an interface 12, an input unit 20 that receives input from a user and sends it to the main body 10, a communication unit 22 that receives signals from the outside and sends them to the main body 10, a display unit 30 that displays output from the main body 10, and a recording/reproducing unit 31 that records information on and reproduces information from a recording medium 32.
[0020]
The main body 10 includes the control unit 11 and the interface 12 and constitutes the main part of the document processing apparatus. The control unit 11 includes a CPU 13 that executes processing in the document processing apparatus, a RAM 14 that is a volatile memory, and a ROM 15 that is a nonvolatile memory. The CPU 13 performs control for executing each process according to the procedure of a program recorded in, for example, the ROM 15. The RAM 14 temporarily stores programs and data necessary for the CPU 13 to execute various processes. The interface 12 is connected to the control unit 11, the input unit 20, the communication unit 22, the display unit 30, and the recording/reproducing unit 31. Under the control of the control unit 11, the interface 12 handles data input from the input unit 20 and the communication unit 22, data output to the display unit 30, and data input to and output from the recording/reproducing unit 31. Specifically, it adjusts the timing of data transfer and converts data formats.
[0021]
The input unit 20 is a part that receives a user's input to the document processing apparatus, and includes, for example, a keyboard and a mouse. The user can use the input unit 20 to input a keyword using a keyboard, or to select and input an element of an electronic document displayed on the display unit 30 using a mouse. Hereinafter, an electronic document is simply referred to as a document. Here, an element is an element constituting a document, and includes, for example, a document, a sentence, a phrase, and a word.
[0022]
The communication unit 22 is a part that receives signals transmitted to the document processing apparatus from the outside via a communication path such as a telephone line. Specifically, the communication unit 22 includes, for example, a modem or a terminal adapter, is connected to the Internet 23 via a telephone line, and can access a server 24 connected to the Internet 23 and receive data such as documents from it. The communication unit 22 receives data such as a plurality of documents transmitted from the outside and sends the received data to the main body 10.
[0023]
The display unit 30 displays characters and image information output from the document processing apparatus. The display unit 30 includes, for example, a cathode ray tube (CRT) or a liquid crystal display (LCD), and displays, for example, one or more windows in which characters, figures, images, and the like are shown.
[0024]
The recording/reproducing unit 31 records data on and/or reproduces data from a removable recording medium 32 such as a floppy disk, an optical disk, or a magneto-optical disk. The recording medium 32 stores a document processing program for processing documents and the documents to be processed.
[0025]
The hard disk drive 33 performs data recording and / or reproduction with respect to a hard disk which is a large-capacity magnetic recording medium.
[0026]
Such a document processing apparatus receives a desired document and displays it on the display unit 30 as follows.
[0027]
In the document processing apparatus, first, when the user operates the input unit 20 to start a program for performing communication via the Internet 23 and inputs a URL (Uniform Resource Locator) of the server 24 (search engine), the control unit 11 controls the communication unit 22 and accesses the server 24.
[0028]
In response to this, the server 24 outputs search screen data to the communication unit 22 of the document processing apparatus via the Internet 23. In the document processing apparatus, the CPU 13 outputs this data to the display unit 30 via the interface 12 and displays it.
[0029]
In the document processing apparatus, when the user enters a keyword or the like on the search screen using the input unit 20 and instructs a search, a search command is sent from the communication unit 22 via the Internet 23 to the server 24 serving as a search engine.
[0030]
When the server 24 receives the search command, the server 24 executes the search command and transmits the obtained search result to the communication unit 22 via the Internet 23. In the document processing apparatus, the control unit 11 controls the communication unit 22 to receive the search result transmitted from the server 24 and display a part of the search result on the display unit 30.
[0031]
Specifically, when the user enters a keyword such as “TCP” using the input unit 20 and instructs a search, various information containing the keyword “TCP” is transmitted from the server 24 to the document processing apparatus and displayed on the display unit 30.
[0032]
Next, the document in the present embodiment will be described. In the present embodiment, document processing is performed with reference to tags, which are attribute information given to the document. The tags used in the present embodiment include syntactic tags, which indicate the structure of the document, and semantic/pragmatic tags, which enable machine understanding of the contents of documents across multiple languages.
[0033]
Some syntactic tags describe the internal structure of a document. As shown in FIG. 2, the internal structure described by tagging consists of elements such as the document, sentences, and vocabulary elements, which are connected to one another by normal links and by reference/referenced links. In the figure, a white circle “◯” indicates an element, and the white circles at the lowest level are vocabulary elements corresponding to the words at the lowest level of the document. A solid line is a normal link indicating a connection between elements such as a document, a sentence, and a vocabulary element. A broken line is a reference link indicating a dependency relationship by reference/referenced. From top to bottom, the internal structure of a document is composed of the document, subdivisions, paragraphs, sentences, subsentential segments, ..., and vocabulary elements. Of these, subdivisions and paragraphs are optional.
[0034]
On the other hand, semantic/pragmatic tags include tags that describe semantic information, such as syntactic structure indicating the referent of a pronoun or the like and the sense of a polysemous word. Tagging in the present embodiment is based on the XML (eXtensible Markup Language) format, which is similar to HTML (Hyper Text Markup Language).
[0035]
Examples of tagged sentences and documents are shown below, but tagging of documents is not limited to this method. In the following, examples of English and Japanese documents will be shown, but it goes without saying that the description of the internal structure by tagging can be applied to other languages as well.
[0036]
For example, the sentence “Time flies like an arrow.” can be tagged as follows.
[0037]
<sentence><noun phrase meaning=“time0”>time</noun phrase>
<verb phrase><verb meaning=“fly1”>flies</verb>
<adjective verb phrase><adjective verb meaning=“like0”>like</adjective verb>
<noun phrase>an <noun meaning=“arrow0”>arrow</noun></noun phrase>
</adjective verb phrase></verb phrase>.</sentence>
Here, <sentence>, <noun>, <noun phrase>, <verb>, <verb phrase>, <adjective verb>, and <adjective verb phrase> represent the syntactic structure of the sentence: sentence, noun, noun phrase, verb, verb phrase, adjective/adverb (including prepositions and postpositions), and adjective/adverb phrase (including prepositional and postpositional phrases), respectively. A tag is placed immediately before the beginning of an element and, correspondingly, immediately after its end. The tag placed immediately after the end of an element marks the end of the element with the symbol “/”. An element is a syntactic constituent, that is, a phrase, clause, or sentence. The attribute meaning = “time0” indicates the 0th of the multiple meanings of the word “time”. Specifically, the word “time” has at least noun, adjective, and verb senses, and here it is used as a noun. Similarly, the word “orange” has at least the senses of a plant name, a color, and a fruit, which can likewise be distinguished by the meaning attribute.
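As a rough illustration of how such tagging could be read by a program, the sketch below parses an XML rendering of the same sentence with Python's standard `xml.etree.ElementTree` module. The ASCII tag names (`noun_phrase`, `adj_verb`, and so on) are assumptions of this sketch, chosen because XML names cannot contain spaces; they are not the tag names used in the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the tagged sentence shown above.
TAGGED = (
    '<sentence><noun_phrase meaning="time0">time</noun_phrase> '
    '<verb_phrase><verb meaning="fly1">flies</verb> '
    '<adj_verb_phrase><adj_verb meaning="like0">like</adj_verb> '
    '<noun_phrase>an <noun meaning="arrow0">arrow</noun></noun_phrase>'
    '</adj_verb_phrase></verb_phrase>.</sentence>'
)


def word_senses(xml_text):
    """List (word, sense) pairs for every element carrying a meaning attribute."""
    root = ET.fromstring(xml_text)
    return [
        (el.text.strip(), el.get("meaning"))
        for el in root.iter()
        if el.get("meaning") is not None
    ]


def plain_text(xml_text):
    """Recover the untagged sentence by concatenating all text nodes."""
    return "".join(ET.fromstring(xml_text).itertext())


print(word_senses(TAGGED))
print(plain_text(TAGGED))
```

The nesting of the parsed tree corresponds to the hierarchy of syntactic constituents, while the `meaning` attribute selects one sense of a polysemous word.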
[0038]
As shown in FIG. 3, the syntactic structure of a document used in this embodiment can be displayed in a window 101 of the display unit 30 in FIG. 1. In this window 101, vocabulary elements are displayed on the right half surface 103, and the internal structure of the sentence is displayed on the left half surface 102. In this window 101, the syntactic structure can be displayed not only for documents written in Japanese but also for documents written in any other language, such as English.
[0039]
In this window 101, as an example, part of the following document, whose internal structure is described by tagging, is displayed: “In C city, where Mr. A's B society ended, some popular papers and general papers revealed on the page a policy of self-regulating that photo report.” An example of tagging this document is shown below.
[0040]
<document><sentence><adjective verb phrase relation=“position”><noun phrase><adjective verb phrase location=“C city”>
<adjective verb phrase relation=“subject”><noun phrase identifier=“B society”><adjective verb phrase relation=“affiliation”><person name identifier=“Mr. A”>Mr. A</person name>'s</adjective verb phrase><organization name identifier=“B society”>B society</organization name></noun phrase></adjective verb phrase>
ended</adjective verb phrase><place name identifier=“C city”>C city</place name></noun phrase></adjective verb phrase><adjective verb phrase relation=“subject”><noun phrase identifier=“press” syntax=“parallel”><noun phrase><adjective verb phrase>some</adjective verb phrase>popular papers</noun phrase>and<noun>general papers</noun></noun phrase></adjective verb phrase>
<adjective verb phrase relation=“object”><adjective verb phrase relation=“content” subject=“press”><adjective verb phrase relation=“object”><noun phrase><adjective verb phrase><noun co-reference=“B society”>that</noun></adjective verb phrase>photo report</noun phrase></adjective verb phrase>
self-regulating</adjective verb phrase>policy</adjective verb phrase>
<adjective verb phrase relation=“position”>on the page</adjective verb phrase>
revealed.</sentence></document>
In this document, “some popular papers and general papers” is marked as a parallel construction by the attribute syntax = “parallel”. Parallel is defined as sharing a dependency relationship. Unless otherwise specified, for example, <noun phrase relation = x><noun>A</noun><noun>B</noun></noun phrase> indicates that A depends on B. Here, relation = x represents a relation attribute.
[0041]
Relation attributes describe mutual relationships in syntax, semantics, and rhetoric. Grammatical functions such as subject, object, and indirect object; thematic roles such as agent, patient, and beneficiary; and rhetorical relations such as reason and result are described by relation attributes. In the present embodiment, relation attributes are described for relatively simple grammatical functions such as subject, object, and indirect object.
[0042]
In this document, the attributes of proper nouns such as “Mr. A”, “B society”, and “C city” are described by the person name, organization name, and place name tags. Words to which such tags are given are proper nouns.
[0043]
Further, in such a tagged document, reference/referenced relationships involving pronouns and limiting clauses are represented by tags. For example, in the above document, the “that” part of the element “that photo report” has the attribute co-reference = “B society”, which indicates that “that” refers to the element (in this case, a noun phrase) having the attribute identifier = “B society”, namely “Mr. A's B society”. Therefore, replacing the “that” part of “that photo report” yields “the photo report of Mr. A's B society”.
[0044]
Further, in such a tagged document, an omitted subject, object, or the like can be supplemented from another part of the document. That is, in the example document above, the element “self-regulating” has the attribute subject = “press”, which indicates that its semantic subject is the element (in this case, a noun phrase) having the attribute identifier = “press”, namely “some popular papers and general papers”. Therefore, when the subject is supplemented, the element becomes “(some popular papers and general papers) self-regulating”. Supplementing an omission from another part of the document in this way is called zero anaphora.
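The attribute lookups described in the two paragraphs above can be sketched as follows. The XML fragment is a simplified English rendering written for this illustration, with hypothetical short tag names (`np`, `ajp`, `n`); only the attribute names `identifier`, `co-reference`, and `subject` follow the example in the text.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical rendering of the example document.
DOC = (
    '<sentence>'
    '<np identifier="B society">Mr. A\'s B society</np> ended. '
    '<np identifier="press">Some popular papers and general papers</np> '
    'revealed a policy of <ajp subject="press">self-regulating</ajp> '
    '<np><n co-reference="B society">that</n> photo report</np>.'
    '</sentence>'
)


def text_of(element):
    """Concatenate all text contained in an element."""
    return "".join(element.itertext())


def resolve(xml_text):
    """Return (attribute, element text, antecedent text) for each reference."""
    root = ET.fromstring(xml_text)
    # Index every element that declares an identifier.
    by_id = {
        el.get("identifier"): text_of(el)
        for el in root.iter()
        if el.get("identifier") is not None
    }
    links = []
    for el in root.iter():
        for attr in ("co-reference", "subject"):
            target = el.get(attr)
            if target is not None:
                links.append((attr, text_of(el), by_id[target]))
    return links


for link in resolve(DOC):
    print(link)
```

Each returned triple names the kind of reference, the referring element, and the text of the element it points to, which is the information needed to replace a pronoun or to supplement an omitted subject.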
[0045]
The specific operation of the document processing apparatus according to the embodiment of the present invention will be described below. The document processing apparatus according to the present embodiment performs automatic summarization on a tagged document as described above and, when creating the summary sentence, performs zero anaphora processing to supplement an omitted subject.
[0046]
When creating a summary sentence of a document in the document processing apparatus, the user operates the input unit 20 while the document is displayed on the display unit 30 in FIG. 1, and switches to the automatic summary mode. When the control unit 11 is switched to the automatic summary mode, the control unit 11 displays an initial screen of the automatic summary sentence creation program as shown in FIG. 4 and waits for the user to start the automatic summary sentence creation operation.
[0047]
That is, when the user switches to the automatic summary sentence creation mode, the control unit 11 in FIG. 1 starts the automatic summary sentence creation program stored in the hard disk drive 33 and controls the display unit 30 to display the initial screen of the automatic summary sentence creation program as shown in FIG. 4. In this example, the window 190 displayed on the display unit 30 is divided into a display area 200, in which a document name display unit 191 for displaying a document name, a keyword input unit 192 for entering a keyword, a summary sentence creation execution button 193 for executing creation of a summary sentence of the document, and the like are displayed; a display area 210, in which the document is displayed; and a display area 220, in which the summary sentence of the document is displayed.
[0048]
In the document name display portion 191 of the display area 200, the document name of the document displayed in the display area 210 is displayed. Further, a keyword for creating a summary sentence of a document is input to the keyword input unit 192 using, for example, the keyboard of the input unit 20 or the like. The summary sentence creation execution button 193 is an execution button for starting execution of the summary sentence creation process for the document displayed in the display area 210 when pressed using, for example, the mouse of the input unit 20.
[0049]
A document is displayed in the display area 210. A scroll bar 211 and buttons 212 and 213 for moving the scroll bar 211 up and down are provided at the right end of the display area 210. The user can scroll the content displayed in the display area 210 vertically by, for example, using the mouse of the input unit 20 to move the scroll bar 211 up and down directly, or by pressing the buttons 212 and 213 to move the scroll bar 211 up and down. By operating the input unit 20, the user can select a part of the document displayed in the display area 210 and have it summarized, or have the entire document summarized.
[0050]
A summary sentence is displayed in the display area 220. In FIG. 4, since a summary sentence has not yet been created, nothing is displayed in this display area 220. The user can change the display range (size) of the summary sentence display area 220 by operating the input unit 20. Specifically, the user can, for example, enlarge the display range (size) of the display area 220 shown in FIG. 4, as illustrated in the drawings.
[0051]
When the user presses the summary sentence creation execution button 193 using, for example, the mouse of the input unit 20 to turn it on, the document processing apparatus executes the process shown in FIG. 6 and starts creating a summary sentence.
[0052]
The process of creating a summary sentence from a document is executed based on the tagging of the internal structure of the document. In the document processing apparatus, the size of the display area 220 of the window 190 can be changed, as illustrated in the drawings. When the summary sentence creation execution button 193 is operated after the window 190 has been newly drawn on the display unit 30 or the size of the display area 220 has been changed, the document processing apparatus, under the control of the CPU 13, creates a summary sentence that fits the display area 220 from the document at least part of which is displayed in the display area 210 of the window 190.
[0053]
First, as shown in FIG. 6, in step S21 the document processing apparatus performs a process called active diffusion (spreading activation) under the control of the CPU 13. In this embodiment, a document is summarized by adopting the central activation value obtained by active diffusion as the importance. That is, in a document tagged with its internal structure, performing active diffusion gives each element a central activation value that reflects the tagging of the internal structure.
[0054]
Here, active diffusion is a process that gives a high central activation value to elements related to an element having a high central activation value. In other words, active diffusion is an operation in which the central activation value becomes equal between an anaphoric (coreferential) expression and its antecedent, and in which the central activation value otherwise converges to the same value. Since the central activation value is determined according to the tagging of the internal structure of the document, it can be used for document analysis that takes the internal structure into account.
[0055]
The document processing apparatus performs active diffusion through the series of steps shown in FIG. 7.
[0056]
First, as shown in FIG. 7, in step S41 the document processing apparatus initializes each element under the control of the CPU 13. The document processing apparatus assigns an initial central activation value to every element other than the vocabulary elements and to every vocabulary element. For example, it assigns “1” to all elements other than vocabulary elements and “0” to vocabulary elements as the initial central activation value. The document processing apparatus can also assign non-uniform initial central activation values to the elements in advance, so that the bias in the initial values is reflected in the central activation values obtained by active diffusion. For example, by setting a high initial central activation value for an element in which the user is interested, the document processing apparatus can obtain central activation values that reflect the user's interest.
[0057]
For the reference/referenced links, which represent dependency relationships by reference between elements, and for the normal links, which are all other links, the document processing apparatus sets the end point activation value of each end point of the links connecting the elements to “0”. The document processing apparatus stores the initial values of the end point activation values thus assigned in, for example, the RAM 14.
[0058]
Here, an example of the connection structure between elements is shown in the drawings. In the figure, elements E_i and E_j are shown as part of the structure of elements and links that make up the document. Elements E_i and E_j have central activation values e_i and e_j, respectively, and are connected by a link L_ij. The end point of link L_ij connected to element E_i is T_ij, and the end point connected to element E_j is T_ji. In addition to element E_j, to which it is connected by link L_ij, element E_i is connected to elements E_k, E_l, and E_m (not shown) by links L_ik, L_il, and L_im, respectively. In addition to element E_i, to which it is connected by link L_ji, element E_j is connected to elements E_p, E_q, and E_r (not shown) by links L_jp, L_jq, and L_jr, respectively.
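The elements, links, and end points described above, together with the initialization of step S41, might be represented as in the following sketch. The class names, field names, and the `interest` parameter are assumptions made for illustration. Keying end points by the name of the element at the other end assumes at most one link between any pair of elements.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    is_vocabulary: bool = False
    central: float = 0.0  # central activation value e_i
    # End point activation values t_ij, keyed by the element at the other end.
    endpoints: dict = field(default_factory=dict)


@dataclass
class Link:
    a: str
    b: str
    is_reference: bool = False  # False: normal link, True: reference link


def initialize(elements, links, interest=None):
    """Assign the initial central and end point activation values (step S41)."""
    interest = interest or {}
    for el in elements.values():
        base = 0.0 if el.is_vocabulary else 1.0
        # A user-supplied value overrides the default to bias the result.
        el.central = interest.get(el.name, base)
        el.endpoints.clear()
    for link in links:
        # Every end point of every link starts at 0.
        elements[link.a].endpoints[link.b] = 0.0
        elements[link.b].endpoints[link.a] = 0.0


elements = {
    "doc": Element("doc"),
    "s1": Element("s1"),
    "w1": Element("w1", is_vocabulary=True),
}
links = [Link("doc", "s1"), Link("s1", "w1")]
initialize(elements, links, interest={"s1": 2.0})
print({name: el.central for name, el in elements.items()})
```

The `interest` mapping corresponds to the option of assigning a higher initial value to elements the user is interested in.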
[0059]
Subsequently, in step S42 in FIG. 7, the document processing apparatus initializes, under the control of the CPU 13, a counter that counts the elements E_i constituting the document. That is, the document processing apparatus sets the count value i of the counter for counting elements to “1”. As a result, the counter refers to the first element E_1.
[0060]
Subsequently, in step S43, the document processing apparatus executes, under the control of the CPU 13, link processing for calculating a new central activation value for the element referred to by the counter. This link processing will be described in more detail later.
[0061]
Subsequently, in step S44, the document processing apparatus determines, under the control of the CPU 13, whether the calculation of new central activation values has been completed for all elements in the document.
[0062]
If the document processing apparatus determines that the calculation of new central activation values has been completed for all elements in the document, the process proceeds to step S45. If it determines that the calculation has not been completed for all elements in the document, the process proceeds to step S47.
[0063]
Specifically, under the control of the CPU 13, the document processing apparatus determines whether the count value i of the counter has reached the total number of elements included in the document. When it determines that the count value i has reached the total number of elements, the document processing apparatus regards the calculation as complete for all elements and proceeds to step S45. On the other hand, when it determines that the count value i has not reached the total number of elements, the document processing apparatus regards the calculation as incomplete and proceeds to step S47.
[0064]
When the document processing apparatus determines that the count value i of the counter has not reached the total number of elements included in the document, in step S47 it increments the count value i by “1” under the control of the CPU 13, so that the count value becomes “i + 1”. The counter thus refers to the (i + 1)-th element, that is, the next element. The document processing apparatus then returns to step S43, and the calculation of the end point activation values and the subsequent series of processes are executed for the next, (i + 1)-th element.
[0065]
When the document processing apparatus determines that the count value i of the counter has reached the total number of elements included in the document, in step S45 it calculates, under the control of the CPU 13, the average of the changes in the central activation values of all elements included in the document, that is, the average change of the newly calculated central activation values relative to the original central activation values.
[0066]
Under the control of the CPU 13, the document processing apparatus reads, for all elements included in the document, the original central activation values stored in, for example, the RAM 14 and the newly calculated central activation values. The document processing apparatus calculates the average change in the central activation values of all elements by dividing the sum of the changes of the newly calculated central activation values relative to the original central activation values by the total number of elements included in the document. The document processing apparatus stores the average change thus calculated in, for example, the RAM 14.
[0067]
In step S46, under the control of the CPU 13, the document processing apparatus determines whether the average change in the central activation values of all elements calculated in step S45 is within a preset threshold. When the document processing apparatus determines that the change is within the threshold, it ends the series of steps. On the other hand, when it determines that the change is not within the threshold, the process returns to step S42, where the count value i of the counter is set to “1”, and the series of steps for calculating the central activation values of the elements of the document is repeated. The amount of change gradually decreases each time the loop from step S42 to step S46 is repeated.
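The control flow of steps S42 to S47 can be summarized in the sketch below. It is a minimal illustration, not the patent's implementation: `update_element` is a hypothetical callable standing in for the link processing of step S43, the change is measured as an absolute difference, values are updated in place, and `max_rounds` is a safety cap added for this sketch.

```python
def spread_activation(central, update_element, threshold, max_rounds=1000):
    """Repeat the element loop until the mean change is within the threshold.

    central: dict mapping element name to central activation value
             (updated in place).
    update_element: hypothetical stand-in for step S43; given an element name
             and the current values, returns that element's new value.
    Returns the number of rounds performed.
    """
    for rounds in range(1, max_rounds + 1):
        total_change = 0.0
        for name in list(central):  # S42, S44, S47: visit each element in turn
            new_value = update_element(name, central)  # S43
            total_change += abs(new_value - central[name])
            central[name] = new_value
        mean_change = total_change / len(central)  # S45
        if mean_change <= threshold:  # S46
            return rounds
    return max_rounds


# Toy update rule used only to exercise the loop: each value moves
# halfway toward 1.0 on every round.
values = {"a": 0.0, "b": 2.0}
rounds = spread_activation(values, lambda n, c: (c[n] + 1.0) / 2, threshold=0.1)
print(rounds, values)
```

With the toy rule the mean change halves on every round, which mirrors the statement that the amount of change gradually decreases as the loop is repeated.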
[0068]
The document processing apparatus can perform active diffusion in this way. Next, the link processing executed in step S43 to perform this active diffusion will be described with reference to FIG. 9. Note that the flowchart shown in FIG. 9 shows the processing for a single element E_i, and this processing is performed for all elements.
[0069]
First, as shown in FIG. 9, in step S51 the document processing apparatus initializes, under the control of the CPU 13, a counter that counts the links having one end connected to one element E_i constituting the document. That is, the document processing apparatus sets the count value j of the counter for counting links to “1”. This counter thus refers to the first link L_ij connected to element E_i.
[0070]
Subsequently, in step S52, the document processing apparatus determines, under the control of the CPU 13, whether the link L_ij connecting elements E_i and E_j is a normal link by referring to the tag of the relation attribute. That is, the document processing apparatus determines whether link L_ij is a normal link, which indicates a relationship among elements such as a vocabulary element corresponding to a word, a sentence element corresponding to a sentence, and a paragraph element corresponding to a paragraph, or a reference link, which indicates a dependency relationship by reference/referenced. If the document processing apparatus determines that link L_ij is a normal link, the process proceeds to step S53. If it determines that link L_ij is a reference link, the process proceeds to step S54.
[0071]
When the document processing apparatus determines that link L_ij is a normal link, in step S53 it calculates a new end point activation value for the end point T_ij of element E_i connected to the normal link L_ij.
[0072]
In step S53, the link L_ij has been found to be a normal link by the determination in step S52. The new end point activation value t_ij of the end point T_ij of element E_i connected to the normal link L_ij is obtained as follows. Among the end point activation values of element E_j, the end point activation values t_jp, t_jq, and t_jr of all end points T_jp, T_jq, and T_jr connected to links other than link L_ij are added to the central activation value e_j of element E_j, to which element E_i is connected by link L_ij, and the sum is divided by the total number of elements included in the document.
[0073]
The document processing apparatus reads the necessary end point activation values and center activation values from, for example, the RAM 14 under the control of the CPU 13. Using the values read, the document processing apparatus calculates the new end point activation value of the end point connected to the normal link as described above. The document processing apparatus then stores the new end point activation value thus calculated in, for example, the RAM 14.
[0074]
On the other hand, if the link L_ij is determined not to be a normal link, the document processing apparatus, in step S54, calculates the end point activation value of the end point T_ij of the element E_i that is connected to the reference link.
[0075]
In step S54, the link L_ij has been found to be a reference link by the determination in step S52. The end point activation value t_ij of the end point T_ij of the element E_i connected to the reference link L_ij is obtained by adding, among the end point activation values of the element E_j, the end point activation values t_jp, t_jq, and t_jr of all the end points T_jp, T_jq, and T_jr connected to links other than the link L_ij to the center activity value e_j of the element E_j, which is connected to the element E_i by the link L_ij. Unlike the normal link case, the sum is not divided by the number of elements.
[0076]
Under the control of the CPU 13, the document processing apparatus reads the necessary endpoint activation values and center activation values from, for example, those stored in the RAM 14. Using the values read, the document processing apparatus calculates the new end point activation value of the end point connected to the reference link as described above. The document processing apparatus then stores the end point activation value thus calculated in, for example, the RAM 14.
[0077]
The normal link processing in step S53 and the reference link processing in step S54 are executed for all the links L_ij connected to the element E_i referred to by the count value i, as shown by the loop that runs from step S52 to step S55 and returns to step S52 via step S57. In step S57, the count value j, which counts the links connected to the element E_i, is incremented.
[0078]
After performing the processing of step S53 or step S54, the document processing apparatus, under the control of the CPU 13, determines in step S55 whether or not endpoint activation values have been calculated for all the links connected to the element E_i. If it determines that they have been calculated for all the links, the process proceeds to step S56; if it determines that they have not yet been calculated for all the links, the process proceeds to step S57.
[0079]
If the document processing apparatus determines that the endpoint activation values have been calculated for all links, it updates the center activity value e_i of the element E_i in step S56 under the control of the CPU 13.
[0080]
The new value, that is, the updated value, of the center activity value e_i of the element E_i is obtained as the sum of the current center activity value e_i of the element E_i and the new endpoint activity values of all the end points of the element E_i, namely e_i′ = e_i + Σ t_j′. Here, the prime “′” denotes a new value. Thus, the new center activity value is obtained by adding the sum of the new endpoint activity values of the end points of the element to the original center activity value of that element.
[0081]
Under the control of the CPU 13, the document processing apparatus reads the necessary end point activation values from, for example, the end point activation values and center activation values stored in the RAM 14. The document processing apparatus executes the calculation described above to obtain the new center activity value e_i′ of the element E_i. The document processing apparatus then stores the calculated new center activation value e_i′ in, for example, the RAM 14.
[0082]
In this way, the document processing apparatus calculates a new central activity value for each element in the document, and thereby performs the active diffusion of step S21 in FIG. 6.
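The link processing of steps S52 to S57 can be illustrated with a minimal Python sketch. The data layout is an assumption made only for this illustration: `links[i]` maps each neighbour j of element i to the link kind (`"normal"` or `"reference"`), `endpoint[(i, j)]` holds the end point activation value of element i on the link to j, and `center[i]` holds the center activity value. None of these names appear in the description itself.

```python
def update_element(i, center, endpoint, links, n_elements):
    """One pass of the FIG. 9 loop for element i (hypothetical data layout).

    Returns the updated center activity value of element i.
    """
    new_values = {}
    for j, kind in links[i].items():          # loop over all links L_ij
        # End point values of E_j on every link except the one back to E_i.
        others = sum(endpoint[(j, k)] for k in links[j] if k != i)
        total = others + center[j]
        if kind == "normal":                  # step S53: divide by element count
            total /= n_elements
        # For a reference link (step S54) the sum is used without division.
        new_values[(i, j)] = total
    endpoint.update(new_values)
    # Step S56: e_i' = e_i + sum of the new end point values of E_i.
    center[i] = center[i] + sum(new_values.values())
    return center[i]
```

In this sketch the only difference between the two link kinds is the division by the number of elements, so a reference link passes on more activation than a normal link for the same neighbour.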
[0083]
Subsequently, in step S22 in FIG. 6, the document processing apparatus, under the control of the CPU 13, takes the size of the display area 220 of the window 190 displayed on the display unit 30, that is, the maximum number of characters that can be displayed in this display area 220, and sets it as W_s. Further, the document processing apparatus initializes the summary sentence S under the control of the CPU 13 and sets its initial value to S₀ = “”. This indicates that no character string exists in the summary sentence. The document processing apparatus stores the maximum number of characters W_s that can be displayed in the display area 220 and the initial value S₀ of the summary sentence S, set as described above, in, for example, the RAM 14.
[0084]
Subsequently, in step S23, under the control of the CPU 13, the document processing apparatus sets the count value i of the counter that counts the sequential creation of the summary text skeleton to “1”. That is, the document processing apparatus sets i = 1 for the count value. The document processing apparatus stores the count value i set in this way in, for example, the RAM 14.
[0085]
Subsequently, in step S24, under the control of the CPU 13, the document processing apparatus extracts, from the document from which the summary sentence is to be created, the skeleton of the sentence having the i-th highest average central activity value, where i is the count value of the counter. Here, the average central activity value is the average of the central activity values of the elements constituting one sentence. The document processing apparatus reads, for example, the summary sentence S_i-1 stored in the RAM 14 and adds the character string of the extracted sentence skeleton to this summary sentence S_i-1 to obtain a summary sentence S_i. The document processing apparatus then stores the summary sentence S_i thus obtained in, for example, the RAM 14. At the same time, the document processing apparatus creates a list l_i of the central activity values of the elements not included in the sentence skeleton and stores this list l_i in, for example, the RAM 14.
[0086]
That is, in step S24, the document processing apparatus selects sentences in descending order of average central activity value under the control of the CPU 13 and extracts the skeleton of each selected sentence. The skeleton of a sentence is composed of the essential elements extracted from the sentence. An element can be an essential element if it is the head of an element, if it has a relationship attribute of subject, object, indirect object, possessor, cause, condition, or comparison, or, when a coordinate structure is itself an essential element, if it is directly included in that coordinate structure. The document processing apparatus connects the essential elements of a sentence to generate the skeleton of the sentence and adds it to the summary sentence.
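As an illustration of how a sentence skeleton might be assembled from essential elements, the following Python sketch filters the elements of one sentence by relationship attribute. The dictionary keys (`text`, `relation`, `head`) are hypothetical, and coordinate structures are left out for brevity, so this is a simplified reading of step S24 rather than the method itself.

```python
# Relationship attributes that make an element essential, as listed for step S24.
# The dictionary-based element layout is an assumption for this sketch.
ESSENTIAL_RELATIONS = {
    "subject", "object", "indirect object",
    "possessor", "cause", "condition", "comparison",
}

def is_essential(element):
    """An element is essential if it is a head or carries an essential relation."""
    return element.get("head", False) or element.get("relation") in ESSENTIAL_RELATIONS

def sentence_skeleton(elements):
    """Connect the essential elements of one sentence in their original order."""
    return " ".join(e["text"] for e in elements if is_essential(e))

sentence = [
    {"text": "I", "relation": "subject"},
    {"text": "current job", "relation": "object"},
    {"text": "very much", "relation": "degree"},   # not essential, dropped
    {"text": "do not like", "head": True},
]
print(sentence_skeleton(sentence))
```

The element with the relationship attribute "degree" is dropped because that attribute is not among the essential ones, while the subject, the object, and the head are kept.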
[0087]
Subsequently, in step S25, the document processing apparatus, under the control of the CPU 13, determines whether or not the number of characters in the summary sentence S_i exceeds the maximum number of characters W_s of the display area 220 of the window 190.
[0088]
If the document processing apparatus determines that the number of characters in the summary sentence S_i exceeds the maximum number of characters W_s, it sets the summary sentence S_i-1 as the final summary sentence in step S30 under the control of the CPU 13 and ends the series of processing. In this case, since the summary sentence S_i = S₀ = “” is output, no summary sentence is displayed in the display area 220.
[0089]
On the other hand, if the document processing apparatus determines that the number of characters in the summary sentence S_i does not exceed the maximum number of characters W_s, the process proceeds to step S26, where, under the control of the CPU 13, it compares the average central activity value of the sentence having the (i + 1)-th highest average central activity value with the central activity value of the element having the highest central activity value among the elements in the list l_i created in step S24. If the average central activity value of the sentence having the (i + 1)-th highest average central activity value is higher than the central activity value of the element having the highest central activity value in the list l_i, the process proceeds to step S27; otherwise, the process proceeds to step S28.
[0090]
If the document processing apparatus determines that the average central activity value of the sentence having the (i + 1)-th highest average central activity value is higher than the central activity value of the element having the highest central activity value in the list l_i, it increments the count value i of the counter by “1” in step S27 under the control of the CPU 13, and the process returns to step S24.
[0091]
If, instead, the document processing apparatus determines that the average central activity value of the sentence having the (i + 1)-th highest average central activity value is not higher than the central activity value of the element having the highest central activity value in the list l_i, then in step S28, under the control of the CPU 13, it adds the element e having the highest central activity value in the list l_i to the summary sentence S_i to generate a summary sentence SS_i, and deletes the element e from the list l_i. The document processing apparatus then stores the summary sentence SS_i generated in this way in, for example, the RAM 14.
[0092]
Subsequently, in step S29, the document processing apparatus, under the control of the CPU 13, determines whether or not the number of characters in the summary sentence SS_i exceeds the maximum number of characters W_s of the display area 220 of the window 190. If it determines that the number of characters in the summary sentence SS_i does not exceed W_s, the processing from step S26 is repeated. If it determines that the number of characters in the summary sentence SS_i exceeds W_s, it sets the summary sentence S_i as the final summary sentence in step S31 under the control of the CPU 13, displays it in the display area 220, and ends the series of processing. In this way, the document processing apparatus generates a summary sentence that does not exceed the maximum number of characters W_s.
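The loop of steps S22 to S31 can be sketched in Python as follows. This is a simplified reading under several assumptions that are not part of the description: elements are modelled by a hypothetical `Element` class, the summary is rendered by joining the chosen elements in document order with spaces, and an element added in step S28 that still fits is kept in the result rather than discarded when the limit is reached.

```python
from dataclasses import dataclass

@dataclass
class Element:
    text: str
    activity: float    # central activity value after active diffusion
    essential: bool    # True if the element belongs to the sentence skeleton

def render(sentences, chosen):
    """Join the chosen elements in document order (assumed rendering)."""
    return " ".join(el.text for s, sent in enumerate(sentences)
                    for k, el in enumerate(sent) if (s, k) in chosen)

def summarize(sentences, max_chars):
    def avg(sent):
        return sum(el.activity for el in sent) / len(sent)
    # Sentences in descending order of average central activity value (S24).
    order = sorted(range(len(sentences)), key=lambda s: avg(sentences[s]), reverse=True)
    chosen = set()     # S22: the initial summary S_0 is empty
    leftovers = []     # list l_i: elements outside the extracted skeletons
    i = 0
    while i < len(order):
        s = order[i]
        skeleton = {(s, k) for k, el in enumerate(sentences[s]) if el.essential}
        candidate = chosen | skeleton
        if len(render(sentences, candidate)) > max_chars:       # S25 -> S30
            return render(sentences, chosen)
        chosen = candidate
        leftovers += [(s, k) for k, el in enumerate(sentences[s]) if not el.essential]
        next_avg = avg(sentences[order[i + 1]]) if i + 1 < len(order) else float("-inf")
        while leftovers:                                        # S26 to S29
            best = max(leftovers, key=lambda p: sentences[p[0]][p[1]].activity)
            if next_avg > sentences[best[0]][best[1]].activity: # S26 -> S27
                break
            leftovers.remove(best)                              # S28
            candidate = chosen | {best}
            if len(render(sentences, candidate)) > max_chars:   # S29 -> S31
                return render(sentences, chosen)
            chosen = candidate
        i += 1                                                  # S27
    return render(sentences, chosen)
```

With a small limit the function returns only the skeleton of the highest-ranked sentence, and with a limit smaller than that skeleton it returns an empty string, which corresponds to the case in which nothing is displayed.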
[0093]
By performing such a series of processing, the document processing apparatus can create a summary sentence by summarizing the tagged documents. For example, when the document processing apparatus summarizes the document shown in FIG. 4, the document processing apparatus creates a summary sentence as shown in FIG. 10 and displays it in the display area 220 of the display range.
[0094]
That is, the document processing apparatus creates the summary sentence “The history of TCP/IP cannot be told without ARPANET. ARPANET started as a small network that connected the host computers of four universities and research institutes on the west coast of North America with a 50 kbps line in 1969. At that time, the mainframe general-purpose computer series was developed in 1964. Considering the background of this era, such a project in anticipation of the future of computer communications can be said to be truly unique to the United States.” and displays it in the display area 220.
[0095]
With the document processing apparatus, the user can grasp the outline of a document by reading the summary sentence instead of reading the entire document, and can thereby determine whether the document is the desired information.
[0096]
In the document processing apparatus, active diffusion as described above need not necessarily be used as the method of assigning importance to elements in a document. For example, a method may be used in which the sum of the weights of the words appearing in a document is taken as the importance of the document. Methods other than these can also be used to assign importance. Furthermore, by inputting a keyword into the keyword input unit 192 of the display area 200, the importance can be set on the basis of that keyword.
[0097]
As shown in FIG. 5, the document processing apparatus can expand the display range of the display area 220 of the window 190 displayed on the display unit 30. When the display range of the display area 220 is changed while a created summary sentence is displayed in the display area 220, the amount of information in the summary sentence can be changed in accordance with the display range.
[0098]
In this case, the document processing apparatus, under the control of the CPU 13, waits until the display range of the display area 220 of the window 190 displayed on the display unit 30 is changed by the user operating the input unit 20. Then, when the display range of the display area 220 is changed, the document processing apparatus performs the same processing as the series of processes shown in FIG. 6 under the control of the CPU 13 to create a summary sentence corresponding to the display range of the display area 220.
[0099]
In this way, the document processing apparatus can newly create a summary sentence corresponding to the display range of the display area 220. For example, when the user expands the display range of the display area 220 by dragging with the mouse of the input unit 20, the document processing apparatus creates a new, more detailed summary sentence and displays it in the display area 220 of the window 190, as shown in FIG. 11.
[0100]
That is, the document processing apparatus creates the summary sentence “The history of TCP/IP cannot be told without ARPANET. ARPANET is an experiment and research project built under the sponsorship of the Defense Advanced Research Projects Agency of the US Department of Defense (DOD). ARPANET started as a very small network that connected the host computers of four universities and research institutes on the west coast of North America via a 50 kbps line in 1969. At that time, ENIAC, the world's first computer, had been developed at the University of Pennsylvania in 1945, and in 1964 a mainframe general-purpose computer series with ICs mounted as logic elements was developed for the first time, so that computers had only just come into being. Such a project in anticipation of the future of computer communications can be said to be truly unique to the United States.” and displays it in the display area 220.
[0101]
As described above, with the document processing apparatus, when the displayed summary sentence is too brief to grasp the outline of the document, the user can expand the display range of the display area 220 to refer to a more detailed summary sentence containing more information.
[0102]
Here, when creating a summary sentence of a document in this way, the document processing apparatus performs replacement if the summary sentence includes pronouns or qualifying clauses, and, if a subject, object, or the like is omitted, performs the above-described zero anaphoric processing to supplement the corresponding subject or object.
[0103]
First, specific examples of the replacement of pronouns and of zero anaphora will be described with reference to the following document.
[0104]
“About work.
[0105]
I don't like my current job very much. But you have to do it.”
An example of tagging this document is as follows:
[0106]
<document>
<title><adjective verb phrase relation="object"><noun phrase>work</noun phrase></adjective verb phrase></title>
<paragraph>
<sentence><adjective verb phrase relation="subject"><noun phrase identifier="identifier 1">I</noun phrase></adjective verb phrase><adjective verb phrase relation="object"><noun phrase identifier="identifier 0"><adjective verb phrase>current</adjective verb phrase>job</noun phrase></adjective verb phrase><verb><adjective verb phrase relation="degree">very much</adjective verb phrase><verb>do not like</verb></verb>.</sentence>
<sentence><adjective verb phrase>but</adjective verb phrase><verb><adjective verb phrase relation="object"><noun phrase reference="identifier 0">it</noun phrase></adjective verb phrase><verb subject="identifier 1">must be done</verb></verb>.</sentence>
</paragraph>
</document>
[0107]
In this document, the noun phrase “it” has the attribute reference="identifier 0" and therefore refers to the noun phrase “current job”, which is the element having identifier="identifier 0". That is, the reference element, namely the antecedent, corresponding to the pronoun “it”, which is the referenced element, is “current job”. Therefore, if the summary sentence includes the noun phrase “it” but does not include the noun phrase “current job”, “it” is replaced with “current job” in the summary sentence.
[0108]
Further, in the above document, the verb “must be done” has the attribute subject="identifier 1", which shows that its semantic subject is the noun phrase “I”, the element having the attribute identifier="identifier 1". That is, the zero anaphoric element of the element “must be done” is “I”. Therefore, if the summary sentence includes “must be done” but does not include its semantic subject “I”, the subject is supplied in parentheses in the summary sentence, as in “(I) must be done”.
[0109]
Such replacement processing of pronouns and the like and such zero anaphoric processing are performed after the above-described summary sentence creation or simultaneously with it. Specific examples of these processes will be described with reference to the flowcharts of FIGS. 12 and 13.
[0110]
That is, FIG. 12 is a flowchart for explaining the processing performed when a pronoun or a limited clause is included in the summary sentence. The processing shown in FIG. 12 is performed following, for example, steps S30 and S31 in FIG. 6. In the processing shown in FIG. 12, a referenced list RBList is used to list the referenced elements, such as pronouns, in the reference/referenced relationships in the document; the i-th element of the array of vocabulary elements for the summary is denoted t_i, and the reference element of this vocabulary element t_i is denoted r_i.
[0111]
In the first step S71 in FIG. 12, the document processing apparatus empties the referenced list RBList under the control of the CPU 13. In the next step S72, the document processing apparatus sets the count value i of a counter that counts the summary vocabulary elements in order of arrangement to 1 (i = 1).
[0112]
In the next step S73, the document processing apparatus takes the i-th element of the array of summary vocabulary elements as t_i, adds the vocabulary element t_i to the referenced list RBList where applicable, and takes the reference element of the vocabulary element t_i as r_i. That is, in step S73, if the element t_i is referred to by another element such as a pronoun, it is added to the referenced list RBList, and if the element t_i is itself a pronoun or the like that refers to another element (its antecedent), that antecedent is taken as the reference element r_i.
[0113]
In the next step S74, the document processing apparatus determines whether or not a reference element r_i exists for the vocabulary element t_i. If YES, that is, if it determines that r_i exists, the process proceeds to step S75; if NO, the process proceeds to step S76. In other words, the process proceeds to step S75 only when the element t_i is a pronoun or the like and its reference element r_i exists.
[0114]
In step S75, the document processing apparatus determines whether or not the vocabulary element t_i is an element of the referenced list RBList; if YES, the process proceeds to step S76, and if NO, to step S77. In step S76, the document processing apparatus adds the vocabulary element t_i to the summary sentence, and the process proceeds to step S79. In step S77, the document processing apparatus adds the reference element r_i of the vocabulary element t_i to the summary sentence, and the process proceeds to step S78, where r_i is added to the referenced list RBList; the process then proceeds to step S79.
[0115]
The processing in steps S75 to S78 works as follows when the element t_i is a pronoun or the like that refers to another element. If the element t_i is an element of the referenced list RBList, that is, if its antecedent is already included in the summary sentence, the element t_i is added to the summary sentence as it is, without being replaced by the antecedent. If the element t_i is not an element of the referenced list RBList, the antecedent has not yet been included in the summary sentence, so the antecedent, that is, the reference element r_i, is added to the summary sentence in place of the element t_i.
[0116]
In step S79, the document processing apparatus determines whether or not the processing from step S73 onward has been completed for all elements of the array of summary vocabulary elements. If NO, the count value i is incremented by 1 (i = i + 1) in step S80 and the process returns to step S73; if YES, the processing ends.
[0117]
As a result of the above-described replacement processing, when a referenced element such as a pronoun or a restricted clause exists in the summary sentence but the corresponding reference element, its antecedent, is not included, the referenced element t_i that appears first is replaced with the reference element r_i, and this reference element r_i is added to the referenced list RBList. Referenced elements that correspond to the same reference element r_i and appear thereafter are therefore added to the summary sentence as they are, without being replaced.
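The effect of the replacement processing can be illustrated with a short Python sketch. It is a simplified reading of FIG. 12, not a transcription of it: the token layout (`text`, `id`, `ref`) and the `antecedents` mapping are hypothetical, and the set `rb_list` records the identifiers of antecedents that are already present in the output, which is one interpretation of the role of RBList.

```python
def replace_pronouns(tokens, antecedents):
    """Replace the first pronoun whose antecedent is missing from the summary.

    tokens: list of dicts with 'text', an optional 'id' (the token's own
    identifier) and an optional 'ref' (identifier of its antecedent).
    antecedents: maps an identifier to the antecedent's text.
    """
    rb_list = set()                        # antecedents already in the output
    out = []
    for tok in tokens:                     # loop over steps S73 to S80
        ref = tok.get("ref")
        if ref is not None and ref not in rb_list:
            out.append(antecedents[ref])   # S77: emit the antecedent instead
            rb_list.add(ref)               # S78: remember it
        else:
            out.append(tok["text"])        # S76: emit the token unchanged
        if "id" in tok:
            rb_list.add(tok["id"])         # the antecedent itself is in the summary
    return " ".join(out)

antecedents = {"identifier 0": "current job"}
summary = [
    {"text": "But"},
    {"text": "it", "ref": "identifier 0"},
    {"text": "must be done"},
]
print(replace_pronouns(summary, antecedents))
```

Only the first pronoun referring to a missing antecedent is replaced; later pronouns with the same antecedent are left as they are, which matches the behaviour described for the referenced list.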
[0118]
Note that the specific example of the replacement process shown in FIG. 12 has been described as being performed following the summary sentence creation process shown in FIG. 6, but may be performed simultaneously with the summary sentence creation.
[0119]
Next, FIG. 13 is a flowchart for explaining the zero anaphoric processing described above, which is performed when the summary sentence contains a sentence whose subject, object, or the like is omitted. The processing shown in FIG. 13 is performed, for example, following steps S30 and S31 in FIG. 6, and before, after, or simultaneously with the processing in FIG. 12. In the processing shown in FIG. 13, a zero anaphoric list ZAList is used to list the zero anaphoric elements, such as omitted subjects and objects, in the document; the i-th element of the array of vocabulary elements for the summary is denoted t_i, and the zero anaphoric element of this vocabulary element t_i is denoted z_i.
[0120]
In the first step S81 in FIG. 13, the document processing apparatus empties the zero anaphoric list ZAList under the control of the CPU 13. In the next step S82, the document processing apparatus sets the count value i of a counter that counts the summary vocabulary elements in order of arrangement to 1 (i = 1).
[0121]
In the next step S83, the document processing apparatus takes the i-th element of the array of summary vocabulary elements as t_i and, if the vocabulary element t_i has a zero anaphoric element, takes it as z_i. In the next step S84, the document processing apparatus determines whether or not a zero anaphoric element z_i exists for the vocabulary element t_i. If YES in step S84 (that is, if it is determined that a zero anaphoric element z_i exists), the process proceeds to step S85; if NO, the process proceeds to step S86.
[0122]
In step S85, the document processing apparatus determines whether or not the vocabulary element t_i is an element of the zero anaphoric list ZAList; if YES, the process proceeds to step S86, and if NO, to step S87. In step S86, the vocabulary element t_i is added to the summary sentence, and the process proceeds to step S89. In step S87, the document processing apparatus determines whether or not the zero anaphoric element z_i is already included in the summary sentence; if YES, the zero anaphoric element z_i is added to the zero anaphoric list ZAList and the process proceeds to step S86, and if NO, the process proceeds to step S89. In step S89, the document processing apparatus encloses the vocabulary string of the zero anaphoric element z_i of the vocabulary element t_i in parentheses and adds it to the summary sentence together with the vocabulary element t_i, and the process proceeds to step S90, where the vocabulary element t_i and the zero anaphoric element z_i are added to the zero anaphoric list ZAList; the process then proceeds to step S91. In steps S89 and S90, if the zero anaphoric element z_i is a subject, the document processing apparatus appends “ga” to its vocabulary string and encloses the result in parentheses; if it is an object, it appends the corresponding object particle to the vocabulary string and encloses the result in parentheses. The parenthesized string is then placed before or after the vocabulary element t_i. In the case of Japanese, the zero anaphoric element is placed before the vocabulary element.
[0123]
In step S91, the document processing apparatus determines whether or not the processing from step S83 onward has been completed for all elements of the array of summary vocabulary elements. If NO, the count value i is incremented by 1 (i = i + 1) in step S92 and the process returns to step S83; if YES, the processing ends.
[0124]
By the zero anaphoric processing described above, when an element t_i whose subject or object is omitted exists in the summary sentence but the corresponding zero anaphoric element z_i is not included in the summary sentence, the zero anaphoric element z_i is added in parentheses to the element t_i, and the element t_i and the zero anaphoric element z_i are added to the zero anaphoric list ZAList. Elements that correspond to the same zero anaphoric element z_i and appear thereafter are then added to the summary sentence as they are, without z_i being added.
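A minimal Python sketch of this behaviour follows. It is a simplified reading of FIG. 13 under stated assumptions: the token layout (`text`, `id`, `zero`) and the `antecedents` mapping are hypothetical, the supplement is always placed before the element, and the Japanese particle that the description attaches to the supplement is omitted.

```python
def fill_zero_anaphora(tokens, antecedents):
    """Supply an omitted subject or object in parentheses, once per antecedent.

    tokens: list of dicts with 'text', an optional 'id' (the token's own
    identifier) and an optional 'zero' (identifier of the omitted subject
    or object). antecedents: maps an identifier to the omitted element's text.
    """
    # Identifiers of elements that already appear in the summary (check of S87).
    present = {tok["id"] for tok in tokens if "id" in tok}
    za_list = set()                            # plays the role of ZAList
    out = []
    for tok in tokens:                         # loop over steps S83 to S92
        z = tok.get("zero")
        if z is not None and z not in present and z not in za_list:
            out.append(f"({antecedents[z]})")  # S89: supplement in parentheses
            za_list.add(z)                     # S90: remember it
        out.append(tok["text"])
    return " ".join(out)

antecedents = {"identifier 1": "I"}
summary = [
    {"text": "But"},
    {"text": "it"},
    {"text": "must be done", "zero": "identifier 1"},
]
print(fill_zero_anaphora(summary, antecedents))
```

Because the identifier is recorded after the first supplement, a later element with the same omitted subject is emitted unchanged, so the zero anaphoric element appears only once in the output.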
[0125]
Incidentally, when the replacement processing of pronouns and the like shown in FIG. 12 or the zero anaphoric processing shown in FIG. 13 is performed after the summary sentence creation shown in FIG. 6, the number of characters in the summary sentence changes, so that it may exceed, or fall short of, the maximum number of characters of the summary sentence described above, that is, the maximum number of characters that can be displayed given the size of the summary sentence display area. Therefore, to bring the number of characters in the summary sentence as close as possible to the maximum while remaining within it, the number of characters must be adjusted at the end. If the replacement of pronouns and the like or the zero anaphoric processing causes the number of characters in the summary sentence to exceed the maximum, elements of low importance in the summary sentence, that is, elements with low central activity values, are deleted in ascending order of central activity value until the number of characters falls within the maximum. If the number of characters is less than the maximum, elements of the original document that are not included in the summary sentence are added to the summary sentence in descending order of central activity value, and the addition is stopped immediately before the maximum number of characters would be exceeded. A summary sentence is thereby obtained whose number of characters is within, and closest to, the maximum number of characters.
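The final adjustment can be sketched in Python as follows. The representation is an assumption for this illustration only: each element is a tuple of (position in the document, text, central activity value), and the summary is rendered by joining the texts in document order with spaces.

```python
def fit_to_limit(summary, unused, max_chars):
    """Trim or extend a summary so that it fits within max_chars.

    summary, unused: lists of (position, text, activity) tuples; this tuple
    layout is hypothetical and chosen only for the sketch.
    """
    def text_of(items):
        return " ".join(t for _, t, _ in sorted(items))

    summary = list(summary)
    if len(text_of(summary)) > max_chars:
        # Too long: delete elements in ascending order of central activity value.
        while summary and len(text_of(summary)) > max_chars:
            summary.remove(min(summary, key=lambda item: item[2]))
    else:
        # Too short: add unused elements in descending order of central
        # activity value, stopping just before the limit would be exceeded.
        for item in sorted(unused, key=lambda item: item[2], reverse=True):
            if len(text_of(summary + [item])) > max_chars:
                break
            summary.append(item)
    return text_of(summary)
```

For example, a summary consisting of "ARPANET", "was", and "small" with a limit of 13 characters loses "was" if that element has the lowest central activity value, giving "ARPANET small".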
[0126]
In the present embodiment, one example of a method of tagging a document has been described, but the present invention is not limited to this tagging method. Also, in the present embodiment, the document is transmitted from the outside to the communication unit 22 of the document processing apparatus via a telephone line, but the present invention is not limited to this. For example, the present invention can also be applied to a case where a document is transmitted via a satellite or the like, a case where the recording/reproducing unit 31 reads the document from the recording medium 32, or a case where the document is written in advance in a ROM of the document processing apparatus.
[0127]
Further, in the embodiment of the present invention, a disk-shaped recording medium, a tape-shaped recording medium, or the like on which the document processing program is written can easily be provided as the recording medium 32. The above-described document processing program can also easily be supplied via a transmission medium such as a communication line.
[0128]
In the above-described embodiment, a mouse is given as an example of a device for selecting a desired element from the document displayed on the display unit 30 of the document processing apparatus, but it goes without saying that the present invention is not limited to this. Other devices, such as a tablet or a light pen, can also be used to input elements into the document processing apparatus.
[0129]
Furthermore, although Japanese sentences were used as examples in the above-described embodiment, it goes without saying that the present invention is not limited to Japanese and can be applied to various languages, such as English, German, French, Russian, Italian, Spanish, Chinese, and Korean.
[0130]
[Effect of the invention]
As is clear from the above description, according to the present invention, when a summary sentence of a document is created and a subject or object omitted in the created summary sentence is not included in that summary sentence, the corresponding subject or object in the document is added to the summary sentence. The zero anaphoric element therefore always appears once in the summary sentence, so that a summary sentence that is accurate and easy for the user to understand can be generated automatically.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of a document processing apparatus to which an exemplary embodiment is applied.
FIG. 2 is a diagram illustrating an example of an internal structure by tagging a document.
FIG. 3 is a diagram showing a window displaying an internal structure by tagging a document.
FIG. 4 is a diagram showing a window displaying a document.
FIG. 5 is a diagram showing a window displaying a document, in a state in which the display area for displaying a summary sentence is enlarged as compared with the display area shown in FIG. 13;
FIG. 6 is a flowchart for describing a series of processes when creating a summary sentence.
FIG. 7 is a flowchart for explaining a series of processes when active diffusion is performed.
FIG. 8 is a diagram showing an element connection structure for explaining active diffusion processing;
FIG. 9 is a flowchart for explaining a series of processes when link processing for active diffusion is performed.
FIG. 10 is a diagram showing a window displaying a document and its summary sentence.
FIG. 11 is a diagram showing a window displaying a document and its summary sentence, in a state in which the summary sentence is displayed in the window shown in FIG. 5.
FIG. 12 is a flowchart for explaining replacement processing at a reference element when a referenced element is included in a summary sentence.
FIG. 13 is a flowchart for explaining zero anaphora processing in a summary sentence.
[Explanation of symbols]
10 main body of document processing apparatus, 11 control part, 12 interface, 13 CPU, 20 input part, 22 communication part, 30 display part, 31 recording/reproducing part, 32 recording medium, 33 hard disk apparatus

Claims

A document processing method for a document processing apparatus that processes a document in the form of an electronic document, the method comprising, as performed by the document processing apparatus:
a summary sentence creating step of creating a summary sentence of the document; and
a complementing step of, when the created summary sentence contains an omitted or replaced word, complementing that word,
wherein the document has an internal structure in which a plurality of elements are hierarchized, and tag information indicating the internal structure is given to the document in advance, and
wherein, in the summary sentence creating step, the importance of each element is calculated based on the internal structure indicated by the tag information and is assigned to the element, and, in the complementing step, the complementing is performed based on the tag information.

The document processing method according to claim 1, wherein the summary sentence creating step includes:
a setting step of variably setting the size of a summary sentence display area in which the summary sentence of the document is displayed; and
a determining step of determining the length of the summary sentence of the document based on the size of the display area set in the setting step,
and wherein a summary sentence of the document having a length that fits in the summary sentence display area is created based on the length of the summary sentence determined in the determining step.
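As an illustration only, the relationship between the size of the display area and the length of the summary sentence could be sketched as follows. The function names, the pixel-based parameters, the fixed-width font, and the greedy selection by importance are all assumptions made for this example. The way importance is actually calculated from the tag information is not reproduced here; the importance values are simply taken as given.

```python
def max_summary_chars(area_width_px, area_height_px, char_width_px, line_height_px):
    """Number of characters that fit in the display area (assumes a fixed-width font)."""
    columns = area_width_px // char_width_px
    rows = area_height_px // line_height_px
    return columns * rows


def build_summary(elements, max_chars):
    """Pick elements in descending importance while they fit, then restore document order.

    `elements` is a list of (text, importance) pairs given in document order.
    """
    ranked = sorted(range(len(elements)), key=lambda i: elements[i][1], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(elements[i][0])
        if used + length <= max_chars:
            chosen.append(i)
            used += length
    # Emit the selected elements in their original order so the summary reads naturally.
    return "".join(elements[i][0] for i in sorted(chosen))
```

Enlarging the display area raises the character budget, so calling `build_summary` again with the new value of `max_summary_chars` yields a longer summary sentence from the same importance values.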

The document processing method according to claim 1, wherein, in the complementing step, zero anaphora processing is performed in which, when a subject or object omitted in the created summary sentence is found, based on the tag information, not to be included in the summary sentence, the corresponding subject or object in the original document is added to the summary sentence.

The document processing method according to claim 3, wherein, in the zero anaphora processing, whether or not the omitted subject or object is included in the summary sentence is determined based on the tag information, and, when it is not included, the omitted subject or object is added to the summary sentence in parentheses.

The document processing method according to claim 1, wherein, in the complementing step, when the referenced element corresponding to a reference element included in the created summary sentence is found, based on the tag information, not to be included in the summary sentence, the reference element in the summary sentence is replaced with the corresponding referenced element in the original document.

A document processing apparatus that processes a document in the form of an electronic document, comprising:
summary sentence creating means for creating a summary sentence of the document; and
complementing means for, when the created summary sentence contains an omitted or replaced word, complementing that word,
wherein the document has an internal structure in which a plurality of elements are hierarchized, and tag information indicating the internal structure is given to the document in advance, and
wherein the summary sentence creating means calculates the importance of each element based on the internal structure indicated by the tag information and assigns the importance to the element, and the complementing means performs the complementing based on the tag information.

A computer-readable recording medium on which is recorded a document processing program for causing a computer to execute document processing for processing a document in the form of an electronic document,
the document processing program comprising:
a summary sentence creating step of creating a summary sentence of the document; and
a complementing step of, when the created summary sentence contains an omitted or replaced word, complementing that word,
wherein the document has an internal structure in which a plurality of elements are hierarchized, and tag information indicating the internal structure is given to the document in advance, and
wherein, in the summary sentence creating step, the importance of each element is calculated based on the internal structure indicated by the tag information and is assigned to the element, and, in the complementing step, the complementing is performed based on the tag information.