JP2005011301A - Document processor and document processing program

Info

Publication number: JP2005011301A
Application number: JP2003202076A
Authority: JP
Inventors: Takaaki Yamaoka; 敬章山岡
Original assignee: Individual
Current assignee: Individual
Priority date: 2003-06-20
Filing date: 2003-06-20
Publication date: 2005-01-13

Abstract

PROBLEM TO BE SOLVED: To provide a document processor that lets a user efficiently check, with a simple operation, the context in which a phrase of interest is used at each place it appears in a document.

SOLUTION: When a user specifies a keyword, a document analysis section extracts, for each occurrence of the keyword in the document, a given range containing the keyword as a portion of interest. Each portion of interest is divided into a preceding part, the keyword, and a following part and displayed on a screen. The user selects a second keyword on the screen and performs processing such as narrowing down the results, keyword concatenation, and keyword change.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、自然言語で記述された文書の中から重要な記述部分を効率よく検索、抽出するための文書処理装置及びプログラムに関する。
【０００２】
【従来の技術】
前記のような文書処理装置の一例として特許文献１に記載の情報抽出装置が挙げられる。この装置による文書処理の概略手順は次の通りである。
・主題情報抽出部が、文書の形態素解析で得られる各単語から主題提示語句を抽出する。
・単語間類似度解析部が、前記各単語の類似度を計算する。
・重要語句抽出部が、前記主題提示語句を構成する単語に対して類似度の高い単語を重要語句として抽出する。
・出力部が、前記重要語句を前記主題提示語句を構成する単語とリンク付けして表示する。
【０００３】
また、特許文献１には、前記処理において主題提示語句をユーザに指定させる構成の装置も開示されている。この装置では、まず、前記手順で処理対象文書から主題提示語句及びそれに関連する重要語句を抽出し、それらの語句に下線を付した状態で文書の内容を表示する。ユーザは、下線を付した語句の中からいずれかの語句を新たな主題提示語句として選択することができる。ユーザが語句を選択すると、装置がその語句を構成する単語に対して類似度の高い単語を新たな重要語句として抽出し、それらの語句に下線を付した状態で文書の内容を再表示する。
【特許文献１】特開２０００−２９８６７３号公報
【０００４】
【発明が解決しようとする課題】
文書の内容を検討する際、注目語句が実際にどのような文脈で用いられているのかを調べたい場合がある。そのためには、文書中で注目語句が出現する個所毎にその語句の前後の記述を調べる必要がある。このような観点から特許文献１に記載の装置を評価すると、次のようなことが言える。まず、主題提示語句及び重要語句をリンク付けして表示する方法では、文書全体から抽出された複数の単語又は語句の間の関係が示されるだけであるから、各語句が実際にどのような文脈で用いられているのかを知ることはできない。一方、重要語句に下線を付した状態で文書の内容を表示する方法では、注目語句の前後の記述だけでなく、他の部分の記述も表示される。従って、例えば、文書が長い場合、文書中で注目語句が出現する全ての個所についてその語句の前後の記述を調べる作業に相当な手間がかかる。
【０００５】
本発明は以上のような課題を解決するために成されたものであり、その目的とするところは、注目語句が文書中の各個所でどのような文脈で用いられているかを簡単な操作で効率よく調べることができる文書処理装置を提供することにある。
【０００６】
【課題を解決するための手段】
上記課題を解決するために成された本発明に係る第一の文書処理装置は、
ユーザに第一キーワードを指定させるためのキーワード指定部、
処理対象の文書中における前記第一キーワードの出現個所毎に、該第一キーワードを含む所定範囲の記述を注目部分として抽出する注目部分抽出部、
各注目部分を、第一キーワードに先行する前段部分、第一キーワード、及び、第一キーワードに続く後段部分に分離して画面に表示する注目部分表示部、
前記画面上で前記前段部分又は後段部分からいずれかの語句を第二キーワードとしてユーザに選択させるためのキーワード選択部、及び、
前記第二キーワードが前記前段部分から選択された場合には該第二キーワードを含む各注目部分の後段部分を選択的に表示し、前記第二キーワードが前記後段部分から選択された場合には該第二キーワードを含む各注目部分の前段部分を選択的に表示する表示更新部、
を備えることを特徴としている。
【０００７】
また、本発明に係る第二の文書処理装置は、
ユーザに第一キーワードを指定させるためのキーワード指定部、
処理対象の文書中における前記第一キーワードの出現個所毎に、該第一キーワードを含む所定範囲の記述を注目部分として抽出する注目部分抽出部、
各注目部分を、第一キーワードに先行する前段部分、第一キーワード、及び、第一キーワードに続く後段部分に分離して画面に表示する注目部分表示部、
前記画面上で前記前段部分又は後段部分からいずれかの語句を第二キーワードとしてユーザに選択させるためのキーワード選択部、及び、
前記キーワード選択部を通じてユーザにより選択された第二キーワードを新たな第一キーワードとして注目部分を抽出するよう前記注目部分抽出部に指示するキーワード変更部、
を備えることを特徴としている。
【０００８】
また、本発明に係る文書処理プログラムは、コンピュータを前記第一又は第二の文書処理装置として機能させることを特徴とする。
【０００９】
【発明の実施の形態】
本発明に係る第一の文書処理装置による文書処理の手順は次の通りである。
【００１０】
（第一キーワードの指定）
ユーザがキーワード指定部を通じて第一キーワードを指定する。例えば、キーワード指定部が、画面上にキーワード入力用テキストボックスを表示し、ユーザがそのテキストボックスに任意の語句を第一キーワードとして入力する。
【００１１】
（注目部分の抽出）
注目部分抽出部が、第一キーワードを含む所定範囲の記述を文書中から注目部分として抽出する。注目部分は、文書中において第一キーワードが出現する個所毎に抽出する。抽出範囲は例えば次のようなルールに従って決定する。
例１：第一キーワードより前に出現する直近の句点の次の文字を始端とし、第一キーワードの後に出現する直近の句点を終端とする範囲。
例２：第一キーワードを基準として前後にそれぞれ所定数（例：１０個）の形態素を含む範囲。
例３：例１と例２の論理積として規定される範囲。
【００１２】
（画面への注目部分の表示）
注目部分表示部が、上記のようにして抽出された各注目部分を、第一キーワードに先行する前段部分、第一キーワード、及び、第一キーワードに続く後段部分とに分離して画面に表示する。表示形式の例としては以下のようなものが挙げられる。
例１：単純に前段部分、第一キーワード及び後段部分に３分割する形式（以下、「単純３分割表示」と呼ぶ）。
例２：前段部分及び後段部分をそれぞれ形態素に分解して表示する形式（以下、「形態素分解表示」と呼ぶ）。この表示形式としては、例えば、ツリー形式（各ノードが形態素に対応）、表形式（各セルが形態素に対応）、分かち書き形式（スラッシュ、空白等の分離文字を用いて分かち書き）等が挙げられる。
【００１３】
（第二キーワードの選択）
ユーザがキーワード選択部を通じて画面上で前記前段部分又は後段部分から第二キーワードを選択する。例えば、単純３分割表示の場合、ユーザが、マウス等の入力装置を適宜操作して、前段部分又は後段部分中の任意の語句を選択（ハイライト）する。また、形態素分解表示の場合、ユーザが、画面に表示された多数の形態素の中から任意の形態素を選択する。
【００１４】
（前段部分又は後段部分の選択表示）
表示更新部が、第二キーワードの選択に応じて各注目部分の前段部分又は後段部分を選択的に表示する。第二キーワードが前段部分で選択された場合、そのキーワードを含む注目部分の後段部分を選択的に表示し、第二キーワードが後段部分で選択された場合、そのキーワードを含む注目部分の前段部分を選択的に表示する。ここで、「選択的に表示する」とは、第二キーワードを含む注目部分の前段部分又は後段部分（以下、「該当部分」と呼ぶ）を第二キーワードを含まない注目部分の前段部分又は後段部分（以下、「非該当部分」と呼ぶ）と区別して表示することをいう。以下に具体例を示す。
例１：該当部分のみ画面に表示し、非該当部分を非表示にする。
例２：該当部分と非該当部分を異なる色で表示する。
【００１５】
本発明に係る第二の文書処理装置による文書処理の手順は次の通りである。
【００１６】
まず、第一の文書処理装置と同様に、ユーザが第一キーワードを指定すると、注目部分抽出部が前記第一キーワードに対応する注目部分を抽出し、更に、注目部分表示部が各注目部分を前段部分、第一キーワード及び後段部分に分離して表示する。次に、ユーザが、画面に表示されたいずれかの注目部分の前段部分又は後段部分から第二キーワードを選択して所定のキーワード変更操作を実行すると、キーワード変更部がその第二キーワードを新たな第一キーワードとして注目部分抽出部に渡す。これを受けた注目部分抽出部は、新たな第一キーワードに対応する注目部分を抽出する。また、注目部分表示部は、抽出された各注目部分を、第一キーワードに先行する前段部分、第一キーワード、及び、第一キーワードに続く後段部分とに分離して画面に表示する。こうしてユーザは、必要に応じて何度でもキーワードを変更することができる。
【００１７】
本発明に係る文書処理装置において、ユーザが前記キーワード選択部を通じて第二キーワードを選択し、所定のキーワード連結操作を実行すると、前記第二キーワードから第一キーワードまでの範囲に含まれる語句を連結して成る文字列を新たな第一キーワードとして前記注目部分表示部に渡すキーワード連結部を更に設けてもよい。
【００１８】
本発明に係る文書処理装置において、ユーザが前記キーワード選択部を通じて第二キーワードを選択し、所定の原文参照操作を実行すると、前記第一キーワード及び第二キーワードを含む注目部分が出現する原文中の記述部分を表示する原文表示部を更に設けてもよい。
【００１９】
本発明に係る文書処理装置において、処理対象文書が複数の部分から成る構造化文書である場合、該文書中の特定部分だけを対象とする処理対象限定部を更に設けてもよい。構造化文書の例としては、特許出願書類や学会用論文のように規定様式に従って作成された文書、ＨＴＭＬやＸＭＬ等のマークアップ言語で記述された文書が挙げられる。
【００２０】
【発明の効果】
以上のように、本発明に係る文書処理装置によれば、注目語句（第一キーワード）を含む原文中の各記述部分がその前段部分と後段部分とともに表示される。これにより、ユーザは、その語句が文中でどのような意味合いで用いられているかを簡単に確認することができる。また、注目語句の前段又は後段にある任意の語句を第二キーワードとして選択することにより、検索結果の絞り込みや新たなキーワードでの検索を簡単な操作で実行することができる。
【００２１】
【実施例】
本発明に係る文書処理装置の一実施例である特許公報分析装置について図面を参照しながら説明する。図１は本実施例の特許公報分析装置（以下、本装置と呼ぶ）の概略構成図である。本装置１は、入力装置１０、記憶装置１２、表示装置１４、本体ユニット１６等を備えるコンピュータ上で所定プログラムを動作させることにより構成された文書解析部１８を備えている。記憶装置１２には、公開特許公報（以下、単に文書と呼ぶ）のファイルが保存されている。本体ユニット１６にはＣＰＵやメモリ（いずれも図示せず）等が含まれる。
【００２２】
本装置の動作について図２のフローチャートを参照しながら説明する。
【００２３】
まず、ユーザが、入力装置１０の所定操作により、記憶装置１２に保存された目的の文書ファイルを指定する（ステップＳ１０）。この操作を受けて、文書解析部１８は、指定の文書ファイルに保存された文章を読み込み（ステップＳ１２）、形態素解析によりその文章を多数の形態素に分解する（ステップＳ１４）。これらの形態素は双方向リストの形でメモリに保持される。
【００２４】
次に、文書解析部１８は、表示装置１４にキーワード分析画面を表示する（ステップＳ１６）。図３にキーワード分析画面の概略構成を示す。キーワード分析画面２０には、文書情報表示領域２０２、分析対象項目指定領域２０４、分析条件設定領域２０６及び分析結果表示領域２０８が設けられている。文書情報表示領域２０２には文書の基本属性情報が表示される。分析対象項目指定領域２０４には、文書の各構成部分に対応するチェックボックスが設けられている。このチェックボックスにより、ユーザは、文書中の所望の項目のみをキーワード分析対象として指定することができる。分析条件設定領域２０６には、「分析キー」テキストボックス２１０、「深さ」テキストボックス２１２及び「分析実行」ボタン２１４が設けられている。「分析キー」テキストボックス２１０には、本発明の第一キーワードとなる語句を入力する。「深さ」テキストボックス２１２には、注目部分の大きさを指定する数値を入力する。すなわち、「深さ」は、各注目部分の前段部分及び後段部分がそれぞれ含みうる形態素数の最大値を示す。
【００２５】
ユーザが「分析キー」テキストボックス２１０にキーワード（第一キーワード）を入力して「分析実行」ボタン２１４を押下すると、文書解析部１８は、前記形態素リストを走査して前記キーワードを検索し（ステップＳ１８）、前記キーワードの出現個所毎にそのキーワードに対応する注目部分（すなわち前段部分及び後段部分）の範囲を決定する（ステップＳ２０）。注目部分の範囲は次のルールに従って決定される。
・前段部分の決定ルール：キーワードに先行する「深さ」数分の形態素から成る範囲に句点が含まれている場合は、その句点の次の形態素からキーワードの直前の形態素までを前段部分と定める。前記範囲に句点が含まれていない場合は、その範囲をそのまま前段部分と定める。
・後段部分の決定ルール：キーワードに続く「深さ」数分の形態素から成る範囲に句点が含まれている場合は、キーワードの直後の形態素からその句点までを後段部分と定める。前記範囲に句点が含まれていない場合は、その範囲をそのまま後段部分と定める。
【００２６】
次に、文書解析部１８は、キーワードの出現個所毎に抽出した注目部分（前段部分及び後段部分）を形態素に分離し、ツリー形式で分析結果表示領域２０８に表示する（ステップＳ２４）。この形態素ツリーの生成手順について図４〜図６を参照しながら説明する。いま、図４に示したような内容のサンプル文書を本装置１で処理することを考える。キーワードは「形態素」、深さは「８」、分析対象項目は「発明の名称」、「要約」及び「特許請求の範囲」とする。なお、サンプル文書中の下線は説明のために付したものであって、文書の内容を構成するものではない。
【００２７】
図４を見ると、キーワード「形態素」は文書中の６個所に出現している（サンプル文書中の下線部参照）。従って、この文書は分析対象範囲中に６個の注目部分を含むことになる。各注目部分毎の内容を図５の表Ａに示す。符号Ｐ１〜Ｐ６は６個のキーワードの出現個所に対応する前段部分、符号Ｓ１〜Ｓ６は同じく後段部分を示す。６番目の出現個所に対応する前段部分Ｐ６は、句点の次の形態素からキーワードの直前の形態素までの４個の形態素で構成されている。また、５番目の出現個所に対応する後段部分Ｓ５は、キーワードの直後の形態素から句点までの５個の形態素で構成されている。その他の前段部分及び後段部分はいずれも８個の形態素で構成されている。
【００２８】
次に、「前１」「前２」…「前８」をソートキーにして前段部分Ｐ１〜Ｐ６を並べ替えると図５の表Ｂのようになり、更に、「後１」「後２」…「後８」をソートキーとして後段部分Ｓ１〜Ｓ６を並べ替えると図５の表Ｃのようになる。表Ｃの同一列内に連続して出現する複数の同一語句を１つにまとめると、図６の表Ｄのようになる。そして、キーワード及び各語句をツリーのノードとみなすと、図３に示したようなツリー構造が得られる。
【００２９】
分析結果表示領域２０８に表示されたいずれかの語句（形態素）を第二キーワードとして選択し、所定の操作を実行すると、ポップアップメニュー２１５が画面に表示される。このメニュー２１５から、ユーザは以下に説明するいずれかの処理を選択し、文書解析部１８に実行させることができる。
【００３０】
（絞り込み）
絞り込み処理は、第一キーワード及び第二キーワードの両方を含む注目部分だけを選択的に表示する処理である。本装置１では、第二キーワードが前段部分で選択された場合は後段部分の選択的表示を行い、第二キーワードが後段部分で選択された場合は前段部分の選択的表示を行う。例として、図３において符号２１６で示した「文書」という語を選択し、メニュー２１５の「絞り込み」を選択した場合を考える。図３を見ると、符号２１６で示した語はキーワード（「形態素」）より形態素４個分だけ前に出現している。図５の表Ａを見ると、「前４」列に「文書」という語を含むのは前段部分Ｐ１及びＰ４の２つである。そして、前段部分Ｐ１及びＰ４に対応する後段部分Ｓ１及びＳ４の内容は全く同じである。そこで、文書解析部１８は、図７に示したように、形態素ツリーの後段部分ではＳ１及びＳ４の内容に相当する枝だけを表示する。
【００３１】
（キーワード連結）
キーワード連結処理は、第一キーワードから第二キーワードまでの範囲に含まれる形態素を連結して成る文字列を新たな第一キーワードに設定して文書解析を実行する処理である。第二キーワードは前段部分及び後段部分のいずれで選択してもよい。例として、図３において符号２１８で示した「強調」という語を選択し、メニュー２１５の「キーワード連結」を選択した場合を考える。この場合、文書解析部１８は、第一キーワード「形態素」から第二キーワード「強調」までに含まれる形態素を連結することにより「形態素を強調」という語句を生成し、それを新たな第一キーワードとして「分析キー」テキストボックス２１０に設定する。その後、ユーザが「分析実行」ボタン２１４を押下すると、文書解析部１８が先に説明した手順で文書を解析し、図８に示したようなツリーを分析結果表示領域２０８に表示する。
【００３２】
（キーワード変更）
キーワード変更処理は、分析結果表示領域２０８で選択された語句を新たな第一キーワードに設定して文書解析を実行する処理である。例えば、図３において符号２１８で示した「強調」という語を選択し、メニュー２１５の「キーワード変更」を選択した場合を考える。この場合、文書解析部１８は、現在のキーワード「形態素」に代えて「強調」を新たなキーワードとして「分析キー」テキストボックス２１０に設定する。その後、ユーザが「分析実行」ボタン２１４を押下すると、文書解析部１８が先に説明した手順で文書を解析し、新たなツリー（図示せず）を分析結果表示領域２０８に表示する。
【００３３】
（原文参照）
原文参照処理は、第一キーワードから第二キーワードまでの語句を含む原文の記載範囲を別ウィンドウに表示する処理である。例えば、図３において符号２１６で示した「文書」という語を選択し、メニュー２１５の「原文参照」を選択した場合を考える。この操作を受けた文書解析部１８は、原文参照ウィンドウ２２０を生成し、第一キーワードから第二キーワードまでの形態素を連結して成る語句「文書の内容を形態素」を強調表示した形で原文を表示する（図３の例では該当語句に下線を付している）。該当語句が原文中の複数個所に出現している場合、全ての該当語句を強調表示する。ユーザは「前へ」ボタン２２２及び「次へ」ボタン２２４を使うことにより、複数個所の該当語句を順次ウィンドウ２２０内に表示させることができる。
【００３４】
以上、本発明に係る文書処理装置の一実施例について説明したが、上記実施例はあくまで一例に過ぎない。以下に幾つかの変形例を挙げる。
（例１）分析結果表示領域２０８に分析結果を表示する形式をツリー形式ではなく図５〜図６のような表形式にしてもよい。
（例２）処理対象文書の種類は特許公報に限らず、他の種類の文書でもよい。
（例３）文書の読み込みはネットワーク経由で行ってもよい。
（例４）入力装置及び表示装置と本体ユニットはネットワーク経由で通信するようにしてもよい。
【００３５】
例４の具体的形態として、本発明をインターネットの情報検索システム（以下、検索サーバと呼ぶ）に応用したものが考えられる。この場合のシステム動作は次のようになる。まず、ユーザが自分の端末（図１の入力装置１０及び表示装置１４に相当）を用いて検索サーバ（本体ユニット１６の文書解析部１８に相当）に検索キーワードを送信する。これを受けた検索サーバは、記憶装置に保存された各ＷＥＢサイトのコンテンツのうち自然言語で記述されたもの（ＨＴＭＬ文書等）を先に説明した手順で分析し、結果をユーザの端末に送信する。この結果を受けた端末の画面には、前記キーワードを含む各文書からの抽出部分がツリー形式、表形式等の所定形式で表示される。更に、この画面には、各文書の公開元であるＷＥＢサイトへのリンクも表示される。ユーザが画面上で第二キーワードを選択し、「絞り込み」「キーワード連結」等のボタンを押下すると、押下されたボタンに対応する処理要求が端末からサーバに送信される。この要求を受けたサーバは、要求された処理を実行し、その結果を端末に返す。このようにしてユーザは、各文書において前記キーワードがどのような文脈で用いられているかを確認しながら、目的に合った文書の候補を絞り込むことができる。目的に合った文書が見つかったら、上記リンクを通じて簡単にその文書の発信元であるＷＥＢサイトを閲覧することができる。
【００３６】
上記の他にも、本発明の精神及び範囲内で様々な形態の実施例が考えられる。
【図面の簡単な説明】
【図１】本発明に係る文書処理装置の一実施例である特許公報分析装置の概略構成図。
【図２】図１の装置の動作を示すフローチャート。
【図３】キーワード分析画面の概略構成図。
【図４】サンプル文書の内容を示す図。
【図５】形態素ツリーの生成手順を説明するための図。
【図６】形態素ツリーの生成手順を説明するための図。
【図７】絞り込み処理により得られた形態素ツリーの一例。
【図８】キーワード連結処理により得られた形態素ツリーの一例。
【符号の説明】
１０…入力装置
１２…記憶装置
１４…表示装置
１６…本体ユニット
１８…文書解析部

[0001]
[Technical field of the invention]
The present invention relates to a document processing apparatus and program for efficiently retrieving and extracting important passages from a document written in natural language.
[0002]
[Prior art]
An example of such a document processing apparatus is the information extraction apparatus described in Patent Document 1. The outline of document processing by this apparatus is as follows.
- A subject information extraction unit extracts subject presentation phrases from the words obtained by morphological analysis of the document.
- A word similarity analysis unit calculates the similarity between those words.
- An important phrase extraction unit extracts, as important phrases, words that are highly similar to the words constituting a subject presentation phrase.
- An output unit displays the important phrases linked to the words constituting the subject presentation phrase.
[0003]
Patent Document 1 also discloses an apparatus configured to let the user specify the subject presentation phrase in this process. This apparatus first extracts the subject presentation phrases and their related important phrases from the target document by the above procedure, and displays the contents of the document with these phrases underlined. The user can select any of the underlined phrases as a new subject presentation phrase. When the user selects a phrase, the apparatus extracts words that are highly similar to the words constituting that phrase as new important phrases, and redisplays the contents of the document with these phrases underlined.
[Patent Document 1] JP 2000-298673 A
[0004]
[Problems to be solved by the invention]
When examining the contents of a document, one sometimes wants to know in what context a phrase of interest is actually used. This requires examining the text before and after the phrase at each place where it appears in the document. Evaluating the apparatus of Patent Document 1 from this viewpoint, the following can be said. First, the method of displaying subject presentation phrases and important phrases linked together shows only the relationships among words or phrases extracted from the entire document, so it does not reveal in what context each phrase is actually used. On the other hand, the method of displaying the contents of the document with important phrases underlined displays not only the text before and after the phrase of interest but also all other parts of the document. Consequently, when the document is long, for example, it takes considerable effort to examine the text before and after the phrase at every place where it appears.
[0005]
The present invention has been made to solve the above problems, and its object is to provide a document processing apparatus with which a user can efficiently examine, by a simple operation, in what context a phrase of interest is used at each place in a document.
[0006]
[Means for Solving the Problems]
The first document processing apparatus according to the present invention, made to solve the above problems, comprises:
a keyword specification unit for letting the user specify a first keyword;
an attention part extraction unit that, for each occurrence of the first keyword in the document to be processed, extracts a predetermined range of text including the first keyword as an attention part;
an attention part display unit that displays each attention part on a screen, separated into a preceding part that precedes the first keyword, the first keyword, and a subsequent part that follows the first keyword;
a keyword selection unit for letting the user select, on the screen, any word or phrase from the preceding part or the subsequent part as a second keyword; and
a display update unit that, when the second keyword is selected from the preceding part, selectively displays the subsequent part of each attention part including the second keyword and, when the second keyword is selected from the subsequent part, selectively displays the preceding part of each attention part including the second keyword.
[0007]
The second document processing apparatus according to the present invention comprises:
a keyword specification unit for letting the user specify a first keyword;
an attention part extraction unit that, for each occurrence of the first keyword in the document to be processed, extracts a predetermined range of text including the first keyword as an attention part;
an attention part display unit that displays each attention part on a screen, separated into a preceding part that precedes the first keyword, the first keyword, and a subsequent part that follows the first keyword;
a keyword selection unit for letting the user select, on the screen, any word or phrase from the preceding part or the subsequent part as a second keyword; and
a keyword change unit that instructs the attention part extraction unit to extract attention parts using the second keyword selected by the user through the keyword selection unit as a new first keyword.
[0008]
A document processing program according to the present invention causes a computer to function as the first or second document processing apparatus.
[0009]
[Embodiments of the invention]
The procedure of document processing by the first document processing apparatus according to the present invention is as follows.
[0010]
(Specifying the first keyword)
The user specifies the first keyword through the keyword specification unit. For example, the keyword specification unit displays a keyword input text box on the screen, and the user enters an arbitrary word or phrase in the text box as the first keyword.
[0011]
(Extracting the attention parts)
The attention part extraction unit extracts from the document, as an attention part, a predetermined range of text including the first keyword. An attention part is extracted for each place where the first keyword appears in the document. The extraction range is determined according to rules such as the following.
Example 1: The range that starts at the character following the nearest period before the first keyword and ends at the nearest period after the first keyword.
Example 2: The range containing a predetermined number (e.g., 10) of morphemes on each side of the first keyword.
Example 3: The range defined as the logical product (intersection) of Examples 1 and 2.
[0012]
(Displaying the attention parts on the screen)
The attention part display unit displays each attention part extracted as described above on the screen, separated into the preceding part that precedes the first keyword, the first keyword, and the subsequent part that follows the first keyword. Examples of display formats include the following.
Example 1: A format that simply divides each attention part into three: the preceding part, the first keyword, and the subsequent part (hereinafter, “simple three-part display”).
Example 2: A format in which the preceding part and the subsequent part are each broken down into morphemes and displayed (hereinafter, “morpheme decomposition display”). Examples of this format include a tree format (each node corresponds to a morpheme), a table format (each cell corresponds to a morpheme), and a word-separated format (morphemes delimited by a separator character such as a slash or a space).
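The simple three-part display can be pictured as a keyword-in-context listing. The following Python is only an illustrative sketch: the function name `three_part_lines`, the ` | ` separator, and the right-alignment of the preceding part are assumptions made here, not details given in this specification.

```python
from typing import List, Tuple

# (preceding morphemes, subsequent morphemes) for one occurrence of the keyword
Part = Tuple[List[str], List[str]]

def three_part_lines(parts: List[Part], keyword: str, sep: str = " ") -> List[str]:
    """Format each attention part as 'preceding | keyword | subsequent'."""
    fronts = [sep.join(front) for front, _ in parts]
    # Right-align the preceding parts so the keyword lines up in one column.
    width = max((len(text) for text in fronts), default=0)
    return [
        f"{text.rjust(width)} | {keyword} | {sep.join(rear)}"
        for text, (_, rear) in zip(fronts, parts)
    ]
```

Aligning the keyword in a single column is one way to make the contexts easy to compare line by line; a table or tree view would carry the same information.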
[0013]
(Selecting the second keyword)
The user selects the second keyword from the preceding part or the subsequent part on the screen through the keyword selection unit. For example, in the simple three-part display, the user operates an input device such as a mouse to select (highlight) an arbitrary word or phrase in the preceding part or the subsequent part. In the morpheme decomposition display, the user selects an arbitrary morpheme from among the many morphemes displayed on the screen.
[0014]
(Selective display of the preceding part or the subsequent part)
The display update unit selectively displays the preceding part or the subsequent part of each attention part in response to the selection of the second keyword. When the second keyword is selected in the preceding part, the subsequent parts of the attention parts including that keyword are selectively displayed; when the second keyword is selected in the subsequent part, the preceding parts of the attention parts including that keyword are selectively displayed. Here, “selectively display” means displaying the preceding or subsequent parts of the attention parts that include the second keyword (hereinafter, “relevant parts”) so that they are distinguished from the preceding or subsequent parts of the attention parts that do not include it (hereinafter, “non-relevant parts”). Specific examples follow.
Example 1: Display only the relevant parts on the screen and hide the non-relevant parts.
Example 2: Display the relevant parts and the non-relevant parts in different colors.
[0015]
The procedure of document processing by the second document processing apparatus according to the present invention is as follows.
[0016]
First, as in the first document processing apparatus, when the user specifies the first keyword, the attention part extraction unit extracts the attention parts corresponding to the first keyword, and the attention part display unit displays each attention part separated into the preceding part, the first keyword, and the subsequent part. Next, when the user selects a second keyword from the preceding part or the subsequent part of any attention part displayed on the screen and executes a predetermined keyword change operation, the keyword change unit passes the second keyword to the attention part extraction unit as a new first keyword. On receiving it, the attention part extraction unit extracts the attention parts corresponding to the new first keyword. The attention part display unit then displays each extracted attention part on the screen, separated into the preceding part that precedes the first keyword, the first keyword, and the subsequent part that follows the first keyword. In this way, the user can change the keyword as many times as necessary.
[0017]
The document processing apparatus according to the present invention may further include a keyword concatenation unit that, when the user selects a second keyword through the keyword selection unit and executes a predetermined keyword concatenation operation, passes the character string formed by concatenating the words in the range from the second keyword to the first keyword to the attention part display unit as a new first keyword.
[0018]
The document processing apparatus according to the present invention may further include an original text display unit that, when the user selects a second keyword through the keyword selection unit and executes a predetermined original text reference operation, displays the passage of the original text in which the attention part including the first keyword and the second keyword appears.
[0019]
In the document processing apparatus according to the present invention, when the processing target document is a structured document composed of a plurality of parts, a processing target limiting unit that targets only a specific part of the document may be further provided. Examples of structured documents include documents created according to a prescribed format, such as patent application documents and academic papers, and documents written in a markup language such as HTML or XML.
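As a rough sketch of the processing target limiting unit, a structured document can be modeled as named parts from which only the selected ones are passed on for analysis. The dictionary representation, the part names, and the function name `limit_targets` are all assumptions made for this illustration.

```python
from typing import Dict, Iterable, List

def limit_targets(document: Dict[str, str], selected: Iterable[str]) -> List[str]:
    """Return the text of the selected parts only, in document order."""
    wanted = set(selected)
    # Names that the document does not contain are simply ignored.
    return [text for name, text in document.items() if name in wanted]
```

For example, `limit_targets(doc, ["title", "claims"])` would hand only those two parts to the keyword analysis and leave the description untouched.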
[0020]
[Effects of the invention]
As described above, the document processing apparatus according to the present invention displays each passage of the original text that includes the phrase of interest (the first keyword) together with its preceding part and subsequent part. The user can thus easily confirm in what sense the phrase is used in the text. In addition, by selecting an arbitrary word or phrase before or after the phrase of interest as the second keyword, the user can narrow down the search results or search with a new keyword by a simple operation.
[0021]
[Embodiment]
A patent publication analysis apparatus, which is an embodiment of the document processing apparatus according to the present invention, will be described with reference to the drawings. FIG. 1 is a schematic configuration diagram of the patent publication analysis apparatus of this embodiment (hereinafter, the present apparatus). The present apparatus 1 includes a document analysis unit 18 realized by running a predetermined program on a computer that includes an input device 10, a storage device 12, a display device 14, a main unit 16, and the like. The storage device 12 stores files of published patent publications (hereinafter simply called documents). The main unit 16 includes a CPU, a memory, and the like (none of which are shown).
[0022]
The operation of the present apparatus will be described with reference to the flowchart of FIG. 2.
[0023]
First, the user designates a target document file stored in the storage device 12 by a predetermined operation of the input device 10 (step S10). In response to this operation, the document analysis unit 18 reads the text stored in the designated document file (step S12) and breaks it down into a large number of morphemes by morphological analysis (step S14). These morphemes are held in memory in the form of a doubly linked (bidirectional) list.
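A bidirectional list lets the analysis walk both backward and forward from any occurrence of the keyword. The sketch below illustrates that idea only; the class name `MorphemeNode` and the helpers `build_list` and `walk` are hypothetical, and the morphemes are assumed to come from some morphological analyzer that is not shown here.

```python
from typing import Iterable, List, Optional

class MorphemeNode:
    """One morpheme with links to its neighbours in the document."""
    __slots__ = ("surface", "prev", "next")

    def __init__(self, surface: str) -> None:
        self.surface = surface
        self.prev: Optional["MorphemeNode"] = None
        self.next: Optional["MorphemeNode"] = None

def build_list(morphemes: Iterable[str]) -> Optional[MorphemeNode]:
    """Link the morphemes in order and return the head node (None if empty)."""
    head: Optional[MorphemeNode] = None
    tail: Optional[MorphemeNode] = None
    for surface in morphemes:
        node = MorphemeNode(surface)
        if tail is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head

def walk(node: MorphemeNode, steps: int, forward: bool = True) -> List[str]:
    """Collect up to `steps` surfaces after (or before) `node`, nearest first."""
    out: List[str] = []
    cur = node.next if forward else node.prev
    while cur is not None and len(out) < steps:
        out.append(cur.surface)
        cur = cur.next if forward else cur.prev
    return out
```

An ordinary Python list with index arithmetic would serve equally well; the linked form is shown only because the text describes a bidirectional list.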
[0024]
Next, the document analysis unit 18 displays a keyword analysis screen on the display device 14 (step S16). FIG. 3 shows the schematic configuration of the keyword analysis screen. The keyword analysis screen 20 includes a document information display area 202, an analysis target item designation area 204, an analysis condition setting area 206, and an analysis result display area 208. The document information display area 202 displays basic attribute information of the document. The analysis target item designation area 204 contains check boxes corresponding to the respective components of the document. With these check boxes, the user can designate only the desired items of the document as targets of keyword analysis. The analysis condition setting area 206 contains an “analysis key” text box 210, a “depth” text box 212, and an “execute analysis” button 214. The phrase that serves as the first keyword of the present invention is entered in the “analysis key” text box 210. A numerical value specifying the size of the attention part is entered in the “depth” text box 212. That is, the “depth” is the maximum number of morphemes that the preceding part and the subsequent part of each attention part can each contain.
[0025]
When the user enters a keyword (the first keyword) in the “analysis key” text box 210 and presses the “execute analysis” button 214, the document analysis unit 18 scans the morpheme list to search for the keyword (step S18) and, for each occurrence of the keyword, determines the range of the attention part corresponding to that keyword (that is, the preceding part and the subsequent part) (step S20). The range of the attention part is determined according to the following rules.
- Rule for the preceding part: If the range consisting of the “depth” morphemes preceding the keyword contains a period, the preceding part runs from the morpheme following that period to the morpheme immediately before the keyword. If the range contains no period, the range itself is the preceding part.
- Rule for the subsequent part: If the range consisting of the “depth” morphemes following the keyword contains a period, the subsequent part runs from the morpheme immediately after the keyword to that period. If the range contains no period, the range itself is the subsequent part.
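The two rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the document is already a flat list of morpheme strings, the period appears as its own token, and when several periods fall inside a window the one nearest the keyword is used. The names `PERIODS` and `context_parts` are hypothetical.

```python
from typing import List, Tuple

PERIODS = {"。", "."}  # assumed sentence-ending tokens

def context_parts(morphemes: List[str], index: int, depth: int) -> Tuple[List[str], List[str]]:
    """Return (preceding, subsequent) for the keyword at morphemes[index]."""
    # Preceding part: at most `depth` morphemes before the keyword.
    front = morphemes[max(0, index - depth):index]
    # If the window contains a period, keep only what follows the nearest one.
    for i in range(len(front) - 1, -1, -1):
        if front[i] in PERIODS:
            front = front[i + 1:]
            break
    # Subsequent part: at most `depth` morphemes after the keyword.
    rear = morphemes[index + 1:index + 1 + depth]
    # If the window contains a period, stop at the first one (inclusive).
    for i, morpheme in enumerate(rear):
        if morpheme in PERIODS:
            rear = rear[:i + 1]
            break
    return front, rear
```

Note the asymmetry that the rules imply: the period is excluded from the preceding part but included at the end of the subsequent part.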
[0026]
Next, the document analysis unit 18 breaks the attention part (the preceding part and the subsequent part) extracted for each occurrence of the keyword into morphemes and displays them in tree format in the analysis result display area 208 (step S24). The procedure for generating this morpheme tree will be described with reference to FIGS. 4 to 6. Suppose that the present apparatus 1 processes a sample document with the contents shown in FIG. 4. The keyword is “morpheme”, the depth is “8”, and the analysis target items are “Title of the invention”, “Abstract”, and “Claims”. The underlines in the sample document are added for explanation and are not part of the document's contents.
[0027]
As FIG. 4 shows, the keyword “morpheme” appears at six places in the document (see the underlined portions of the sample document). The document therefore contains six attention parts within the analysis target range. Table A in FIG. 5 shows the contents of each attention part. Reference signs P1 to P6 denote the preceding parts corresponding to the six occurrences of the keyword, and reference signs S1 to S6 denote the corresponding subsequent parts. The preceding part P6, corresponding to the sixth occurrence, consists of the four morphemes from the morpheme following a period to the morpheme immediately before the keyword. The subsequent part S5, corresponding to the fifth occurrence, consists of the five morphemes from the morpheme immediately after the keyword to a period. All other preceding and subsequent parts consist of eight morphemes each.
[0028]
Next, sorting the preceding parts P1 to P6 using “front 1”, “front 2”, ..., “front 8” as sort keys yields Table B of FIG. 5, and further sorting the subsequent parts S1 to S6 using “rear 1”, “rear 2”, ..., “rear 8” as sort keys yields Table C of FIG. 5. Merging identical phrases that appear consecutively in the same column of Table C yields Table D of FIG. 6. Regarding the keyword and each phrase as nodes of a tree then gives a tree structure like that shown in FIG. 3.
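The sort-and-merge procedure produces, in effect, a prefix tree over the morpheme sequences on each side of the keyword. The sketch below builds such a tree directly as nested dictionaries instead of sorting table rows, which is a swapped-in technique chosen for brevity rather than the procedure described in the text. The names `build_trie` and `render` are hypothetical. For the preceding side, each sequence would be passed in reversed order so that the morpheme adjacent to the keyword sits at the top level.

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]  # morpheme -> subtree of the morphemes that follow it

def build_trie(paths: List[List[str]]) -> Tree:
    """Merge morpheme sequences that share a common prefix into one tree."""
    root: Tree = {}
    for path in paths:
        node = root
        for morpheme in path:
            node = node.setdefault(morpheme, {})
    return root

def render(tree: Tree, indent: int = 0) -> List[str]:
    """Return the tree as indented lines, siblings in sorted order."""
    lines: List[str] = []
    for key in sorted(tree):
        lines.append("  " * indent + key)
        lines.extend(render(tree[key], indent + 1))
    return lines
```

Unlike a purely column-wise merge, this sketch merges two cells only when everything between them and the keyword is also identical, which is the behaviour a tree display needs.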
[0029]
When the user selects any word (morpheme) displayed in the analysis result display area 208 as the second keyword and performs a predetermined operation, a pop-up menu 215 is displayed on the screen. From this menu 215, the user can select one of the processes described below and have the document analysis unit 18 execute it.
[0030]
(Narrow down)
The narrow-down process selectively displays only the attention parts that include both the first keyword and the second keyword. In the present apparatus 1, when the second keyword is selected in the preceding part, the subsequent parts are selectively displayed, and when the second keyword is selected in the subsequent part, the preceding parts are selectively displayed. As an example, consider the case where the word “document” indicated by reference numeral 216 in FIG. 3 is selected and “narrow down” is chosen from the menu 215. As FIG. 3 shows, the word indicated by reference numeral 216 appears four morphemes before the keyword (“morpheme”). According to Table A in FIG. 5, two preceding parts, P1 and P4, contain the word “document” in the “front 4” column. The subsequent parts S1 and S4 corresponding to the preceding parts P1 and P4 have exactly the same contents. Therefore, as shown in FIG. 7, the document analysis unit 18 displays only the branch corresponding to the contents of S1 and S4 on the subsequent-part side of the morpheme tree.
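The narrow-down step can be sketched as a filter over the extracted attention parts. In this illustrative Python, each attention part is assumed to be a pair of morpheme lists, and the second keyword is identified by its text, its side, and optionally its distance from the keyword (as with the “front 4” column in the example). The function name `narrow_down` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Part = Tuple[List[str], List[str]]  # (preceding morphemes, subsequent morphemes)

def narrow_down(parts: List[Part], second: str, side: str,
                offset: Optional[int] = None) -> List[List[str]]:
    """Return the opposite-side parts of the attention parts that contain `second`.

    side:   'front' if the second keyword was picked in a preceding part,
            'rear' if it was picked in a subsequent part.
    offset: 1-based distance from the keyword, or None to match anywhere.
    """
    shown: List[List[str]] = []
    for front, rear in parts:
        # Order the chosen side so that position 1 is adjacent to the keyword.
        seq = front[::-1] if side == "front" else rear
        if offset is None:
            hit = second in seq
        else:
            hit = offset <= len(seq) and seq[offset - 1] == second
        if hit:
            shown.append(rear if side == "front" else front)
    return shown
```

Returning only the matching parts corresponds to hiding the non-relevant ones; a display that uses different colors would instead keep every part and attach the `hit` flag to it.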
[0031]
(Keyword concatenation)
The keyword concatenation process sets, as a new first keyword, the character string formed by concatenating the morphemes in the range from the first keyword to the second keyword, and then executes document analysis. The second keyword may be selected in either the preceding part or the subsequent part. As an example, consider the case where the word “emphasis” (強調) indicated by reference numeral 218 in FIG. 3 is selected and “keyword concatenation” is chosen from the menu 215. In this case, the document analysis unit 18 concatenates the morphemes from the first keyword “morpheme” (形態素) to the second keyword “emphasis” to generate the phrase 形態素を強調 (roughly, “emphasize morphemes”) and sets it in the “analysis key” text box 210 as the new first keyword. When the user then presses the “execute analysis” button 214, the document analysis unit 18 analyzes the document by the procedure described above and displays a tree like that shown in FIG. 8 in the analysis result display area 208.
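Keyword concatenation amounts to joining the keyword with every morpheme between it and the selected second keyword. The following sketch assumes the second keyword is identified by its side and its 1-based distance from the keyword; the function name `concatenate_keyword` is hypothetical, and the empty separator reflects Japanese text, which is written without spaces.

```python
from typing import List

def concatenate_keyword(front: List[str], keyword: str, rear: List[str],
                        side: str, offset: int, sep: str = "") -> str:
    """Join the morphemes from the keyword to the second keyword, inclusive."""
    seq = front if side == "front" else rear
    if not 1 <= offset <= len(seq):
        raise ValueError("offset is outside the selected part")
    if side == "front":
        # Take the last `offset` morphemes of the preceding part.
        span = front[len(front) - offset:] + [keyword]
    else:
        # Take the first `offset` morphemes of the subsequent part.
        span = [keyword] + rear[:offset]
    return sep.join(span)
```

With the preceding part `["文書", "の", "内容", "を"]`, the keyword `"形態素"`, and the subsequent part `["を", "強調", "表示"]`, selecting the second morpheme on the rear side gives `"形態素を強調"`.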
[0032]
(Keyword change)
The keyword change process sets the word or phrase selected in the analysis result display area 208 as a new first keyword and then executes document analysis. For example, consider the case where the word “emphasis” indicated by reference numeral 218 in FIG. 3 is selected and “keyword change” is chosen from the menu 215. In this case, the document analysis unit 18 sets “emphasis” in the “analysis key” text box 210 as the new keyword in place of the current keyword “morpheme”. When the user then presses the “execute analysis” button 214, the document analysis unit 18 analyzes the document by the procedure described above and displays a new tree (not shown) in the analysis result display area 208.
[0033]
(Original text reference)
The original text reference process displays, in a separate window, the passage of the original text that includes the words from the first keyword to the second keyword. For example, consider the case where the word “document” (文書) indicated by reference numeral 216 in FIG. 3 is selected and “original text reference” is chosen from the menu 215. In response to this operation, the document analysis unit 18 creates an original text reference window 220 and displays the original text with the phrase 文書の内容を形態素, formed by concatenating the morphemes from the second keyword to the first keyword, highlighted (in the example of FIG. 3, the phrase is underlined). If the phrase appears at multiple places in the original text, every occurrence is highlighted. Using the “Previous” button 222 and the “Next” button 224, the user can bring the occurrences into view in the window 220 one after another.
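The original text reference can be sketched as a phrase search over the morpheme list plus a cursor that the “Previous” and “Next” buttons move. The names `find_phrase` and `HitCursor` are hypothetical, and stopping at the first and last hit (rather than wrapping around) is an assumption, since the text does not say what happens at the ends.

```python
from typing import List, Optional

def find_phrase(morphemes: List[str], phrase: List[str]) -> List[int]:
    """Return the start index of every occurrence of `phrase` in `morphemes`."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(morphemes) - n + 1) if morphemes[i:i + n] == phrase]

class HitCursor:
    """Steps through the occurrences for 'Previous' / 'Next' navigation."""

    def __init__(self, hits: List[int]) -> None:
        self.hits = hits
        self.pos = 0

    def current(self) -> Optional[int]:
        return self.hits[self.pos] if self.hits else None

    def next(self) -> Optional[int]:
        # Stay on the last hit instead of wrapping around (an assumption).
        if self.pos < len(self.hits) - 1:
            self.pos += 1
        return self.current()

    def previous(self) -> Optional[int]:
        # Stay on the first hit instead of wrapping around (an assumption).
        if self.pos > 0:
            self.pos -= 1
        return self.current()
```

The indices returned by `find_phrase` give both the spans to highlight and the positions to scroll to.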
[0034]
One embodiment of the document processing apparatus according to the present invention has been described above, but this embodiment is merely an example. Some modifications are given below.
(Example 1) The format of displaying the analysis result in the analysis result display area 208 may be a table format as shown in FIGS.
(Example 2) The type of document to be processed is not limited to a patent publication, but may be another type of document.
(Example 3) Reading of a document may be performed via a network.
(Example 4) The input device, the display device, and the main unit may communicate with each other via a network.
[0035]
As a specific form of Example 4, the present invention may be applied to an Internet information search system (hereinafter referred to as a search server). The system operates as follows. First, the user transmits a search keyword to the search server (corresponding to the document analysis unit 18 of the main unit 16) using his or her terminal (corresponding to the input device 10 and the display device 14 in FIG. 1). Upon receiving the keyword, the search server analyzes the natural-language contents of each WEB site stored in its storage device (such as HTML documents) according to the procedure described above and transmits the result to the user terminal. On the screen of the terminal that receives the result, the portions extracted from each document that contains the keyword are displayed in a predetermined format such as a tree format or a table format. A link to the WEB site on which each document is published is also displayed on this screen. When the user selects the second keyword on the screen and presses a button such as “narrow down” or “keyword link”, a processing request corresponding to the pressed button is transmitted from the terminal to the server. Upon receiving this request, the server executes the requested process and returns the result to the terminal. In this way, the user can narrow down candidate documents suitable for the purpose while confirming the context in which the keyword is used in each document. When a document suitable for the purpose is found, the WEB site that is the source of the document can easily be browsed through the link.
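The request handling on the search server side can be sketched in Python as follows. This is a simplified illustration under several assumptions: contexts are cut by a fixed character window instead of by morphemes, requests are plain dictionaries with hypothetical keys (`action`, `keyword`, `second_keyword`, `side`), only “search” and “narrow” actions are handled, and narrowing simply keeps the entries whose selected side contains the second keyword. The function names, the `WINDOW` value, and the sample URLs are likewise hypothetical.

```python
from typing import Dict, List

WINDOW = 20  # characters kept on each side of the keyword (assumed value)


def extract_parts(text: str, keyword: str) -> List[Dict[str, str]]:
    """Split the context around each keyword occurrence into three fields."""
    parts = []
    pos = text.find(keyword)
    while pos != -1:
        end = pos + len(keyword)
        parts.append({
            "preceding": text[max(0, pos - WINDOW):pos],
            "keyword": keyword,
            "subsequent": text[end:end + WINDOW],
        })
        pos = text.find(keyword, end)
    return parts


def handle_request(sites: Dict[str, str],
                   request: Dict[str, str]) -> List[Dict[str, str]]:
    """Process one terminal request and return the entries to display."""
    if request["action"] not in ("search", "narrow"):
        raise ValueError(f"unsupported action: {request['action']}")
    # Extract the attention parts of every stored document, keeping the link
    # to the site on which the document is published.
    results = []
    for url, text in sites.items():
        for part in extract_parts(text, request["keyword"]):
            results.append({"link": url, **part})
    if request["action"] == "narrow":
        side = request["side"]  # "preceding" or "subsequent"
        results = [r for r in results if request["second_keyword"] in r[side]]
    return results


# Invented stored documents, keyed by the URL of the publishing site.
sites = {
    "http://example.com/a": "the morpheme tree is shown",
    "http://example.com/b": "each morpheme is emphasized",
}
found = handle_request(sites, {"action": "search", "keyword": "morpheme"})
narrowed = handle_request(sites, {
    "action": "narrow", "keyword": "morpheme",
    "second_keyword": "tree", "side": "subsequent",
})
```

Because every entry carries its `link`, the terminal can render the narrowed list and still let the user open the source site directly.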
[0036]
In addition to the above, various forms of embodiments are conceivable within the spirit and scope of the present invention.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of a patent publication analysis apparatus which is an embodiment of a document processing apparatus according to the present invention.
FIG. 2 is a flowchart showing the operation of the apparatus shown in FIG.
FIG. 3 is a schematic configuration diagram of a keyword analysis screen.
FIG. 4 is a diagram showing the contents of a sample document.
FIG. 5 is a diagram for explaining a procedure for generating a morpheme tree.
FIG. 6 is a diagram for explaining a procedure for generating a morpheme tree.
FIG. 7 shows an example of a morpheme tree obtained by a narrowing process.
FIG. 8 shows an example of a morpheme tree obtained by keyword concatenation processing.
[Explanation of symbols]
10 ... Input device
12 ... Storage device
14 ... Display device
16 ... Main unit
18 ... Document analysis unit

Claims

A keyword specification section for allowing the user to specify the first keyword,
An attention part extraction unit that extracts a description of a predetermined range including the first keyword as an attention part for each occurrence of the first keyword in the document to be processed;
An attention part display unit that separates each attention part into a preceding part preceding the first keyword, a first keyword, and a subsequent part following the first keyword, and displays it on the screen;
A keyword selection unit for allowing the user to select, as a second keyword, any phrase from the preceding part or the subsequent part on the screen; and
A display update unit that, when the second keyword is selected from the preceding part, selectively displays the subsequent part of each attention part including the second keyword, and, when the second keyword is selected from the subsequent part, selectively displays the preceding part of each attention part including the second keyword,
A document processing apparatus comprising:

A keyword specification section for allowing the user to specify the first keyword,
An attention part extraction unit that extracts a description of a predetermined range including the first keyword as an attention part for each occurrence of the first keyword in the document to be processed;
An attention part display unit that separates each attention part into a preceding part preceding the first keyword, a first keyword, and a subsequent part following the first keyword, and displays it on the screen;
A keyword selection unit for allowing the user to select, as a second keyword, any phrase from the preceding part or the subsequent part on the screen; and
A keyword changing unit for instructing the attention part extraction unit to extract attention parts using, as a new first keyword, the second keyword selected by the user through the keyword selection unit;
A document processing apparatus comprising:

The attention part display unit decomposes each of the preceding part and the subsequent part into morphemes and displays them,
The document processing apparatus according to claim 1, wherein the keyword selection unit allows the user to select, as the second keyword, any morpheme included in the preceding part or the subsequent part.

A document processing program for causing a computer to function as the document processing apparatus according to claim 1.