JP2004070376A

JP2004070376A - Document display device and method therefor

Info

Publication number: JP2004070376A
Application number: JP2002169129A
Authority: JP
Inventors: Akio Yamashita; 山下　明男; Takeshi Nagamine; 永峯　猛志; Katsunori Yoshiji; 芳地　克典
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-06-10
Filing date: 2002-06-10
Publication date: 2004-03-04

Abstract

PROBLEM TO BE SOLVED: To display key phrases in a document with a display attribute that depends on their category. SOLUTION: A document storage part 11 stores a document to be read. A text extraction part 12 extracts the content text from each page of that document. The extracted text is stored in a text storage part 13 as one file per page. A named entity extraction part 14 receives the text file for each page from the text storage part 13, extracts the named entities it contains, outputs named entity information including the category, character string, positional information, and the like of each named entity, and stores that information in a named entity storage part 15. A display control data storage part 17 stores the display color for each category. A display data generation part 16, under the control of a display control part 18, generates color-coded display data to be displayed on a display unit 19. The display unit 19 displays the document page by page and color-codes each named entity occurrence according to its category. COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、文章の読解を支援する技術に関し、とくに重要な語句をカテゴリ分けし、カテゴリごとに表示属性を変えて表示するようにしたものである。
【０００２】
【背景技術】
大量の文書が電子化されるにともない膨大な量の文書が利用可能になってきている。利用者は、このような膨大な量の文書が有用なものかどうか、目的に合致したものかどうか等を、判断する必要がある。文書自体を読むか要約を作成してそれを読むかは別にして、最終的には、人間が、文書やその要約を読解して、その価値を判断する必要がある。そのような読解作業を容易に行えるようにすることが望まれる。
【０００３】
本発明者らは、鋭意研究を重ね、大量の電子文書を比較的容易かつ高速に読解する支援技術を開発するに至った。
【０００４】
すなわち、利用者が文書を読む際に、５Ｗ１Ｈ（誰が、何を、何時、どこで、なぜ、どのように）の語句または同様に重要な語句の出現箇所が強調表示されていると、大量の文書があっても、比較的容易に関連する文書かどうか判断できるという知見に到達した。なお、以下では、人名、地名、組織名等の固有名詞や、日時、価格（通貨）等の、５Ｗ１Ｈに相当する語句を固有表現と呼ぶこともある。本発明者らはこのような固有表現をカテゴリに分け表示分けすることが決めて有効であるという知見に到達した。Ｗｈｏに相当するカテゴリの固有表現としては、人名や組織名（企業名）があり、Ｗｈｅｒｅに相当するカテゴリの固有表現としては、地名があり、Ｗｈｅｎに相当するカテゴリの固有表現としては日時表現がある。
【０００５】
文章を表示する際に、重要語句（固有表現）の出現箇所をカテゴリに応じて色分けし、カテゴリ一覧表示からヒット箇所を表示することが望まれる。
【０００６】
なお、この発明と関連する先行技術としてはつぎのものが知られている。
（１）検索キーの出現箇所を同じ色で表示する検索システム（Ｎａｍａｚｕ。Ｎａｍａｚｕプロジェクトにより開発されている全文検索システム）や、検索キーごとに色を変えて表示する検索システム（Ｇｏｏｇｌｅ。商標）が知られている。
（２）特開２０００−９９５２６公報には、文書データから抽出された情報を限られた表示領域に表示する際に、ユーザにとって読みやすいスクロール速度となるようにする装置が開示されており、抽出する対象として固有名詞を含む文書に言及している。
【０００７】
従来技術（１）のように、検索キーの出現箇所の色を変えて表示させる場合には、どのようなカテゴリの語句かは一瞥して識別できず、人間が注意深く読んで判断する必要がある。
【０００８】
また、従来技術（２）のように、固有名詞を含む文書を抽出して表示する場合にも、カテゴリに応じた表示の識別までは言及されておらず、どのような語句かはやはり人間が判断する必要があった。
【０００９】
なお、関連する文書かどうかを判断する際に、文書全体を通読するのは効率的が悪い。従来の検索エンジンで検索されるのは文書であり、文書が複数ページからなる場合には全部のページに目を通す必要があった。より限定された範囲を読解するのみでよいようにすることも望まれる。
【００１０】
【発明が解決する課題】
この発明は、以上の事情を考慮してなされたものであり、文書に含まれる重要語句（固有表現）の出現箇所をカテゴリに応じた表示属性で表示する技術を提供することを目的としている。また、この発明は、読解する範囲を限定する技術を提供することも目的としている。なお、この発明は、上述の目的のみに限定されない。この発明の他の目的は以下の説明から理解される。
【００１１】
【課題を解決するための手段】
この発明によれば、上述の目的を達成するために、特許請求の範囲に記載のとおりの構成を採用している。ここでは、発明を詳細に説明するのに先だって、特許請求の範囲の記載について補充的に説明を行なっておく。
【００１２】
すなわち、この発明の一側面によれば、上述の目的を達成するために、文書表示装置に：複数の所定の属性のグループに含まれる任意の属性を有する語句を文書から抽出する手段と；抽出した語句に、それぞれの属性に応じて対応する表示属性を関連づける手段と；上記文書の少なくとも一部を表示し、かつ、上記文書の少なくとも一部に含まれる上記語句を対応する表示属性で表示する文書表示手段とを設けるようにしている。
【００１３】
この構成においては、語句の属性に応じた表示属性で語句の出現箇所をハイライトできるので、利用者はその部分のみを即座に理解して大凡の内容を理解し、さらに熟読する必要があるかどうかを判別し、熟読の価値がない文書については読解をスキップすることができる。
【００１４】
所定の属性のグループとは、固有名詞等の文脈を理解する上で重要な語句（５Ｗ１Ｈのカテゴリ）あるいは同等に重要な語句である。どのような語句が該当するかは文書処理等の内容、目的に依存し、限定されない。
【００１５】
また、この発明の他の側面によれば、文書表示装置に：複数の所定の属性のグループに含まれる任意の属性を有する語句を文書から抽出する手段と；抽出された語句を上記属性毎に分類して表示する語句表示手段と；利用者の操作に基づいて、抽出された語句または抽出された語句が有する属性を選択する手段と；上記文書の少なくとも一部を表示し、かつ、上記文書の少なくとも一部に含まれ、選択された上記語句または選択された上記属性を有する語句を、対応する表示属性で表示する文書表示手段とを設けるようにしている。
【００１６】
この構成においては、抽出された語句をその属性に応じて分類して表示しているので、その表示内容から文書の大凡の骨格を把握でき、また、必要に応じて、その骨格中の語句等を選択して文書中でハイライト表示して詳細な文脈中でその語句の内容を確認できる。
【００１７】
また、この発明のさらに他の側面によれば、文書検索装置に：複数の文書を記憶する手段と；文書を文書セグメントに分割する手段と；上記文書セグメントの各々について、索引を抽出する手段と；上記文書セグメントと上記索引との関連づけを記憶する索引格納手段と；文書セグメント単位で上記索引格納手段を参照してキーワード検索を行う検索手段と；上記検索手段による検索結果を表示する手段とを設けている。
【００１８】
この構成においては、文書単位でなく、文書セグメント単位で検索を行えるので、キーワードに合致した文脈を絞り込んでレビューすることができ、従来のように、ヒットした文書の全部のページに目を通すという煩わしさがなくなる。
【００１９】
文書セグメントは、例えばページである。その他、章や、節等の任意の文書構成要素でよい。
【００２０】
複数の所定の属性のグループに含まれる任意の属性を有する語句を文書セグメントから抽出して索引としてもよい。ヒットした文書セグメントを表示する際に抽出した語句を属性に応じた表示属性でハイライトして表示してもよい。
【００２１】
なお、この発明は装置またはシステムとして実現できるのみでなく、方法としても実現可能である。また、そのような発明の一部をソフトウェアとして構成することができることはもちろんである。またそのようなソフトウェアをコンピュータに実行させるために用いるソフトウェア製品もこの発明の技術的な範囲に含まれることも当然である。
【００２２】
この発明の上述の側面およびこの発明の他の側面は特許請求の範囲に記載され、以下実施例を用いて詳細に説明される。
【００２３】
【発明の実施の形態】
以下、この発明の実施例について説明する。
【００２４】
［第１の実施例］
図１は、ファイルサーバ等の記憶部に記憶されている多数の文書を閲覧して読解するのに適した文書処理装置に、この発明を適用した第１の実施例を示しており、この図において、文書処理装置は、文書格納部１１、テキスト抽出部１２、テキスト格納部１３、固有表現抽出部１４、固有表現格納部１５、表示データ生成部１６、表示制御データ格納部１７、表示制御部１８および表示部１９等を含んで構成されている。
【００２５】
文書格納部１１は、読解対象の文書を記憶するものである。文書格納部１１は逐次に１つの文書を指定して取り出せればよく、複数の文書を記憶していてもよい。テキスト抽出部１２は、読解対象の文書中の各ページから内容テキストを抽出する。抽出されたテキストはページごとに１つのファイルとしてテキスト格納部１３に記憶される。後に説明するように文書名に、ページ番号を表すサフィックスを付して、どの文書のどのページかがわかるようにすることが好ましい（例えば「ｆｏｏ＿０１」）。
【００２６】
固有表現抽出部１４は、ページ毎のテキストファイルをテキスト格納部１３から受け取ってその中に含まれる固有表現を抽出する。この例では、形態素解析と所定の抽出ルールを用いる。これについては後に一例を挙げて説明する。固有表現抽出部１４は、固有表現を抽出し、そのカテゴリ、固有表現文字列、位置情報等を含む固有表現情報を出力する。固有表現情報は固有表現格納部１５に記憶される。
【００２７】
表示制御データ格納部１７は、カテゴリ毎に表示する色を記憶する。表示制御部１８は、表示に関する種々の制御を行う。例えば、どのカテゴリについて色分けをイネーブルにするか、どのような表示形態で表示するかを制御する。表示データ生成部１６は、表示制御部１８の制御の下で、表示部１９において表示する表示データを生成する。例えば、色制御、ページ選択、フレームを含むＨＴＭＬ（ＸＭＬ等でもよい）形式のデータを生成する。
【００２８】
表示部１９は、文書のページ等を表示する。この際、カテゴリに応じた固有表現出現箇所の色分けを行ったりする。表示部１９は、例えば、ＨＴＭＬファイル等をラスタライズするいわゆるウェブブラウザである。
【００２９】
つぎに具体例を挙げてこの実施例の動作を説明する。
【００３０】
テキスト抽出部１２により抽出された１ページ分のテキストは例えば図２に示すようなものである。固有表現抽出部１４は、このテキストから図３に示すような固有表現情報を抽出する。抽出結果は例えばＸＭＬで記述されるが、これに限定されない。固有表現情報はそれぞれ「ｅｎｔｉｔｙ」のタグを用いて記述され、文字列、カテゴリ、位置情報（テキストの最初からのオフセットおよび語句の長さ）を含む。文字列は、例えば、「富士ゼロックス」（商標）、「山田」、「１００ドル」等である。カテゴリは「ＣＯＭＰＡＮＹ」、「ＰＥＲＳＯＮ」、「ＣＵＲＲＥＮＣＹ」等である。位置情報はバイト数で表される。
【００３１】
表示データ生成部１６は、テキストを用いてＨＴＭＬファイルを生成し、また、固有表現情報を用いて固有表現の出現位置に色分けするためにタグを付ける。例えば、「＜ＦＯＮＴ　ＣＯＬＯＲ＝”色コード”＞」と「＜／ＦＯＮＴ＞」とで色分けする固有表現部分を囲む。固有表現の色分けはカテゴリ毎に決められており、その情報が表示制御データ格納部１７に記憶されている。表示制御データ格納部１７に記憶されている表示制御データは例えば図４に示すようなものである。表示データ生成部１６は、表示制御部１８の制御の下で、表示制御データ格納部１７の表示制御データを参照して色分け用のタグをＨＴＭＬファイルに付加する。
【００３２】
このように生成された表示データに基づいて表示部１９に図５に示すような表示が行われる。図５の例では、「ＣＯＭＰＡＮＹ」（企業名）のカテゴリの固有表現が「赤」（矢印Ａで示す）で表示され、「ＰＥＲＳＯＮ」（人名）のカテゴリの固有表現が「青」（矢印Ｂで示す）で表示され、「ＣＵＲＲＥＮＣＹ」（通貨）のカテゴリの固有表現が「紫」（矢印Ｃで示す）で表示され、「ＤＡＴＥ」（日時）のカテゴリの固有表現が「緑」（矢印Ｄで示す）で表示される。表示画面（右側の主たるフレーム）の上部には、各カテゴリのボタンが表示され、これによりカテゴリ毎の色分けのオンオフを制御していもよい。
【００３３】
なお、図５の左側のフレームの「ツリー表示」のボタンは後に詳述するように固有表現の階層を表現するためのものであり、「要約表示」は、要約を表示するためのものである。テキストから自動的に要約を生成してもよい。「ダウンロード」ボタンは、文書やテキストを利用者の端末等にダウンロードするためのものである。
【００３４】
図５の状態で、左側の「ツリー表示」のボタンを操作すると、図６に示すように、固有表現のツリーが表示される。このツリー構造は、一番上の階層（ルートの次。ルートは表示しない）は、「ＣＯＭＰＡＮＹ」等の各カテゴリの名前であるである。次の階層が各カテゴリに属する個別の固有表現、例えば「富士ゼロックス」（商標）である。次の階層（最後の階層）は、その固有表現が出現したページ番号である。図６の例では１ページのみの文書であったのでページ番号は現れないが、文書に複数のページがある場合には、固有表現の「＋」を操作すれば、対応するページ番号がその次の階層として現れる。
【００３５】
ツリー構造の情報は、図３に示したような固有表現情報から生成できる。例えば、各エンティティにページ番号の情報を付加すればよい。
【００３６】
図６の例では、左側のフレーム中のツリー構造内の個々のカテゴリ名や、個々の固有表現を操作することにより、色分け対象を指定できる。例えば、これらカテゴリ名や個々の固有表現にそれらの色分けを択一的にイネーブルするための記号を埋め込んでおき、クリック等の操作により、ウェブサーバ側のＣＧＩやサーブレットに記号を送り、色分けのイネーブルを切り換えるようにしてもよい。ウェブページにオブジェクトを組み込んでもよい。
【００３７】
図７は、文書が複数ページある場合の表示例を示している。この場合、右下のフレームのページ選択部のページ番号をクリック操作等してページの切り替えを行うようにできる。
【００３８】
つぎに、固有表現抽出部１４の詳細な動作について説明する。この例では、テキストなす入力文字列に対して形態素解析を行い、その結果に対して固有表現抽出ルールを適用して、固有表現を抽出する。
【００３９】
例えば、図８に示すような入力例を考える。形態素解析に用いる辞書は例えば図９に示すようなものである。形態素解析では、単語の接続の可否を規定する接続可否テーブルを用いて、連続する単語を品詞情報から切り出していく。ここでは、接続可否テーブルの説明は省略する。形態素解析の解析結果を図１０に示す。「／」は形態素間の区切りを示す。「＜＞」の中には形態素解析から得た品詞情報が記述される。各形態素の開始位置、長さも記述されるが、図では省略されている。
【００４０】
固有表現抽出ルールは図１１に示すようなものである。このルールを適用して図１２に示すような固有表現情報を取得する。
【００４１】
この実施例では、固有表現をカテゴリ毎に色分けして表示するので一瞥して文書やその一部の内容の大凡を把握でき、必要に応じて熟読すればよい。また、出現している固有表現をカテゴリ毎に階層表示しているので、出現する固有表現はツリー表示だけで把握でき、しかも、色分けしたい項目（ツリー構造のノード、個々のカテゴリや個々の固有表現）をクリック等操作して注目しているもののみ色分け表示することができ、大変に便利である。また、ページについても階層表示され、また、ページ選択を行うこともできるので、注目したい項目があれば、直ちにその項目がページを表示でき、しかも、注目項目のみ色分け表示できるので、適切な箇所に直ちに到達して、内容を即座に把握できる。
【００４２】
［第２の実施例］
つぎに、第２の実施例について説明する。この実施例は第１の実施例に関連文書検索機能を付加したものである。
【００４３】
図１３は第２の実施例の文書処理装置を示しており、この図において、図１と対応する箇所には対応する符号を付した。図１３において、索引生成部２０は、テキスト格納部１３のページ毎のテキストに関して関連文書検索用の索引データを作成する。ページ毎のテキストは、先に述べたように文書名とページを表すサフィックスとを連結したファイル名を有する。索引は、例えば、固有表現抽出部１４により抽出した固有表現から構成する。索引名とページファイルとが関連づけられる。索引データは索引格納部２１に格納される。検索部２２は、利用者からのキーワードを受け取って索引格納部２１を参照して検索を行い、検索結果を表示制御部１８に渡し、この結果、検索結果の表示データが図１４に示すように表示データ生成部１６により生成され、表示部１９に表示される。検索結果はページ単位である。したがって該当するページを選択すると図５や図６のような表示が行われる。ページ単位で検索結果が表示されるので目的の箇所を直ちに把握できる。
【００４４】
他の構成は第１の実施例と同様であるので説明を省略する。
【００４５】
なお、この発明は上述の実施例に限定されるものではなくその趣旨を逸脱しない範囲で種々変更が可能である。例えば、固有表現をどのようなものに選定するかを文書処理の内容に依存し、この実施例で説明したものに限定されない。要するに文書を把握する上で重要な語句であればよい。また固有表現抽出も上述の実施例の例に限定されない。また、固有表現等をハイライトするために色分けを行ったが、ブリンク属性としたり、下線表示としたり、種々の表示属性を用いることができる。
【００４６】
【発明の効果】
以上説明したように、この発明によれば、文書または文書セグメント中の重要な語句をカテゴリごとの色分け等により即座に把握できるようにしたので、文書または文書セグメントの大凡を簡易に確認できる。また、文書セグメント単位で検索を行うようにすれば確認する範囲を絞り込むことができる。
【図面の簡単な説明】
【図１】この発明の第１の実施例の構成を示すブロック図である。
【図２】上述実施例の入力テキストの例を説明する図である。
【図３】上述実施例の固有表現抽出結果の例を説明する図である。
【図４】上述実施例の表示制御データの例を説明する図である。
【図５】上述実施例の文書の表示例を説明する図である。
【図６】上述実施例のツリー表示の例を説明する図である。
【図７】上述実施例の複数ページの文書を表示する例を説明する図である。
【図８】上述実施例の固有表現抽出の動作を説明する図である。
【図９】上述実施例の固有表現抽出の動作を説明する図である。
【図１０】上述実施例の固有表現抽出の動作を説明する図である。
【図１１】上述実施例の固有表現抽出の動作を説明する図である。
【図１２】上述実施例の固有表現抽出の動作を説明する図である。
【図１３】この発明の第２の実施例の構成を示すブロック図である。
【図１４】第２の実施例の検索結果の表示例を示す図である。
【符号の説明】
１１　　　文書格納部
１２　　　テキスト抽出部
１３　　　テキスト格納部
１４　　　固有表現抽出部
１５　　　固有表現格納部
１６　　　表示データ生成部
１７　　　表示制御データ格納部
１８　　　表示制御部
１９　　　表示部
２０　　　索引生成部
２１　　　索引格納部
２２　　　検索部
[0001]
[Technical Field of the Invention]
The present invention relates to a technique for supporting the reading and comprehension of text, and in particular to one in which important words and phrases are classified into categories and displayed with a different display attribute for each category.
[0002]
[Background Art]
As documents are digitized in large numbers, an enormous volume of documents has become available. Users need to judge whether these documents are useful, whether they suit the purpose at hand, and so on. Whether the user reads the document itself or a summary created from it, a human ultimately has to read and comprehend the document or its summary in order to judge its value. It is therefore desirable to make this reading work easy.
[0003]
The present inventors have conducted intensive research and have developed a support technology for reading a large amount of electronic documents relatively easily and at high speed.
[0004]
Specifically, the inventors found that when occurrences of 5W1H words (who, what, when, where, why, how), or of similarly important words, are highlighted while a user reads a document, the user can judge relatively easily whether a document is relevant even when there are a large number of documents. In the following, proper nouns such as person names, place names, and organization names, as well as words corresponding to 5W1H such as dates, times, and prices (currency), may be referred to as named entities. The inventors also found that dividing such named entities into categories and displaying each category differently is extremely effective. Named entities in the category corresponding to Who include person names and organization names (company names); those in the category corresponding to Where include place names; and those in the category corresponding to When include date and time expressions.
[0005]
When displaying text, it is desirable to color-code occurrences of important words (named entities) according to their category, and to display the hit locations from a category list display.
[0006]
The following is known as a prior art related to the present invention.
(1) A search system that displays occurrences of the search keys in a single color (Namazu, a full-text search system developed by the Namazu project) and a search system that displays each search key in a different color (Google, a trademark) are known.
(2) Japanese Patent Laying-Open No. 2000-99526 discloses an apparatus that, when displaying information extracted from document data in a limited display area, scrolls at a speed that is easy for the user to read. It mentions documents containing proper nouns as an extraction target.
[0007]
When occurrences of the search key are displayed in a different color, as in prior art (1), the category of each word cannot be identified at a glance; a human must read carefully to determine it.
[0008]
Likewise, when documents containing proper nouns are extracted and displayed, as in prior art (2), nothing is said about distinguishing the display according to category, so a human still had to judge what kind of word each one was.
[0009]
Reading through an entire document to determine whether it is relevant is inefficient. A conventional search engine retrieves whole documents, so when a document consists of multiple pages, every page had to be looked through. It is also desirable to make it sufficient to read only a more limited range.
[0010]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and its object is to provide a technique for displaying occurrences of important words (named entities) contained in a document with display attributes corresponding to their categories. Another object of the present invention is to provide a technique for limiting the range that has to be read. The present invention is not limited to these objects. Other objects of the present invention will be understood from the following description.
[0011]
[Means for Solving the Problems]
According to the present invention, in order to achieve the above object, the configurations described in the claims are adopted. Before the invention is described in detail, a supplementary explanation of the claims is given here.
[0012]
That is, according to one aspect of the present invention, in order to achieve the above object, a document display device is provided with: means for extracting, from a document, phrases having any attribute included in a group of a plurality of predetermined attributes; means for associating each extracted phrase with the display attribute corresponding to its attribute; and document display means for displaying at least a part of the document and displaying the phrases contained in that part with their corresponding display attributes.
[0013]
In this configuration, occurrences of a phrase can be highlighted with a display attribute corresponding to the phrase's attribute. The user can therefore take in just those parts at once, grasp the general content, decide whether the document needs to be read closely, and skip documents that are not worth close reading.
[0014]
The group of predetermined attributes covers phrases that are important for understanding the context, such as proper nouns (the 5W1H categories), or phrases that are equally important. Which phrases qualify depends on the content and purpose of the document processing and is not limited.
[0015]
According to another aspect of the present invention, a document display device is provided with: means for extracting, from a document, phrases having any attribute included in a group of a plurality of predetermined attributes; phrase display means for classifying the extracted phrases by attribute and displaying them; means for selecting, based on a user operation, an extracted phrase or an attribute of the extracted phrases; and document display means for displaying at least a part of the document and displaying, with the corresponding display attribute, the selected phrase or the phrases having the selected attribute that are contained in that part.
[0016]
In this configuration, the extracted phrases are classified by attribute and displayed, so the general skeleton of the document can be grasped from that display. When necessary, a phrase in the skeleton can be selected and highlighted in the document so that its content can be checked in its detailed context.
[0017]
According to still another aspect of the present invention, a document search device is provided with: means for storing a plurality of documents; means for dividing a document into document segments; means for extracting an index for each of the document segments; index storage means for storing the association between the document segments and the indexes; search means for performing a keyword search in units of document segments by referring to the index storage means; and means for displaying the search results of the search means.
[0018]
In this configuration, the search is performed in units of document segments rather than whole documents, so the contexts that match the keyword can be narrowed down and reviewed. This removes the burden, present in conventional systems, of looking through every page of a hit document.
[0019]
The document segment is, for example, a page. In addition, any document component such as a chapter or a section may be used.
[0020]
Phrases having any attribute included in a group of a plurality of predetermined attributes may be extracted from a document segment and used as the index. When a hit document segment is displayed, the extracted phrases may be highlighted with display attributes corresponding to their attributes.
[0021]
The present invention can be realized not only as a device or a system but also as a method. In addition, it goes without saying that a part of such an invention can be configured as software. Also, it goes without saying that a software product used for causing a computer to execute such software is also included in the technical scope of the present invention.
[0022]
The above aspects of the present invention and other aspects of the present invention are set forth in the following claims, and will be described in detail below with reference to embodiments.
[0023]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described.
[0024]
[First Embodiment]
FIG. 1 shows a first embodiment in which the present invention is applied to a document processing apparatus suitable for browsing and reading a large number of documents stored in a storage unit such as a file server. In this figure, the document processing apparatus includes a document storage unit 11, a text extraction unit 12, a text storage unit 13, a named entity extraction unit 14, a named entity storage unit 15, a display data generation unit 16, a display control data storage unit 17, a display control unit 18, a display unit 19, and the like.
[0025]
The document storage unit 11 stores the documents to be read. It only needs to allow one document at a time to be specified and retrieved, and it may store a plurality of documents. The text extraction unit 12 extracts the content text from each page of a document to be read. The extracted text is stored in the text storage unit 13 as one file per page. As described later, it is preferable to append a suffix indicating the page number to the document name so that it is clear which page of which document a file holds (for example, “foo_01”).
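The per-page naming scheme can be illustrated with a minimal Python sketch. The helper names (`page_file_name`, `store_pages`, `parse_page_file_name`), the two-digit zero padding, and the UTF-8 encoding are assumptions made for illustration; the source only gives the example name “foo_01”.

```python
from pathlib import Path
import tempfile


def page_file_name(document_name: str, page_number: int, width: int = 2) -> str:
    """Build a per-page name such as 'foo_01' (zero-padding width is an assumption)."""
    return f"{document_name}_{page_number:0{width}d}"


def parse_page_file_name(name: str) -> tuple[str, int]:
    """Split 'foo_01' back into ('foo', 1); the last underscore separates the suffix."""
    document_name, _, suffix = name.rpartition("_")
    return document_name, int(suffix)


def store_pages(document_name: str, pages: list[str], out_dir: Path) -> list[Path]:
    """Write each page's text to its own file and return the paths in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        path = out_dir / page_file_name(document_name, number)
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for p in store_pages("foo", ["first page text", "second page text"], Path(tmp)):
            print(p.name, parse_page_file_name(p.name))
```

Splitting on the last underscore keeps document names that themselves contain underscores intact.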
[0026]
The named entity extraction unit 14 receives a text file for each page from the text storage unit 13 and extracts named entities included therein. In this example, morphological analysis and a predetermined extraction rule are used. This will be described later with an example. The named entity extraction unit 14 extracts named entities and outputs named entity information including the category, named entity character string, position information, and the like. The named entity information is stored in the named entity storage unit 15.
[0027]
The display control data storage unit 17 stores the color to be displayed for each category. The display control unit 18 performs various kinds of control related to display. For example, it controls which categories have color coding enabled and what display form is used. The display data generation unit 16 generates the display data to be displayed on the display unit 19 under the control of the display control unit 18. For example, it generates data in HTML format (XML or the like is also acceptable) that includes color control, page selection, and frames.
[0028]
The display unit 19 displays pages of a document and the like. In doing so, it color-codes named entity occurrences according to their category. The display unit 19 is, for example, a so-called web browser that rasterizes HTML files and the like.
[0029]
Next, the operation of this embodiment will be described with a specific example.
[0030]
The text for one page extracted by the text extraction unit 12 is, for example, as shown in FIG. 2. The named entity extraction unit 14 extracts named entity information such as that shown in FIG. 3 from this text. The extraction result is described in, for example, XML, but is not limited to this. Each piece of named entity information is described using an “entity” tag and includes a character string, a category, and position information (the offset from the beginning of the text and the length of the phrase). The character string is, for example, “Fuji Xerox” (trademark), “Yamada”, or “100 dollars”. The categories are “COMPANY”, “PERSON”, “CURRENCY”, and the like. The position information is expressed in bytes.
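A record of this kind might be serialized as in the following Python sketch. Only the “entity” tag name, the three fields, and the byte-based position come from the description; the attribute names (`category`, `offset`, `length`), the `entities` root element, and the UTF-8 encoding are assumptions, since FIG. 3 is not reproduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str      # named entity character string
    category: str  # e.g. COMPANY, PERSON, CURRENCY
    offset: int    # byte offset from the beginning of the page text
    length: int    # length of the phrase in bytes


def make_entity(page_text: str, phrase: str, category: str,
                encoding: str = "utf-8") -> Entity:
    """Locate the first occurrence of phrase and record its byte position."""
    data = page_text.encode(encoding)
    needle = phrase.encode(encoding)
    offset = data.find(needle)
    if offset < 0:
        raise ValueError(f"{phrase!r} does not occur in the page text")
    return Entity(phrase, category, offset, len(needle))


def entities_to_xml(entities: list[Entity]) -> str:
    """Serialize entities as <entity> elements (attribute names are hypothetical)."""
    root = ET.Element("entities")
    for e in entities:
        node = ET.SubElement(root, "entity", category=e.category,
                             offset=str(e.offset), length=str(e.length))
        node.text = e.text
    return ET.tostring(root, encoding="unicode")


def entities_from_xml(xml_text: str) -> list[Entity]:
    """Parse the XML produced by entities_to_xml back into Entity records."""
    root = ET.fromstring(xml_text)
    return [Entity(n.text or "", n.get("category", ""),
                   int(n.get("offset", "0")), int(n.get("length", "0")))
            for n in root.iter("entity")]


if __name__ == "__main__":
    page = "Yamada of Fuji Xerox paid 100 dollars."
    found = [make_entity(page, "Yamada", "PERSON"),
             make_entity(page, "Fuji Xerox", "COMPANY")]
    print(entities_to_xml(found))
```

Because the positions are byte counts, the same phrase has different offsets under different encodings, so the encoding used for extraction has to match the one used when the text is later sliced.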
[0031]
The display data generation unit 16 generates an HTML file from the text and, using the named entity information, attaches tags at the positions where named entities occur in order to color-code them. For example, the named entity portion to be color-coded is enclosed between “<FONT COLOR="color code">” and “</FONT>”. The color for each named entity is determined per category, and that information is stored in the display control data storage unit 17. The display control data stored in the display control data storage unit 17 is, for example, as shown in FIG. 4. Under the control of the display control unit 18, the display data generation unit 16 refers to the display control data in the display control data storage unit 17 and adds the color-coding tags to the HTML file.
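The tagging step could look roughly like the Python sketch below. The function name `highlight`, the tuple layout of the entity records, the UTF-8 byte offsets, and the optional `enabled` set (standing in for the per-category on/off control) are assumptions for illustration; the color table mirrors the colors the description gives for FIG. 5 rather than the actual content of FIG. 4.

```python
import html

# Category-to-color table; a stand-in for the display control data (assumed form).
CATEGORY_COLORS = {
    "COMPANY": "red",
    "PERSON": "blue",
    "CURRENCY": "purple",
    "DATE": "green",
}


def highlight(page_text, entities, colors=CATEGORY_COLORS, enabled=None,
              encoding="utf-8"):
    """Wrap each entity occurrence in <FONT COLOR="..."> ... </FONT>.

    entities: iterable of (byte_offset, byte_length, category) tuples.
    enabled:  optional set of categories to color; None means all categories.
    """
    data = page_text.encode(encoding)
    parts = []
    cursor = 0
    for offset, length, category in sorted(entities):
        color = colors.get(category)
        if offset < cursor or color is None:
            continue  # skip overlapping spans and categories without a color
        if enabled is not None and category not in enabled:
            continue  # color coding is switched off for this category
        parts.append(html.escape(data[cursor:offset].decode(encoding)))
        phrase = html.escape(data[offset:offset + length].decode(encoding))
        parts.append(f'<FONT COLOR="{color}">{phrase}</FONT>')
        cursor = offset + length
    parts.append(html.escape(data[cursor:].decode(encoding)))
    return "".join(parts)


if __name__ == "__main__":
    text = "Yamada of Fuji Xerox paid 100 dollars."
    spans = [(0, 6, "PERSON"), (10, 10, "COMPANY"), (26, 11, "CURRENCY")]
    print(highlight(text, spans))
    print(highlight(text, spans, enabled={"PERSON"}))
```

Escaping the plain text with `html.escape` before inserting tags keeps characters such as `<` in the page text from being interpreted as markup.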
[0032]
A display such as that shown in FIG. 5 is produced on the display unit 19 based on the display data generated in this way. In the example of FIG. 5, named entities in the “COMPANY” (company name) category are displayed in red (indicated by arrow A), those in the “PERSON” (person name) category in blue (arrow B), those in the “CURRENCY” (currency) category in purple (arrow C), and those in the “DATE” (date and time) category in green (arrow D). Buttons for the individual categories are displayed at the top of the display screen (the main frame on the right), and these may be used to switch color coding on and off for each category.
[0033]
Note that the "tree display" button in the left frame of FIG. 5 is for presenting the hierarchy of named entities, as described in detail later, and the "summary display" button is for displaying a summary. The summary may be generated automatically from the text. The “download” button is for downloading the document or text to the user's terminal or the like.
[0034]
When the "tree display" button on the left is operated in the state of FIG. 5, a tree of named entities is displayed as shown in FIG. 6. In this tree structure, the top level (directly below the root, which is not displayed) holds the names of the categories, such as “COMPANY”. The next level holds the individual named entities belonging to each category, for example “Fuji Xerox” (trademark). The level after that (the last level) holds the numbers of the pages on which the named entity appears. In the example of FIG. 6 the document has only one page, so no page numbers appear; when a document has multiple pages, operating the “+” of a named entity makes the corresponding page numbers appear as the next level.
[0035]
The tree structure information can be generated from named entity information such as that shown in FIG. 3. For example, page number information may be added to each entity.
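One way to derive the three-level tree (category, named entity, page number) from entity records that carry a page number is sketched below in Python. The record layout `(category, text, page_number)`, the alphabetical ordering, and the text rendering are assumptions for illustration, not the layout of FIG. 6.

```python
from collections import defaultdict


def build_tree(records):
    """Group (category, text, page_number) records into
    {category: {entity text: sorted list of page numbers}}."""
    tree = defaultdict(lambda: defaultdict(set))
    for category, text, page in records:
        tree[category][text].add(page)
    return {
        category: {text: sorted(pages) for text, pages in sorted(entities.items())}
        for category, entities in sorted(tree.items())
    }


def render_tree(tree):
    """Render the tree as indented text; the root itself is not shown."""
    lines = []
    for category, entities in tree.items():
        lines.append(category)
        for text, pages in entities.items():
            lines.append(f"  + {text}")
            for page in pages:
                lines.append(f"      page {page}")
    return "\n".join(lines)


if __name__ == "__main__":
    records = [
        ("COMPANY", "Fuji Xerox", 1),
        ("PERSON", "Yamada", 1),
        ("COMPANY", "Fuji Xerox", 3),
    ]
    print(render_tree(build_tree(records)))
```

Using a set per entity means repeated occurrences on the same page produce a single page node.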
[0036]
In the example of FIG. 6, the color-coding target can be specified by operating an individual category name or an individual named entity in the tree structure in the left frame. For example, a symbol for selectively enabling color coding may be embedded in each category name and each named entity, and an operation such as a click may send that symbol to a CGI program or servlet on the web server, which switches which items have color coding enabled. Alternatively, an object may be embedded in the web page.
[0037]
FIG. 7 shows a display example when a document has a plurality of pages. In this case, the page can be switched by clicking the page number in the page selection section of the lower right frame.
[0038]
Next, the operation of the named entity extraction unit 14 is described in detail. In this example, morphological analysis is performed on the input character string that makes up the text, and named entities are extracted by applying named entity extraction rules to the result.
[0039]
For example, consider an input such as that shown in FIG. 8. The dictionary used for the morphological analysis is, for example, as shown in FIG. 9. The morphological analysis uses a connectability table, which defines whether words may be connected, to cut out consecutive words on the basis of part-of-speech information. A description of the connectability table is omitted here. FIG. 10 shows the result of the morphological analysis. “/” indicates a boundary between morphemes. The part-of-speech information obtained from the morphological analysis is written inside “<>”. The start position and length of each morpheme are also recorded but are omitted from the figure.
[0040]
The named entity extraction rules are, for example, as shown in FIG. 11. Applying these rules yields named entity information such as that shown in FIG. 12.
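Since FIGS. 9 to 11 are not reproduced here, the following Python sketch only illustrates the general idea of applying pattern rules to a part-of-speech-tagged morpheme sequence. The part-of-speech labels, the rule format (a sequence of labels mapped to a category), and the first-match strategy are all hypothetical, and the morphological analysis itself is replaced by hand-tagged input. Positions are counted in characters here, not bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str  # the morpheme's text
    pos: str      # part-of-speech label (hypothetical label set)
    start: int    # character offset from the beginning of the text


def tag(pairs):
    """Turn (surface, pos) pairs into Morphemes with running character offsets."""
    morphemes, position = [], 0
    for surface, pos in pairs:
        morphemes.append(Morpheme(surface, pos, position))
        position += len(surface)
    return morphemes


# Hypothetical rules: a sequence of POS labels maps to a category.
# Earlier rules take priority over later ones.
RULES = [
    (("NUMBER", "CURRENCY_UNIT"), "CURRENCY"),
    (("ORG_NAME",), "COMPANY"),
    (("PERSON_NAME",), "PERSON"),
]


def extract(morphemes, rules=RULES):
    """Scan left to right; at each position apply the first rule that matches."""
    results, i = [], 0
    while i < len(morphemes):
        for pattern, category in rules:
            window = morphemes[i:i + len(pattern)]
            if tuple(m.pos for m in window) == pattern:
                text = "".join(m.surface for m in window)
                results.append((text, category, window[0].start, len(text)))
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched at this position
    return results


if __name__ == "__main__":
    sentence = tag([("山田", "PERSON_NAME"), ("は", "PARTICLE"),
                    ("100", "NUMBER"), ("ドル", "CURRENCY_UNIT"),
                    ("払った", "VERB")])
    for entity in extract(sentence):
        print(entity)
```

A real rule set would likely also use surface strings and context conditions; matching on part-of-speech sequences alone is the simplest form that shows how adjacent morphemes are merged into one entity.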
[0041]
In this embodiment, named entities are displayed in a different color for each category, so the general content of a document or a part of it can be grasped at a glance, and the document can then be read closely if necessary. In addition, because the named entities that appear are displayed hierarchically by category, they can be surveyed from the tree display alone. By clicking or otherwise operating the items to be color-coded (nodes of the tree structure, that is, individual categories or individual named entities), only the items of interest are color-coded, which is very convenient. Pages are also shown in the hierarchy and can be selected, so when there is an item of interest, the page containing it can be displayed immediately with only that item color-coded. The user can thus reach the appropriate location at once and grasp its content immediately.
[0042]
[Second embodiment]
Next, a second embodiment will be described. In this embodiment, a related document search function is added to the first embodiment.
[0043]
FIG. 13 shows the document processing apparatus of the second embodiment; in this figure, parts corresponding to those in FIG. 1 are given corresponding reference numerals. In FIG. 13, the index generation unit 20 creates index data for related document search from the per-page text in the text storage unit 13. As described above, the text of each page has a file name formed by concatenating the document name and a suffix indicating the page. The index is composed of, for example, the named entities extracted by the named entity extraction unit 14. Each index name is associated with the page files. The index data is stored in the index storage unit 21. The search unit 22 receives a keyword from the user, performs a search by referring to the index storage unit 21, and passes the search result to the display control unit 18. The display data generation unit 16 then generates display data for the search result, as shown in FIG. 14, and it is displayed on the display unit 19. Search results are in units of pages, so selecting a hit page produces a display such as that in FIG. 5 or FIG. 6. Because the search results are displayed page by page, the target location can be found immediately.
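The association between index names and page files can be pictured as a small inverted index, as in this Python sketch. The function names, the dictionary-based storage, and the exact-match lookup are assumptions for illustration; the description does not specify how the index storage unit 21 is organized or how keywords are matched.

```python
from collections import defaultdict


def build_index(page_entities):
    """Map each named entity to the page files that contain it.

    page_entities: {page file name: iterable of entity strings}
    returns:       {entity string: sorted list of page file names}
    """
    index = defaultdict(set)
    for page_file, entities in page_entities.items():
        for entity in entities:
            index[entity].add(page_file)
    return {entity: sorted(files) for entity, files in index.items()}


def search(index, keyword):
    """Return the page files whose index contains the keyword (exact match)."""
    return index.get(keyword, [])


if __name__ == "__main__":
    pages = {
        "foo_01": ["Fuji Xerox", "Yamada"],
        "foo_02": ["Yamada"],
        "bar_01": ["Fuji Xerox"],
    }
    index = build_index(pages)
    print(search(index, "Yamada"))      # hits are pages, not whole documents
    print(search(index, "Fuji Xerox"))
```

Because each hit is a page file name rather than a document name, the document and page number can be recovered from the name and the matching page opened directly.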
[0044]
The other configuration is the same as that of the first embodiment, and the description is omitted.
[0045]
It should be noted that the present invention is not limited to the above-described embodiments, and various changes can be made without departing from its gist. For example, which kinds of named entities are selected depends on the content of the document processing and is not limited to those described in these embodiments. In short, any words or phrases that are important for grasping a document may be used. Named entity extraction is likewise not limited to the example in the above embodiments. Although color coding was used to highlight named entities and the like, various other display attributes, such as a blink attribute or underlining, can be used.
[0046]
[Effects of the Invention]
As described above, according to the present invention, important words and phrases in a document or a document segment can be grasped immediately through, for example, color coding by category, so the gist of the document or document segment can be checked easily. If the search is performed in units of document segments, the range to be checked can also be narrowed down.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of an input text according to the embodiment.
FIG. 3 is a diagram illustrating an example of a named entity extraction result according to the embodiment.
FIG. 4 is a diagram illustrating an example of display control data according to the embodiment.
FIG. 5 is a diagram illustrating a display example of a document according to the above embodiment.
FIG. 6 is a diagram illustrating an example of a tree display according to the above embodiment.
FIG. 7 is a diagram illustrating an example of displaying a document of a plurality of pages according to the embodiment.
FIG. 8 is a diagram illustrating an operation of extracting a named entity in the above embodiment.
FIG. 9 is a diagram illustrating an operation of extracting a named entity according to the embodiment.
FIG. 10 is a diagram illustrating an operation of extracting a named entity in the embodiment.
FIG. 11 is a diagram illustrating an operation of extracting a named entity in the embodiment.
FIG. 12 is a diagram illustrating an operation of extracting a named entity according to the embodiment.
FIG. 13 is a block diagram showing a configuration of a second exemplary embodiment of the present invention.
FIG. 14 is a diagram illustrating a display example of a search result according to the second embodiment.
[Explanation of symbols]
11 Document storage unit
12 Text extraction unit
13 Text storage unit
14 Named entity extraction unit
15 Named entity storage unit
16 Display data generation unit
17 Display control data storage unit
18 Display control unit
19 Display unit
20 Index generation unit
21 Index storage unit
22 Search unit

Claims

1. A document display device comprising:
means for extracting, from a document, a phrase having any attribute included in a group of a plurality of predetermined attributes;
means for associating each extracted phrase with a display attribute corresponding to its attribute; and
document display means for displaying at least a part of the document and displaying the phrases included in that part with the corresponding display attributes.

2. A document display device comprising:
means for dividing a document into document segments;
means for extracting, from each of the document segments, a phrase having any attribute included in a group of a plurality of predetermined attributes; and
document segment display means for displaying at least a part of a document segment and displaying the phrases included in that part with the corresponding display attributes.

3. The document display device according to claim 2, wherein said document segment is a page constituting a document.

4. The document display device according to claim 1, further comprising:
phrase display means for classifying the extracted phrases by attribute and displaying them; and
means for selecting an extracted phrase, or an attribute of an extracted phrase, based on a user operation,
wherein the document display means displays at least a part of the document and displays the selected phrase, or the phrases having the selected attribute, included in that part with the corresponding display attribute.

5. A document display device comprising:
means for extracting, from a document, phrases each having any attribute included in a plurality of predetermined attribute groups;
phrase display means for classifying the extracted phrases by attribute and displaying them;
means for selecting an extracted phrase, or an attribute of an extracted phrase, based on a user operation; and
document display means for displaying at least a part of the document and displaying the selected phrase, or the phrases having the selected attribute, included in that part with the corresponding display attribute.

6. A document display device comprising:
means for dividing a document into document segments;
means for extracting, from each of the document segments, phrases each having any attribute included in a plurality of predetermined attribute groups;
phrase display means for classifying the extracted phrases by attribute and displaying them;
means for selecting an extracted phrase, or an attribute of an extracted phrase, based on a user operation; and
document segment display means for displaying at least a part of a document segment and displaying the selected phrase, or the phrases having the selected attribute, included in that part with the corresponding display attribute.

7. A retrieval apparatus comprising:
means for storing a plurality of documents;
means for dividing each document into document segments;
means for extracting an index for each of the document segments;
index storage means for storing an association between each document segment and its index;
search means for performing a keyword search in units of document segments by referring to the index storage means; and
means for displaying a search result produced by the search means.

8. The retrieval apparatus according to claim 7, further comprising:
means for extracting, from each of the document segments, phrases each having any attribute included in a plurality of predetermined attribute groups; and
document segment display means for displaying at least a part of a document segment and displaying the phrases included in that part with their corresponding display attributes.
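
As an informal illustration of the segment-level index and keyword search recited in claims 7 and 8, the following is a minimal sketch only. It assumes that document segments are pages separated by a form-feed character and that the index is a plain inverted index of lowercase words; the names `split_into_segments`, `build_index`, and `search` are hypothetical and do not come from the specification.

```python
from __future__ import annotations

import re
from collections import defaultdict


def split_into_segments(document: str) -> list[str]:
    """Divide a document into segments (assumption: pages split on form feed)."""
    return document.split("\f")


def build_index(documents: dict[str, str]) -> dict[str, set[tuple[str, int]]]:
    """Associate each index term with the (document id, page number) pairs containing it."""
    index: dict[str, set[tuple[str, int]]] = defaultdict(set)
    for doc_id, text in documents.items():
        for page_no, segment in enumerate(split_into_segments(text), start=1):
            # Assumption: index terms are simply the lowercase words of the segment.
            for term in re.findall(r"\w+", segment.lower()):
                index[term].add((doc_id, page_no))
    return index


def search(index: dict[str, set[tuple[str, int]]], keyword: str) -> list[tuple[str, int]]:
    """Return the matching segments, not whole documents, in sorted order."""
    return sorted(index.get(keyword.lower(), set()))


documents = {
    "a": "color display\fkeyword search",
    "b": "search index",
}
index = build_index(documents)
hits = search(index, "search")  # page 2 of document "a" and page 1 of document "b"
```

Because hits are returned per segment, a result list can point directly at the page where the keyword appears.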

9. A document display method comprising:
a step of extracting, from a document, phrases each having any attribute included in a plurality of predetermined attribute groups;
a step of associating each extracted phrase with a display attribute corresponding to its attribute; and
a display step of displaying at least a part of the document and displaying the phrases included in that part with their corresponding display attributes.

10. A computer program for causing a computer to execute:
a step of extracting, from a document, phrases each having any attribute included in a plurality of predetermined attribute groups;
a step of associating each extracted phrase with a display attribute corresponding to its attribute; and
a step of displaying at least a part of the document and displaying the phrases included in that part with their corresponding display attributes.

11. A document display method comprising:
a step of extracting, from a document, phrases each having any attribute included in a plurality of predetermined attribute groups;
a phrase display step of classifying the extracted phrases by attribute and displaying them;
a step of selecting an extracted phrase, or an attribute of an extracted phrase, based on a user operation; and
a document display step of displaying at least a part of the document and displaying the selected phrase, or the phrases having the selected attribute, included in that part with the corresponding display attribute.

12. A computer program for displaying a document, the program causing a computer to execute:
a step of extracting, from a document, phrases each having any attribute included in a plurality of predetermined attribute groups;
a phrase display step of classifying the extracted phrases by attribute and displaying them;
a step of selecting an extracted phrase, or an attribute of an extracted phrase, based on a user operation; and
a document display step of displaying at least a part of the document and displaying the selected phrase, or the phrases having the selected attribute, included in that part with the corresponding display attribute.
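
The extract, associate, and display steps recited in the claims above can be illustrated informally as follows. This is a minimal sketch under stated assumptions, not the extraction method of the embodiment: a small hypothetical gazetteer (`GAZETTEER`) stands in for the named entity extraction, the category-to-color table (`DISPLAY_ATTRIBUTES`) stands in for the display control data, and color is represented by simple text tags rather than an actual display.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical display control data: one color per category.
DISPLAY_ATTRIBUTES = {"PERSON": "red", "ORGANIZATION": "blue"}

# Hypothetical gazetteer standing in for the named entity extraction means.
GAZETTEER = {"Yamashita": "PERSON", "Fuji Xerox": "ORGANIZATION"}


@dataclass(frozen=True)
class Phrase:
    text: str
    category: str
    start: int  # character offset where the phrase begins
    end: int    # character offset just past the phrase


def extract_phrases(segment: str) -> list[Phrase]:
    """Find every gazetteer entry in the segment, with category and position."""
    found = []
    for text, category in GAZETTEER.items():
        pos = segment.find(text)
        while pos != -1:
            found.append(Phrase(text, category, pos, pos + len(text)))
            pos = segment.find(text, pos + 1)
    return sorted(found, key=lambda p: p.start)


def render(segment: str, phrases: list[Phrase], selected: set[str] | None = None) -> str:
    """Wrap each phrase in a tag named after its category color.

    If `selected` is given, only phrases in the selected categories are marked,
    mirroring the user-selection step of claims 5 and 11.
    """
    out, cursor = [], 0
    for p in phrases:
        if selected is not None and p.category not in selected:
            continue
        if p.start < cursor:  # skip a phrase that overlaps one already marked
            continue
        color = DISPLAY_ATTRIBUTES[p.category]
        out.append(segment[cursor:p.start])
        out.append(f"<{color}>{p.text}</{color}>")
        cursor = p.end
    out.append(segment[cursor:])
    return "".join(out)


page = "Yamashita works at Fuji Xerox."
phrases = extract_phrases(page)
all_marked = render(page, phrases)
persons_only = render(page, phrases, selected={"PERSON"})
```

Keeping the color table separate from the extraction result means that changing a category's color does not require re-extracting the phrases.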