JP4259179B2

JP4259179B2 - Document analysis method and apparatus, document analysis program, and storage medium storing document analysis program

Info

Publication number: JP4259179B2
Application number: JP2003146322A
Authority: JP
Inventors: 準二富田; 民雄木原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-23
Filing date: 2003-05-23
Publication date: 2009-04-30
Anticipated expiration: 2023-05-23
Also published as: JP2004348555A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書分析方法及び装置及び文書分析プログラム及び文書分析プログラムを格納した記憶媒体に係り、特に、大量の文書を分析し、文書の内容の概観を作成する文書分析方法及び装置及び文書分析プログラム及び文書分析プログラムを格納した記憶媒体に関する。
【０００２】
【従来の技術】
情報技術の進歩によって、人々がアクセスすることができる電子化文書の量は急速に拡大している。しかしながら、人間の情報処理能力は限られているため、大量の文書を１つずつ読むことは事実上不可能となってきている。従って、大量の文書の内容の概観を作成し、人間がその内容を即座に正確に理解することを支援するための技術が必要となってきている。
【０００３】
このような技術を以下に示す。
【０００４】
（１）文書集合が蓄積されたデーターベースから各文書を取得する。
【０００５】
（２）各文書から単語を抽出し、単語をノード、単語間の関連をリンクとしたグラフを作成する。ここで、単語の出現頻度、単語の共出現頻度等を用いて単語の重要度、単語間の関連度を計算し、それぞれ、ノードの重み、リンクの重みとして設定する。
【０００６】
（３）ユーザによって指定されたグラフ操作を実行する。ここで、グラフ操作とは、グラフの集合を引数とし、グラフの集合を結果として出力する関数である。グラフ操作には、類似グラフ検索、類似グラフ分類、部分グラフ抽出、グラフ合成、グラフ差分等がある。
【０００７】
（４）引数として指定されたグラフ集合に対して、グラフ操作を実行し、実行結果であるグラフ集合を出力する。例えば、類似グラフ検索では、引数として検索条件グラフと、検索対象グラフの集合を与えると、検索対象グラフの中で、検索条件グラフに類似している上位ｎ件のグラフが出力として得られる。
【０００８】
（５）グラフ操作によって出力されたグラフを可視化する。
【０００９】
（６）ユーザは、可視化されたグラフを確認し、満足な結果が得られるまで、上記の（３）〜（５）を繰り返す。
【００１０】
以下に、「特許」を対象とした分析の例を用いて上記の処理を説明する。
【００１１】
あるユーザが、「情報システムを、医療分野に適用する」といった内容の特許Ａを書いたときに、特許Ａに関連する他社特許の内容を概観したいとする。まず、特許Ａをグラフに変換する。このグラフをＧＡとする（図１５）。次に、分析対象である各特許をグラフに変換する。ＧＡを検索条件グラフとし、各特許から作成されたグラフを検索対象グラフとして、類似する上位５０件のグラフを取得する。次に、これら５０件のグラフを分類対象のグラフ集合とし、類似グラフ分類操作を実行し、３つのカテゴリに分類する。分類された各カテゴリのグラフをグラフ合成操作によって合成し、可視化を行う。このようにして作成されたグラフの例を図１６に示す。図１６では、文字の大きさが単語の重要度に、リンクの太さが単語間の関連度に対応している。また、各カテゴリに分類された特許の件数と文書ＩＤが併せて表示してある。図１６から各カテゴリに分類された特許の内容は、それぞれ
・カテゴリ１：「医療データを検索管理するためのシステム」
・カテゴリ２：「医療費の控除を計算するためのシステム」
・カテゴリ３：「医療画像を撮影するためのシステム」
であることが読み取れる。即ち、「情報システムを、医療分野に適用する」という内容の特許に対して、カテゴリ１〜３のような内容の関連特許が出願されていることがわかる。
【００１２】
このように、いくつかのグラフ操作を組み合わせ、結果として出力される結果グラフを可視化することによって、その文書集合に含まれる重要な単語がどれなのか、また、各単語と関連の強い単語がどれなのかを即座に判断することができる。そのため、対象となる文書を一つずつ読まなくても、文書や文書集合の内容を即座に把握することができる。
【００１３】
【発明が解決しようとする課題】
前述した従来の技術では、文書集合の中で、特に重要な単語や、それらの単語と関連が強い単語がどれなのかわかる。しかしながら、これらの単語間にどのような関係があるかまではわからないという問題がある。
【００１４】
例えば、図１６のカテゴリ１のグラフには、「患者」「データ」「検索」の間に太いリンクがあるため、これらの単語間には強い関連があることがわかる。しかし、これらの単語間の関係が、以下の（ａ），（ｂ）のどちらであるのかは判断できない。
【００１５】
（ａ）「検索の対象が患者データである」という関係
例文：「医療スタッフが患者データを即座に検索できるシステム」
（ｂ）「患者が何らかのデータを検索する」という関係
例文：「患者自身が、医療データを検索できるシステム」
このように、可視化されたグラフを見ただけでは、単語間にどのような関係があるのかがわからず、文書や文書集合の内容を正確に把握できないという問題がある。
【００１６】
本発明は、上記の点に鑑みなされたもので、単語間の関係がどのようなものであるかを正確に判断でき、大量の文書集合の内容を、即座に正確に把握することが可能な文書分析方法及び装置及び文書分析プログラム及び文書分析プログラムを格納した記憶媒体を提供することを目的とする。
【００１７】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００１８】
本発明（請求項１）は、文書内の文単位の要約表現を表示する文書分析方法において、
分析実行装置が、分析対象の文書を蓄積する分析対象文書データベースから文書を読み出して、当該文書に対して単語をノード、単語間の関連をリンクとした結果グラフを作成し、作成された結果グラフをユーザインタフェースに出力し（ステップ１）、
条件グラフ生成装置が、ユーザインタフェースによって表示された結果グラフからユーザによって選択された１つまたは、複数の単語からなる選択単語リストを取得し、選択単語リストにある単語を含む結果グラフのノードを条件グラフのノードとし、結果グラフ上でこれらのノード間に存在するリンクをリンクとして条件グラフを生成し（ステップ２）、
文単位グラフ生成装置が、分析実行装置から取得した文書ＩＤリストに基づいて、分析対象文書データベースから文書ＩＤで示される文書を取得して、取得した各文書から条件グラフに含まれる単語を１つでも含む文単位を抽出し、抽出した各文単位に対して文単位グラフの生成を行い（ステップ３）、
文単位選択装置が、条件グラフと文単位グラフとの類似度を計算し、類似度が高い所定数の文単位を選択してユーザインタフェースに出力し（ステップ４）、
ユーザインタフェースが、選択された文単位を要約表現として表示する（ステップ５）。
【００２０】
また、本発明（請求項２）は、条件グラフ生成装置において、条件グラフを生成する際に（ステップ２）、
ホップ数および最低関連度を設定し、
グラフのリンクに単語間の関連度が設定された結果グラフ上で、各選択単語から指定されたホップ数以内であり、最低関連度以上の関連度を持つリンクのみを用いて、到達可能なノードが持つ単語を周辺単語として取得し、周辺単語リストとし、
選択単語リストと周辺単語リストとを条件単語リストとし、
条件単語リストの各単語を持つ結果グラフ上のノードを、条件グラフのノードとし、結果グラフ上のこれらのノード間に存在するリンクを条件グラフのリンクとして条件グラフを生成する。
【００２１】
また、本発明（請求項３）は、文書単位グラフ生成装置において、文単位グラフを生成する際に（ステップ３）、
文書ＩＤを用いて分析対象文書データベースから文書を取得し、
文書を、段落、文、文節、連続する文字列、または、連続する単語列からなる文単位に分割し、
条件グラフに含まれる単語を１つでも含む各文単位のみに対して、文単位グラフを生成する。
【００２３】
図２は、本発明の原理構成図である。
【００２４】
本発明（請求項４）は、文書内の文単位の要約表現を表示する文書分析装置であって、
分析対象の文書を蓄積する分析対象データベース６０と、
分析対象文書データベース６０から文書を読み出して、当該文書に対して単語をノード、単語間の関連をリンクとした結果グラフを作成し、作成された結果グラフをユーザインタフェース２０に出力する分析実行手段１０と、
ユーザインタフェース２０によって表示された結果グラフからユーザによって選択された１つまたは、複数の単語からなる選択単語リストを取得し、該選択単語リストにある単語を含む結果グラフのノードを条件グラフのノードとし、結果グラフ上でこれらのノード間に存在するリンクをリンクとして条件グラフを生成する条件グラフ生成手段３０と、
分析実行手段１０から取得した文書ＩＤリストに基づいて、分析対象文書データベースから文書ＩＤで示される文書を取得して、取得した各文書から条件グラフに含まれる単語を１つでも含む文単位を抽出し、抽出した各文単位に対して文単位グラフの生成を行う文単位グラフ生成手段４０と、
条件グラフと文単位グラフとの類似度を計算し、類似度が高い所定数の文単位を選択してユーザインタフェースに出力する文単位選択手段５０と、を有する。
【００２６】
また、本発明（請求項５）は、条件グラフ生成手段３０において、
ホップ数および最低関連度を設定する手段と、
グラフのリンクに単語間の関連度が設定された結果グラフ上で、各選択単語から指定されたホップ数以内であり、最低関連度以上の関連度を持つリンクのみを用いて、到達可能なノードが持つ単語を周辺単語として取得し、周辺単語リストとする手段と、
選択単語リストと周辺単語リストとを条件単語リストとし、該条件単語リストの各単語を持つ結果グラフ上のノードを、条件グラフのノードとし、該結果グラフ上のこれらのノード間に存在するリンクを該条件グラフのリンクとして条件グラフを生成する手段と、を有する。
【００２７】
また、本発明（請求項６）は、文書単位グラフ生成手段４０において、
文書ＩＤを用いて分析対象文書データベースから文書を取得する手段と、
文書を、段落、文、文節、連続する文字列、または、連続する単語列からなる文単位に分割する手段と、
条件グラフに含まれる単語を１つでも含む各文単位のみに対して、文単位グラフを生成する手段と、を有する。
【００２９】
本発明（請求項７）は、請求項４乃至６のいずれか１項に記載の文書分析装置を構成する各手段としてコンピュータを機能させるためのプログラムである。
【００３４】
本発明（請求項８）は、請求項７記載のプログラムを格納したコンピュータ読み取り可能な記憶媒体である。
【００３９】
上述のように、本発明は、可視化されたグラフ上の１つまたは、複数の単語を選択すると、選択された単語及びその周辺の単語から構成される条件グラフを自動的に生成し、原文書の中で、条件グラフに類似する文単位（文節、文、段落、連続するｎ単語や、ｎ文字）を要約表現として出力することにより、単語間の関係がどのようなものであるかを正確に判断でき、大量の文書集合の内容を即座に正確に把握することが可能となる。
【００４０】
【発明の実施の形態】
以下、図面と共に、本発明の実施の形態を説明する。
【００４１】
図３は、本発明の一実施の形態における文書分析装置の構成を示す。
【００４２】
同図に示す文書分析装置は、分析実行装置１０、ユーザインタフェース２０、条件グラフ生成装置３０、文単位グラフ生成装置４０、文単位選択装置５０、及び分析対象文書データーベース６０から構成される。
【００４３】
分析実行装置１０は、分析対象文書データーベース６０から文書を取得し、各文書をグラフに変換する。グラフ操作の実行を行い、結果グラフと文書ＩＤリストをユーザインタフェース２０に送る。以下にグラフ操作について説明する。
【００４４】
最初に類似グラフ検索操作（search(Ga,GS)）について説明する。
【００４５】
図４は、本発明の一実施の形態における類似グラフ検索操作を説明するための図である。当該類似グラフ検索操作における入力は、検索条件グラフ（Ｇａ）と、ｎ個の検索対象グラフ（ＧＳ）であり、以下の処理により類似度の高いｍ個のグラフ集合が出力される。
【００４６】
（１）ＧａとＧＳの各グラフとの類似度を計算する。なお、グラフ間の類似度計算手法としては、既存の技術である、例えば、特願平１０−２９７３２１を利用することができる。
【００４７】
（２）類似度の降順にＧＳをソートする。
【００４８】
（３）類似度の高いｍ個のグラフを出力する。
【００４９】
次に、類似グラフ分類操作（clustering(GS)）について説明する。
【００５０】
図５は、本発明の一実施の形態における類似グラフ分類操作を説明するための図である。
【００５１】
当該類似グラフ分類操作における入力は、ｎ個の分類対象グラフ（ＧＳ）であり、以下の処理により、ｋ個のクラスタに分けられた分類対象グラフが出力される。
【００５２】
（１）ＧＳに含まれるグラフ間の類似度を計算する。なお、当該類似度の計算には、類似グラフ検索操作と同様の既存の技術を利用することができる。
【００５３】
（２）類似度に基づき、グラフをｋのクラスタに分類する。
【００５４】
（３）ｋのクラスタを出力する。
【００５５】
次に、部分グラフ抽出操作（extract(Ga,GS)）について説明する。
【００５６】
図６は、本発明の一実施の形態における部分グラフ抽出操作を説明するための図である。
【００５７】
当該グラフ抽出操作における入力は、抽出条件グラフ（Ｇａ）とｎ個の抽出対象グラフ（ＧＳ）であり、以下の処理により、ｎ個の抽出されたグラフが出力される。
【００５８】
（１）ＧＳの各グラフからＧａに基づき部分グラフを抽出する。
【００５９】
（２）抽出された部分グラフの集合を出力する。
【００６０】
図６の例では、Ｇａに含まれるノード（単語‘Ａ’，‘Ｂ’）から１ホップ以内のノードからなる部分グラフを抽出している。部分グラフの抽出アルゴリズムは既存技術による。例えば、特願２０００−６２５６１が利用できる。
【００６１】
次に、グラフ合成操作（merge(GS)）について説明する。
【００６２】
グラフ合成における入力は、ｎ個の合成対象グラフであり、以下の処理により、合成されたグラフが出力される。図７は、本発明の一実施の形態におけるグラフ合成操作を説明するための図である。
【００６３】
（１）ＧＳの中の同じ単語を持つノードを見つけ、その重要度を加算する。
【００６４】
（２）ＧＳの中の同じ単語を両端に持つリンクを見つけ、その関連度を加算する。
【００６５】
（３）このようにして作成されたグラフを出力する。
【００６６】
次に、グラフ差分抽出操作（subtract(Ga,Gb)）について説明する。
【００６７】
グラフ差分抽出操作の入力は、差分対象グラフ（Ｇａ）と、差分抽出条件グラフ（Ｇｂ）であり、以下の処理により、差分グラフが抽出される。図８は、本発明の一実施の形態におけるグラフ差分抽出操作を説明するための図である。
【００６８】
（１）ＧａからＧｂの重要度の減算を行う。
【００６９】
（２）ＧａからＧｂの関連度の減算を行う。
【００７０】
（３）減算された重要度、関連度を持つグラフを出力する。
【００７１】
なお、ここで、減算とは、同じノード（リンク）がある場合には、重要度、関連度を減算し、同じノード（リンク）がない場合には、何も行わない。また、減算した結果、負数になる場合には、そのノード（リンク）を削除する。
【００７２】
これらの操作の入出力は共にグラフリストであるため、任意の順序で組み合わせることが可能である。また、上記のグラフ操作以外でも入出力が共にグラフリストであれば本発明に組み込むことができる。
【００７３】
ユーザインタフェース２０は、分析実行装置１０から取得したグラフを、単語をノード、単語間の関連をリンクとしたグラフによって可視化する。ユーザは可視化されたグラフ上の任意の単語を複数選択することができる。選択された単語のリストを、結果グラフ、文書ＩＤリストと共に、条件グラフ生成装置３０に送る。また、文単位選択装置５０が選択した文単位リストを要約表現として表示する。ここで、文単位とは、文、段落、文節といった文書の論理的な単位か、連続するｎ文字、ｎ単語である。文単位は、その文単位が出現する文書ＩＤと、その文書の中での出現位置を持つ、文単位の中で、ユーザによって選択された単語は、フォントや色を変えて表示する。また、選択単語間の関連に該当する箇所に線を記述する。図９は、本発明の一実施の形態におけるユーザインタフェースの表示例を示す。同図における詳細な説明は、後述する。
【００７４】
条件グラフ生成装置３０は、ユーザインタフェース２０上で選択された単語リストと、分析実行装置１０が出力した結果グラフから条件グラフを生成する。条件グラフの生成方法は、以下の通りである。
【００７５】
（ａ）ユーザが選択した単語を含む結果グラフ上のノードを条件グラフのノードとし、結果グラフ上で、これらのノード間に存在するリンクをリンクとして、条件グラフを作成する方法：
（ｂ）ユーザが選択した単語を含む結果グラフ上のノード、及び、選択された単語の周辺のノードを条件グラフのノードとし、これらのノード間に存在するリンクを条件グラフのリンクとして、条件グラフを生成する方法：
ここで、周辺のノードとは、「選択された単語を持つノードからｍホップ（最短経路のパス長がｍ）以内のノード」や「選択された単語を持つノードからの関連度がｋ以上のノード」等である。ここで、ｍ，ｋは０以上の定数である。
【００７６】
このようにして生成した条件グラフを文単位選択装置５０、文単位グラフ生成装置４０に送る。文単位グラフ生成装置４０には、文書ＩＤリストを合わせて送る。
【００７７】
文単位グラフ生成装置４０は、条件グラフ生成装置３０から取得した条件グラフと文書ＩＤリストを用いて、文単位グラフを以下のステップによって生成する。
【００７８】
（１）文書ＩＤに対応する文書を分析対象文書データーベース６０から取得し、各文書から条件グラフに含まれる単語を１つでも含む文単位を抽出する。
【００７９】
（２）各文単位をグラフに変換する。文単位への変換方法は、次の通りである。
【００８０】
ｉ）文単位から単語を抽出し、出現頻度を計算し、重要度を割り当てる。
【００８１】
ｉｉ）文単位の中の規定区間の中での共出現頻度を計算し、関連度を割り当てる。ここで、規定区間は、文単位または、文単位より短い予め定めた区間である。例えば、文単位が文の場合には、文節や連続する規定数の文字等である。
【００８２】
文単位グラフ、文単位、文書ＩＤ、文単位の出現位置を１つのセットとし、文単位グラフリストに追加する。文書ＩＤリストに対応する全ての文書に対して上記の処理を行い、生成した文単位グラフリストを文単位選択装置５０に送る。
【００８３】
文単位選択装置５０は、条件グラフ生成装置３０から取得した条件グラフ、文単位グラフ生成装置４０から取得した文単位グラフリストを用いて、以下の処理によって最適な文単位を選択する。
【００８４】
（１）条件グラフと各文単位グラフの類似度を計算する。類似度の計算は、例えば、分析実行装置１０において説明した方法を用いることが可能である。つまり、同じ単語が同程度の重要度で使用され、同じ単語間に同程度の関連があるグラフ同士に大きな類似度を割り当てる。
【００８５】
（２）類似度の大きい順にｎ個の文単位を選択する。
【００８６】
上位ｎ個の文単位、各文単位が出現する文書ＩＤ、出現位置をまとめて文単位リストとし、これをユーザインタフェース２０に送る。
【００８７】
以下、上記の構成における動作を説明する。
【００８８】
図１０は、本発明の一実施の形態における文書分析処理の動作のフローチャートである。
【００８９】
ステップ１００）分析実行装置１０が以下の処理を行う。
【００９０】
（ａ）分析対象文書データーベース６０から文書を取得する。
【００９１】
（ｂ）各文書をグラフに変換する。
【００９２】
（ｃ）グラフ集合に対して、ユーザの指定する複数のグラフ操作を実行する。
【００９３】
（ｄ）結果グラフとそのグラフの基となった文書の文書ＩＤをユーザインタフェース２０に送信し、ユーザインタフェース２０が、このようにして得られた結果グラフを、可視化して表示する。
【００９４】
ステップ２００）ユーザインタフェース２０を通して、ユーザが可視化された結果グラフ上の１つまたは、複数の単語を選択し、「要約表現作成」ボタンを押す。
【００９５】
ステップ３００）条件グラフ生成装置３０が、選択された単語リストと結果グラフから、条件グラフを生成する。詳細については後述する。
【００９６】
ステップ４００）文単位グラフ生成装置４０が、条件グラフと文書ＩＤリストに基づき、分析対象文書データーベース６０からの文書の取得、文単位の抽出、文単位グラフの生成を行う。詳細は、後述する。
【００９７】
ステップ５００）文単位選択装置５０が、各文単位グラフと条件グラフとの類似度を計算し、類似度の高いｎ個の文単位を選択する。
【００９８】
ステップ６００）文単位選択装置５０が出力した文単位を、単語及び単語間の関連を色、フォント、線によって示し，出力する。各文単位の出現する文書ＩＤと出現位置を併せて表示する。
【００９９】
次に、上記のステップ３００における条件グラフ生成処理について詳細に説明する。
【０１００】
図１１は、本発明の一実施の形態における条件グラフ生成処理のフローチャートである。
【０１０１】
ステップ３０１）分析実行装置１０の出力した結果グラフとユーザによって選択された選択単語リストを取得する。
【０１０２】
ステップ３０２）単語を格納するための条件単語リスト、追加単語リスト、基本単語リストを空の状態で生成する。周辺単語を取得する際のホップ数ｍと、関連度の最低値ｋを設定する。ここで、前述の条件グラフ生成装置３０の説明で示した（ａ）の方法（選択単語のみで条件グラフを生成）の場合は、ｍ＝０とする。
【０１０３】
ステップ３０３）選択単語リストを、条件単語リスト及び追加単語リストに設定する。
【０１０４】
ステップ３０４）ホップ数ｍの値を判定し、以下の分岐処理を行う。
【０１０５】
ｍ＞０の場合は、ステップ３０５に移行する。
【０１０６】
ｍ≦０の場合は、ステップ３１１に移行する。
【０１０７】
ステップ３０５）追加単語リストを基本単語リストに設定する。また、追加単語リストを空にする。
【０１０８】
ステップ３０６）基本単語リストが空かどうかを判定し、以下の分岐処理を行う。
【０１０９】
空でない場合は、ステップ３０７に移行する。
【０１１０】
空である場合は、ステップ３０９に移行する。
【０１１１】
ステップ３０７）基本単語リストから単語を１つ取り出し、結果グラフ上で、この単語ｉからの関連度がｋ以上であり、これまでの処理で一度も抽出されていないリンク集合を抽出する。
【０１１２】
ステップ３０８）ステップ３０７で抽出した各リンクの単語ｉの逆端の単語を取得する。この単語が、条件単語リスト、追加単語リストのいずれにも存在しない場合、追加単語リストに追加する。
【０１１３】
ステップ３０９）ホップ数ｍを１減算する。
【０１１４】
ステップ３１０）追加単語リストを条件単語リストに追加する。
【０１１５】
ステップ３１１）条件単語リストの中のすべての２つの単語間のリンクを、結果グラフから取得する。
【０１１６】
ステップ３１２）条件単語リストの各単語を持つ結果グラフ上のノード、及びステップ３１１で得られたリンクの集合からグラフを作成する。これを条件グラフとして出力する。
【０１１７】
次に、上記のステップ４００の文単位グラフ生成処理について説明する。
【０１１８】
図１２は、本発明の一実施の形態における文単位グラフ生成処理のフローチャートである。
【０１１９】
ステップ４０１）ステップ３００で生成した条件グラフ、ステップ１００で生成した文書ＩＤリストを取得する。
【０１２０】
ステップ４０２）空の文単位グラフリストを生成する。文単位グラフリストには、文単位グラフ、文単位、文書ＩＤ、文単位の出現位置のセットをリストとして持つことができる。
【０１２１】
ステップ４０３）文書ＩＤリストが空かを判定し、以下の分岐処理を行う。
【０１２２】
空でない場合は、ステップ４０４に移行する。
【０１２３】
空である場合は、ステップ４１２に移行する。
【０１２４】
ステップ４０４）文書ＩＤリストから文書ＩＤを１つ取り出し、文書ＩＤに対応した文書を分析対象文書データーベース６０から取得する。
【０１２５】
ステップ４０５）文書から文単位を抽出し、文単位リストに設定する。
【０１２６】
ステップ４０６）文単位リストが空かを判定し、以下の分岐処理を行う。
【０１２７】
文単位リストが空でない場合は、ステップ４０７に移行する。
【０１２８】
文単位リストが空である場合は、ステップ４０３に移行する。
【０１２９】
ステップ４０７）文単位リストから文単位ｊを１つ取り出し、文単位ｊが条件グラフの単語を１つでも含むかを判定し、以下の分岐処理を行う。
【０１３０】
含む場合は、ステップ４０８に移行する。
【０１３１】
含まない場合は、ステップ４０６に移行する。
【０１３２】
ステップ４０８）文単位ｊから単語を抽出し、各単語の出現頻度を計算することによって、各単語の重要度を計算する。
【０１３３】
ステップ４０９）文単位ｊを規定区間に分割し、各規定区間内での共出現頻度を計算することによって、各単語間の関連度を計算する。
【０１３４】
ステップ４１０）ステップ４０８で計算した単語の重要度を、ノードの重みに設定し、ステップ４０９で計算した単語間の関連度をリンクの重みに設定した文単位グラフｊを生成する。
【０１３５】
ステップ４１１）文単位グラフｊ、文単位ｊ、文書ＩＤ、文単位ｊの出現位置を１つのセットとし、文単位グラフリストへ追加する。
【０１３６】
ステップ４１２）文単位グラフリストを出力する。
【０１３７】
【実施例】
以下、図面と共に本発明の実施例を説明する。
【０１３８】
以下では、前述の図１０のフローチャートに基づいて説明するものとし、上記のステップ１００の分析実行処理によって、図１６のカテゴリ１のグラフが可視化され、ユーザインタフェース２０に表示されているとして説明する。
【０１３９】
ステップ２００）ユーザは、このように可視化されたグラフの中から「検索」「患者」「データ」を選択し、「要約表現作成」ボタンを押す。
【０１４０】
ステップ３００）ユーザの選択したこれらの単語及びその周辺の単語である「医療」「管理」から条件グラフを生成する。ここで、周辺は、１ホップ（ｍ＝１）、最低関連度は、関連度が大きいもの（太いリンク）としてある。ここで作成された条件グラフは、図１３である。
【０１４１】
ステップ４００）カテゴリ１に含まれる各文書の中で、条件グラフの単語を１つでも含む文単位が抽出される。図１４では、文書３、文書５、文書６からそれぞれ、２個、３個、１個の文単位が抽出されている。各文単位の先頭の数字は、その文単位の出現位置（文書の先頭の文単位を１とした連番）を示す。次に、文単位を文単位グラフに変換する。各文単位から単語を抽出し、文節の中で共出現関係を抽出し、グラフを生成している（図１４の右に示す）。
【０１４２】
ステップ５００）図１３の条件グラフと図１４の各文単位グラフの類似度計算が行われ、類似度の高い順に文単位が抽出される。
【０１４３】
ステップ６００）ステップ５００で抽出された文単位をユーザインタフェース２０に表示する。表示例を図９に示す。結果グラフ可視化ウィンドウでは、分析実行装置１０の出力した結果グラフを表示している。ここで、ユーザが選択した単語をノードの色を変えて表示している。要約表現表示ウィンドウでは、ユーザの選択した単語間の関係を最も良く表す文単位上位３個が、文書ＩＤ、出現位置と共に表示されている。各文単位では、ユーザの選択した単語を斜字体・太字、ユーザが選択した単語間の関連が文単位内にある場合には線によって表示している。
【０１４４】
ユーザは、この要約表現から、可視化された結果グラフの「患者」「検索」「データ」の関連が、「患者が何らかのデータを検索する」という関係に対応することを理解することができる。すなわち、可視化されたグラフの一部を指定するだけで、その指定部分を最も良く表す原文書の箇所を即座に確認することができるので、文書や文書集合の内容を正確に判断することができる。
【０１４５】
また、前述の図１０、図１１、図１２に示すフローチャートの動作をプログラムとして構築し、文書分析装置として利用されるコンピュータにインストールし、ＣＰＵ等の制御手段により実行する、または、ネットワークを介して流通させることが可能である。
【０１４６】
また、構築されたプログラムを、文書分析装置として利用されるコンピュータに接続されるハードディスク装置や、フレキシブルディスク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施する際に、コンピュータにインストールして実行することも可能である。
【０１４７】
なお、本発明は、上記の実施の形態及び実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【０１４８】
【発明の効果】
上述のように、本発明によれば、可視化された単語のグラフの一部を選択するだけでその部分を最も良く表す要約表現（原文書の箇所）を確認することができるので、単語間の関係がどのようなものであるかを理解でき、正確に、即座に文書や文書集合の内容を把握することができる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の原理構成図である。
【図３】本発明の一実施の形態における文書分析装置の構成図である。
【図４】本発明の一実施の形態における類似グラフ検索操作を説明するための図である。
【図５】本発明の一実施の形態における類似グラフ分類操作を説明するための図である。
【図６】本発明の一実施の形態における部分グラフ抽出操作を説明するための図である。
【図７】本発明の一実施の形態におけるグラフ合成操作を説明するための図である。
【図８】本発明の一実施の形態におけるグラフ差分抽出操作を説明するための図である。
【図９】本発明の一実施の形態におけるユーザインタフェースの表示例である。
【図１０】本発明の一実施の形態における文書分析処理の動作のフローチャートである。
【図１１】本発明の一実施の形態における条件グラフ生成処理のフローチャートである。
【図１２】本発明の一実施の形態における文単位グラフ生成処理のフローチャートである。
【図１３】本発明の一実施例の条件グラフ生成処理により生成された条件グラフの例である。
【図１４】本発明の一実施例の文単位グラフ生成処理により生成された文単位グラフの例である。
【図１５】従来の分析処理により特許Ａから作成されたグラフ（ＧＡ）の例である。
【図１６】従来の分析処理により得られた結果グラフの例である。
【符号の説明】
１０分析実行手段、分析実行装置
２０ユーザインタフェース
３０条件グラフ生成手段、条件グラフ生成装置
４０文単位グラフ生成手段、文単位グラフ生成装置
５０文単位選択手段、文単位選択装置
６０分析対象文書データーベース
[0001]
[Technical field to which the invention belongs]
The present invention relates to a document analysis method and apparatus, a document analysis program, and a storage medium storing the document analysis program, and more particularly to a document analysis method and apparatus, a document analysis program, and a storage medium storing the document analysis program that analyze a large number of documents and create an overview of their contents.
[0002]
[Prior art]
With the advancement of information technology, the amount of electronic documents that people can access is rapidly expanding. However, since human information processing capabilities are limited, it has become virtually impossible to read a large number of documents one by one. Therefore, there is a need for a technique for creating an overview of the contents of a large number of documents and assisting humans to understand the contents immediately and accurately.
[0003]
Such techniques are shown below.
[0004]
(1) Each document is acquired from the database in which the document set is accumulated.
[0005]
(2) Extract a word from each document, and create a graph with the word as a node and the relation between words as a link. Here, using the word appearance frequency, the word co-occurrence frequency, and the like, the importance of the word and the degree of association between the words are calculated, and set as the node weight and the link weight, respectively.
[0006]
(3) The graph operation specified by the user is executed. Here, the graph operation is a function that takes a set of graphs as an argument and outputs the set of graphs as a result. The graph operation includes similar graph search, similar graph classification, partial graph extraction, graph synthesis, graph difference, and the like.
[0007]
(4) A graph operation is performed on the graph set specified as an argument, and a graph set as an execution result is output. For example, in a similar graph search, if a search condition graph and a set of search target graphs are given as arguments, the top n graphs similar to the search condition graph in the search target graph are obtained as outputs.
[0008]
(5) Visualize the graph output by the graph operation.
[0009]
(6) The user confirms the visualized graph, and repeats the above (3) to (5) until a satisfactory result is obtained.
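Step (2) above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each document has already been split into sentences and tokenized, takes the raw appearance count as the importance of a word, and takes the number of sentences in which two words appear together as their degree of association. The function name `build_graph` and the `(nodes, links)` tuple layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_graph(sentences):
    """Build a word graph from a document given as a list of token lists.

    Returns (nodes, links):
      nodes maps word -> importance (here: appearance frequency, an assumption)
      links maps (word_a, word_b) -> relevance (here: number of sentences in
      which both words appear), with the pair stored in sorted order.
    """
    nodes = Counter()
    links = Counter()
    for tokens in sentences:
        nodes.update(tokens)  # count every occurrence of each word
        # Count each unordered word pair once per sentence.
        for pair in combinations(sorted(set(tokens)), 2):
            links[pair] += 1
    return dict(nodes), dict(links)


doc = [["patient", "data", "search"], ["patient", "data"]]
nodes, links = build_graph(doc)
```

Other weighting schemes (for example, normalized frequencies) would fit the same structure; the text only requires that frequencies and co-occurrence frequencies be used "and the like".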
[0010]
The above processing will be described below using an example of analysis for “patent”.
[0011]
Suppose that a user who has written Patent A, whose content is “applying an information system to the medical field”, wants an overview of other companies' patents related to Patent A. First, Patent A is converted into a graph, which is called GA (FIG. 15). Next, each patent to be analyzed is converted into a graph. Using GA as the search condition graph and the graphs created from the patents as the search target graphs, the 50 most similar graphs are acquired. These 50 graphs are then taken as the graph set to be classified, and a similar graph classification operation is executed to classify them into three categories. The graphs in each category are combined by a graph synthesis operation and visualized. An example of a graph created in this way is shown in FIG. 16. In FIG. 16, character size corresponds to the importance of a word, and link thickness corresponds to the degree of association between words. The number of patents classified into each category and their document IDs are also displayed. From FIG. 16, the contents of the patents classified into each category can be read as follows:
・ Category 1: “System for searching and managing medical data”
・ Category 2: “System for calculating medical expenses deduction”
・ Category 3: “System for taking medical images”
That is, it can be seen that related patents with contents such as categories 1 to 3 have been filed in relation to a patent whose content is “applying an information system to the medical field”.
[0012]
In this way, by combining several graph operations and visualizing the resulting graph, it is possible to judge immediately which words in the document set are important and which words are strongly related to each of them. Therefore, the contents of a document or a document set can be grasped immediately without reading the target documents one by one.
[0013]
[Problems to be solved by the invention]
In the conventional technique described above, it is possible to identify which words are particularly important in the document set and which are strongly related to those words. However, there is a problem that it is not known what kind of relationship there is between these words.
[0014]
For example, in the category 1 graph of FIG. 16, since there is a thick link between “patient”, “data”, and “search”, it can be seen that there is a strong association between these words. However, it cannot be determined whether the relationship between these words is the following (a) or (b).
[0015]
(a) The relationship “the target of the search is patient data”
Example sentence: “A system that allows medical staff to search patient data instantly”
(b) The relationship “the patient searches for some data”
Example sentence: “A system that allows patients themselves to search medical data”
As described above, there is a problem in that it is difficult to know the relationship between words simply by looking at a visualized graph, and the contents of a document or a document set cannot be accurately grasped.
[0016]
The present invention has been made in view of the above points, can accurately determine the relationship between words, and can quickly and accurately grasp the contents of a large collection of documents. An object of the present invention is to provide a document analysis method and apparatus, a document analysis program, and a storage medium storing the document analysis program.
[0017]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0018]
  The present invention (claim 1) is a document analysis method for displaying sentence-unit summary expressions of documents, in which:
  an analysis execution device reads documents from an analysis target document database that stores the documents to be analyzed, creates for the documents a result graph with words as nodes and associations between words as links, and outputs the created result graph to a user interface (step 1);
  a condition graph generation device acquires a selected word list consisting of one or more words selected by the user from the result graph displayed by the user interface, and generates a condition graph whose nodes are the nodes of the result graph containing the words in the selected word list and whose links are the links existing between these nodes on the result graph (step 2);
  a sentence unit graph generation device acquires, based on a document ID list obtained from the analysis execution device, the documents indicated by the document IDs from the analysis target document database, extracts from each acquired document the sentence units that contain at least one word included in the condition graph, and generates a sentence unit graph for each extracted sentence unit (step 3);
  a sentence unit selection device calculates the similarity between the condition graph and each sentence unit graph, selects a predetermined number of sentence units with high similarity, and outputs them to the user interface (step 4); and
  the user interface displays the selected sentence units as summary expressions (step 5).
[0020]
  In addition, in the present invention (claim 2), when the condition graph generation device generates the condition graph (step 2):
  a hop count and a minimum relevance are set;
  on the result graph, whose links carry the relevance between words, the words of the nodes reachable from each selected word within the specified hop count, using only links whose relevance is at least the minimum relevance, are acquired as peripheral words and form a peripheral word list;
  the selected word list and the peripheral word list together form a condition word list; and
  the condition graph is generated with the nodes on the result graph that have the words in the condition word list as its nodes, and the links existing between these nodes on the result graph as its links.
[0021]
  In addition, in the present invention (claim 3), when the sentence unit graph generation device generates the sentence unit graphs (step 3):
  a document is acquired from the analysis target document database using the document ID;
  the document is divided into sentence units, each consisting of a paragraph, a sentence, a clause, a continuous character string, or a continuous word string; and
  a sentence unit graph is generated only for each sentence unit that contains at least one word included in the condition graph.
[0023]
FIG. 2 is a principle configuration diagram of the present invention.
[0024]
  The present invention (claim 4) is a document analysis apparatus for displaying sentence-unit summary expressions of documents, comprising:
  an analysis target document database 60 that stores the documents to be analyzed;
  analysis execution means 10 for reading documents from the analysis target document database 60, creating for the documents a result graph with words as nodes and associations between words as links, and outputting the created result graph to a user interface 20;
  condition graph generation means 30 for acquiring a selected word list consisting of one or more words selected by the user from the result graph displayed by the user interface 20, and generating a condition graph whose nodes are the nodes of the result graph containing the words in the selected word list and whose links are the links existing between these nodes on the result graph;
  sentence unit graph generation means 40 for acquiring, based on a document ID list obtained from the analysis execution means 10, the documents indicated by the document IDs from the analysis target document database, extracting from each acquired document the sentence units that contain at least one word included in the condition graph, and generating a sentence unit graph for each extracted sentence unit; and
  sentence unit selection means 50 for calculating the similarity between the condition graph and each sentence unit graph, selecting a predetermined number of sentence units with high similarity, and outputting them to the user interface.
[0026]
  In addition, in the present invention (claim 5), the condition graph generation means 30 comprises:
  means for setting a hop count and a minimum relevance;
  means for acquiring as peripheral words, on the result graph whose links carry the relevance between words, the words of the nodes reachable from each selected word within the specified hop count using only links whose relevance is at least the minimum relevance, and forming a peripheral word list from them; and
  means for forming a condition word list from the selected word list and the peripheral word list, and generating the condition graph with the nodes on the result graph that have the words in the condition word list as its nodes and the links existing between these nodes on the result graph as its links.
[0027]
  In addition, in the present invention (claim 6), the sentence unit graph generation means 40 comprises:
  means for acquiring a document from the analysis target document database using the document ID;
  means for dividing the document into sentence units, each consisting of a paragraph, a sentence, a clause, a continuous character string, or a continuous word string; and
  means for generating a sentence unit graph only for each sentence unit that contains at least one word included in the condition graph.
[0029]
  The present invention (claim 7) is a program for causing a computer to function as each of the means constituting the document analysis apparatus according to any one of claims 4 to 6.
[0034]
  The present invention (claim 8) is a computer-readable storage medium storing the program according to claim 7.
[0039]
As described above, when one or more words on the visualized graph are selected, the present invention automatically generates a condition graph composed of the selected words and their surrounding words, and outputs as summary expressions the sentence units in the original documents (clauses, sentences, paragraphs, or n consecutive words or characters) that are similar to the condition graph. This makes it possible to determine accurately what kind of relationship exists between words, and to grasp the contents of a large document set immediately and accurately.
[0040]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0041]
FIG. 3 shows the configuration of the document analysis apparatus according to the embodiment of the present invention.
[0042]
The document analysis apparatus shown in FIG. 3 includes an analysis execution device 10, a user interface 20, a condition graph generation device 30, a sentence unit graph generation device 40, a sentence unit selection device 50, and an analysis target document database 60.
[0043]
The analysis execution device 10 acquires documents from the analysis target document database 60 and converts each document into a graph. It then executes the graph operations and sends the result graph and the document ID list to the user interface 20. The graph operations are described below.
[0044]
First, the similar graph search operation (search(Ga, GS)) will be described.
[0045]
FIG. 4 is a diagram for explaining the similar graph search operation according to the embodiment of the present invention. The inputs to the similar graph search operation are a search condition graph (Ga) and n search target graphs (GS), and the set of the m graphs with the highest similarity is output by the following processing.
[0046]
(1) The similarity between Ga and each graph in GS is calculated. As a method for calculating the similarity between graphs, an existing technique such as Japanese Patent Application No. 10-297321 can be used.
[0047]
(2) Sort GS in descending order of similarity.
[0048]
(3) Output m graphs with high similarity.
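The three steps of the search operation can be sketched as below. The similarity measure of the cited application is not reproduced in this document, so the sketch substitutes a plain cosine similarity over the combined node and link weights; that choice, the `(nodes, links)` tuple layout, and the function names are assumptions made only for illustration.

```python
import math


def _vector(graph):
    """Flatten a (nodes, links) graph into one sparse weight vector."""
    nodes, links = graph
    vec = {("node", word): w for word, w in nodes.items()}
    vec.update({("link", edge): w for edge, w in links.items()})
    return vec


def similarity(g1, g2):
    """Cosine similarity of two graphs (a stand-in, not the cited method)."""
    v1, v2 = _vector(g1), _vector(g2)
    dot = sum(w * v2.get(key, 0.0) for key, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def search(ga, gs, m):
    """Return the m graphs in gs most similar to ga, most similar first."""
    # Steps (1)-(2): score every graph and sort by descending similarity.
    ranked = sorted(gs, key=lambda g: similarity(ga, g), reverse=True)
    # Step (3): keep the top m.
    return ranked[:m]
```

With this measure, graphs that use the same words with comparable importance and the same links with comparable relevance receive a high score, which matches the qualitative behaviour the description asks of the similarity.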
[0049]
Next, the similar graph classification operation (clustering(GS)) will be described.
[0050]
FIG. 5 is a diagram for explaining a similar graph classification operation according to the embodiment of the present invention.
[0051]
The input to the similar graph classification operation is n classification target graphs (GS), and the classification target graphs divided into k clusters are output by the following processing.
[0052]
(1) The similarity between the graphs included in GS is calculated. The same existing technique as in the similar graph search operation can be used to calculate the similarity.
[0053]
(2) Classify the graph into k clusters based on the similarity.
[0054]
(3) Output k clusters.
[0055]
Next, the subgraph extraction operation (extract(Ga, GS)) will be described.
[0056]
FIG. 6 is a diagram for explaining the subgraph extraction operation according to the embodiment of the present invention.
[0057]
Inputs in the graph extraction operation are an extraction condition graph (Ga) and n extraction target graphs (GS), and n extracted graphs are output by the following processing.
[0058]
(1) A partial graph is extracted from each graph of GS based on Ga.
[0059]
(2) Output a set of extracted subgraphs.
[0060]
In the example of FIG. 6, a subgraph composed of nodes within one hop from the nodes (words 'A', 'B') included in Ga is extracted. The subgraph extraction algorithm is based on existing technology. For example, Japanese Patent Application No. 2000-62561 can be used.
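The FIG. 6 behaviour (keep the nodes within one hop of the words in Ga) can be illustrated with a simple breadth-first expansion. The algorithm of the cited application is not given here, so this is only a sketch under assumptions: graphs are `(nodes, links)` tuples, link keys are word pairs, and the function name `extract` and its `hops` parameter are hypothetical.

```python
def extract(condition_words, graph, hops=1):
    """Return the subgraph of `graph` within `hops` links of the given words."""
    nodes, links = graph
    keep = {w for w in condition_words if w in nodes}
    frontier = set(keep)
    for _ in range(hops):
        reached = set()
        for a, b in links:
            # Follow each link outward from the current frontier.
            if a in frontier and b not in keep:
                reached.add(b)
            if b in frontier and a not in keep:
                reached.add(a)
        keep |= reached
        frontier = reached
    sub_nodes = {w: nodes[w] for w in keep}
    # Keep only the links whose two endpoints both survived.
    sub_links = {e: w for e, w in links.items() if e[0] in keep and e[1] in keep}
    return sub_nodes, sub_links


def extract_all(condition_words, gs, hops=1):
    """Apply extract to each of the n target graphs, giving n subgraphs."""
    return [extract(condition_words, g, hops) for g in gs]
```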
[0061]
Next, the graph composition operation (merge (GS)) will be described.
[0062]
Inputs in graph synthesis are n synthesis target graphs, and a synthesized graph is output by the following processing. FIG. 7 is a diagram for explaining a graph composition operation according to an embodiment of the present invention.
[0063]
(1) Find a node having the same word in GS and add its importance.
[0064]
(2) Find links having the same word at both ends in the GS, and add their relevance.
[0065]
(3) The graph created in this way is output.
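A minimal sketch of the merge steps, assuming graphs are `(nodes, links)` tuples of dictionaries and that link keys are stored in a consistent order so that the same word pair always maps to the same key. The function name `merge` mirrors the operation name in the text; everything else is an illustrative assumption.

```python
from collections import Counter


def merge(gs):
    """Combine graphs by summing the weights of identical nodes and links."""
    nodes = Counter()
    links = Counter()
    for g_nodes, g_links in gs:
        nodes.update(g_nodes)  # step (1): add importance of same-word nodes
        links.update(g_links)  # step (2): add relevance of same-word links
    return dict(nodes), dict(links)  # step (3): the merged graph
```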
[0066]
Next, the graph difference extraction operation (subtract(Ga, Gb)) will be described.
[0067]
The inputs of the graph difference extraction operation are the difference target graph (Ga) and the difference extraction condition graph (Gb), and the difference graph is extracted by the following processing. FIG. 8 is a diagram for explaining the graph difference extraction operation according to the embodiment of the present invention.
[0068]
(1) The importance of Gb is subtracted from Ga.
[0069]
(2) The degree of association of Gb is subtracted from Ga.
[0070]
(3) A graph having the subtracted importance and relevance is output.
[0071]
Here, subtraction means subtracting the importance and relevance when there is the same node (link), and nothing if there is no same node (link). If the subtraction results in a negative number, the node (link) is deleted.
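The subtraction rule can be sketched as follows, again assuming `(nodes, links)` tuples. Nodes and links missing from Gb are left unchanged, and items whose weight becomes negative are deleted, as stated above. The text does not say what happens to a link when one of its endpoint nodes is deleted; this sketch drops such links, which is an assumption.

```python
def subtract(ga, gb):
    """Subtract the weights of gb from ga, deleting items that go negative."""
    a_nodes, a_links = ga
    b_nodes, b_links = gb
    nodes = {}
    for word, weight in a_nodes.items():
        diff = weight - b_nodes.get(word, 0)  # unchanged if absent from gb
        if diff >= 0:                         # negative result: delete node
            nodes[word] = diff
    links = {}
    for edge, weight in a_links.items():
        diff = weight - b_links.get(edge, 0)
        # Assumption: also drop links that lost an endpoint node.
        if diff >= 0 and edge[0] in nodes and edge[1] in nodes:
            links[edge] = diff
    return nodes, links
```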
[0072]
Since both inputs and outputs of these operations are graph lists, they can be combined in any order. In addition to the above graph operations, if both input and output are graph lists, they can be incorporated into the present invention.
[0073]
The user interface 20 visualizes the graph acquired from the analysis execution device 10 as a graph in which words are nodes and associations between words are links. The user can select any number of words on the visualized graph. The list of selected words is sent to the condition graph generation device 30 together with the result graph and the document ID list. The user interface 20 also displays the sentence unit list selected by the sentence unit selection device 50 as summary expressions. Here, a sentence unit is either a logical unit of a document, such as a sentence, paragraph, or clause, or a run of n consecutive characters or n consecutive words. Each sentence unit carries the ID of the document in which it appears and its position within that document. Within a sentence unit, the words selected by the user are displayed in a different font or color, and a line is drawn at the places that correspond to associations between the selected words. FIG. 9 shows a display example of the user interface according to the embodiment of the present invention. The figure is described in detail later.
[0074]
The condition graph generation device 30 generates a condition graph from the word list selected on the user interface 20 and the result graph output by the analysis execution device 10. The method for generating the condition graph is as follows.
[0075]
(A) A method of generating a condition graph in which the nodes on the result graph that contain the words selected by the user are used as the nodes of the condition graph, and the links existing between these nodes on the result graph are used as its links.
(B) A method of generating a condition graph in which the nodes on the result graph that contain the words selected by the user, together with the nodes surrounding them, are used as the nodes of the condition graph, and the links existing between these nodes are used as the links of the condition graph.
Here, the surrounding nodes are, for example, "nodes within m hops of a node having a selected word (that is, the shortest path length is m or less)" or "nodes whose relevance to a node having a selected word is k or more". Here, m and k are constants of 0 or more.
[0076]
The condition graph generated in this way is sent to the sentence unit selection device 50 and the sentence unit graph generation device 40. A document ID list is also sent to the sentence unit graph generation device 40.
[0077]
The sentence unit graph generation device 40 uses the condition graph acquired from the condition graph generation device 30 and the document ID list to generate sentence unit graphs according to the following steps.
[0078]
(1) A document corresponding to the document ID is acquired from the analysis target document database 60, and a sentence unit including at least one word included in the condition graph is extracted from each document.
[0079]
(2) Convert each sentence unit into a graph. The method of converting a sentence unit into a graph is as follows.
[0080]
i) Extract words from sentence units, calculate appearance frequency, and assign importance.
[0081]
ii) Calculate the co-occurrence frequency within each specified section of the sentence unit, and assign the relevance. Here, the specified section is the sentence unit itself or a predetermined section shorter than the sentence unit. For example, when the sentence unit is a sentence, the specified section is a clause, a specified number of consecutive characters, or the like.
[0082]
The sentence unit graph, the sentence unit, the document ID, and the appearance position of the sentence unit are combined into one set and added to the sentence unit graph list. The above processing is performed on all documents corresponding to the document ID list, and the generated sentence unit graph list is sent to the sentence unit selection device 50.
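As a non-limiting illustration, the conversion of one sentence unit into a graph (items i and ii above) may be sketched in Python as follows. The function names, the regular-expression tokenizer, and the use of commas and semicolons as boundaries of the specified section are assumptions made for this sketch. Japanese text would need a morphological analyzer in place of the tokenizer shown here.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text: str) -> list:
    """Assumed tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_unit_to_graph(sentence_unit: str, section_pattern: str = r"[,;]"):
    """Return (importance, relevance) for one sentence unit.

    importance: word -> appearance frequency in the sentence unit
    relevance:  frozenset({w1, w2}) -> number of specified sections containing both words
    """
    # i) Appearance frequency of each word becomes its importance.
    importance = Counter(tokenize(sentence_unit))

    # ii) Co-occurrence frequency within each specified section becomes the relevance.
    relevance = Counter()
    for section in re.split(section_pattern, sentence_unit):
        words = sorted(set(tokenize(section)))
        for w1, w2 in combinations(words, 2):
            relevance[frozenset((w1, w2))] += 1

    return dict(importance), dict(relevance)
```

Counting each pair at most once per section is a design choice of the sketch; the description only states that a co-occurrence frequency is calculated.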
[0083]
The sentence unit selection device 50 uses the condition graph acquired from the condition graph generation device 30 and the sentence unit graph list acquired from the sentence unit graph generation device 40 to select the optimal sentence units by the following processing.
[0084]
(1) The similarity between the condition graph and each sentence unit graph is calculated. For example, the method described for the analysis execution device 10 can be used to calculate the similarity. That is, a higher similarity is assigned to graphs in which the same words appear with comparable importance and the same pairs of words have comparable relevance.
[0085]
(2) Select n sentence units in descending order of similarity.
[0086]
The top n sentence units, the document ID in which each sentence unit appears, and the appearance position are collected into a sentence unit list, which is sent to the user interface 20.
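As a non-limiting illustration, the selection of the top n sentence units may be sketched in Python as follows. The similarity measure used here, cosine similarity over node and link weights, is an assumption chosen only for the sketch; it stands in for the method described for the analysis execution device 10. The function names and the tuple layout of each record are also hypothetical.

```python
import math


def graph_similarity(g1, g2) -> float:
    """Cosine similarity over node and link weights (an assumed measure).

    Each graph is a (nodes, links) pair of dicts mapping a word, or a
    frozenset of two words, to a weight.
    """
    v1 = {**g1[0], **g1[1]}
    v2 = {**g2[0], **g2[1]}
    dot = sum(weight * v2.get(key, 0.0) for key, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def select_sentence_units(condition_graph, sentence_unit_graph_list, n):
    """Return the n sentence units most similar to the condition graph.

    Each input record is (graph, sentence_unit, document_id, position);
    each output record is (sentence_unit, document_id, position).
    """
    ranked = sorted(
        sentence_unit_graph_list,
        key=lambda record: graph_similarity(condition_graph, record[0]),
        reverse=True,  # descending order of similarity
    )
    return [(unit, doc_id, pos) for _, unit, doc_id, pos in ranked[:n]]
```

Because shared words and shared links both contribute to the dot product, a sentence unit graph that reproduces the same words and the same word-to-word links as the condition graph scores higher than one that merely shares a word.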
[0087]
The operation in the above configuration will be described below.
[0088]
FIG. 10 is a flowchart of the operation of the document analysis process according to the embodiment of the present invention.
[0089]
Step 100) The analysis execution apparatus 10 performs the following processing.
[0090]
(A) A document is acquired from the analysis target document database 60.
[0091]
(B) Convert each document into a graph.
[0092]
(C) A plurality of graph operations specified by the user are executed on the graph set.
[0093]
(D) The result graph and the document ID of the document on which the graph is based are transmitted to the user interface 20, and the user interface 20 visualizes and displays the result graph thus obtained.
[0094]
Step 200) Through the user interface 20, the user selects one or more words on the visualized result graph and presses the “Create Summary Expression” button.
[0095]
Step 300) The condition graph generating device 30 generates a condition graph from the selected word list and the result graph. Details will be described later.
[0096]
Step 400) The sentence unit graph generation apparatus 40 acquires a document from the analysis target document database 60, extracts a sentence unit, and generates a sentence unit graph based on the condition graph and the document ID list. Details will be described later.
[0097]
Step 500) The sentence unit selection device 50 calculates the similarity between each sentence unit graph and the condition graph, and selects the n sentence units with the highest similarity.
[0098]
Step 600) The sentence units output by the sentence unit selection device 50 are displayed with the selected words and the relationships between them indicated by color, font, and lines. The document ID of the document in which each sentence unit appears and its appearance position are displayed together.
[0099]
Next, the condition graph generation process in step 300 will be described in detail.
[0100]
FIG. 11 is a flowchart of a condition graph generation process according to an embodiment of the present invention.
[0101]
Step 301) Obtain the result graph output from the analysis execution apparatus 10 and the selected word list selected by the user.
[0102]
Step 302) An empty conditional word list, additional word list, and basic word list are generated for storing words. The number of hops m used when acquiring surrounding words and the minimum relevance k are set. Here, in the case of method (A) shown in the description of the condition graph generation device 30 (the condition graph is generated using only the selected words), m = 0 is set.
[0103]
Step 303) The selected word list is set to the conditional word list and the additional word list.
[0104]
Step 304) The value of the hop number m is determined, and the following branch processing is performed.
[0105]
If m> 0, the process proceeds to step 305.
[0106]
If m ≦ 0, the process proceeds to step 311.
[0107]
Step 305) The additional word list is set as the basic word list. Also, the additional word list is emptied.
[0108]
Step 306) It is determined whether or not the basic word list is empty, and the following branch processing is performed.
[0109]
If not empty, the process proceeds to step 307.
[0110]
If it is empty, the process proceeds to step 309.
[0111]
Step 307) One word i is taken out from the basic word list, and the set of links on the result graph whose relevance from the word i is k or more and that have not been extracted in the processing so far is extracted.
[0112]
Step 308) For each link extracted in Step 307, the word at the end opposite the word i is acquired. If this word does not exist in either the conditional word list or the additional word list, it is added to the additional word list.
[0113]
Step 309) Subtract 1 from the number of hops m.
[0114]
Step 310) Add the additional word list to the conditional word list.
[0115]
Step 311) The links between every pair of words in the conditional word list are obtained from the result graph.
[0116]
Step 312) A graph is created from the node on the result graph having each word in the conditional word list and the set of links obtained in Step 311. This is output as a condition graph.
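As a non-limiting illustration, Steps 301 to 312 may be sketched in Python as follows. The function name and the dictionary representation of the result graph are hypothetical. The sketch assumes that processing returns to Step 306 after Step 308 and to Step 304 after Step 310, which the text above does not state explicitly, and that the result graph contains no link from a word to itself.

```python
def generate_condition_graph(result_nodes, result_links, selected_words, m=1, k=0.0):
    """Build a condition graph from the selected words and their surrounding words.

    result_nodes: word -> importance
    result_links: frozenset({w1, w2}) -> relevance
    m: number of hops to expand (m = 0 corresponds to method (A))
    k: minimum relevance of a link that may be followed
    """
    # Step 303: the selected words seed both the conditional and additional lists.
    condition_words = [w for w in selected_words if w in result_nodes]
    additional_words = list(condition_words)
    extracted_links = set()

    while m > 0:                                              # Step 304
        base_words, additional_words = additional_words, []   # Step 305
        for word in base_words:                               # Steps 306-307
            for link, relevance in result_links.items():
                if word in link and relevance >= k and link not in extracted_links:
                    extracted_links.add(link)
                    (other,) = link - {word}                  # Step 308: opposite end
                    if other not in condition_words and other not in additional_words:
                        additional_words.append(other)
        m -= 1                                                # Step 309
        condition_words.extend(additional_words)              # Step 310

    # Steps 311-312: keep every result-graph link whose two ends are both selected.
    nodes = {w: result_nodes[w] for w in condition_words}
    links = {link: r for link, r in result_links.items()
             if all(w in nodes for w in link)}
    return nodes, links
```

Note that k restricts only which links are followed during expansion; in Step 311 every link between two words of the conditional word list is taken from the result graph regardless of its relevance.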
[0117]
Next, the sentence unit graph generation process in step 400 will be described.
[0118]
FIG. 12 is a flowchart of sentence unit graph generation processing according to an embodiment of the present invention.
[0119]
Step 401) The condition graph generated in Step 300 and the document ID list generated in Step 100 are acquired.
[0120]
Step 402) Generate an empty sentence unit graph list. The sentence unit graph list holds a list of sets, each consisting of a sentence unit graph, a sentence unit, a document ID, and the appearance position of the sentence unit.
[0121]
Step 403) It is determined whether the document ID list is empty, and the following branch processing is performed.
[0122]
If not empty, the process proceeds to step 404.
[0123]
If it is empty, the process proceeds to step 412.
[0124]
Step 404) One document ID is extracted from the document ID list, and a document corresponding to the document ID is acquired from the analysis target document database 60.
[0125]
Step 405) Extract sentence units from the document and set them in the sentence unit list.
[0126]
Step 406) It is determined whether the sentence unit list is empty, and the following branch processing is performed.
[0127]
If the sentence unit list is not empty, the process proceeds to step 407.
[0128]
If the sentence unit list is empty, the process proceeds to step 403.
[0129]
Step 407) One sentence unit j is extracted from the sentence unit list, and it is determined whether the sentence unit j includes at least one word in the condition graph, and the following branch processing is performed.
[0130]
If included, the process proceeds to step 408.
[0131]
If not included, the process proceeds to step 406.
[0132]
Step 408) Words are extracted from the sentence unit j, and the importance of each word is calculated from its appearance frequency.
[0133]
Step 409) The sentence unit j is divided into specified sections, and the relevance between words is calculated from the co-occurrence frequency in each specified section.
[0134]
Step 410) A sentence unit graph j is generated in which the importance of each word calculated in Step 408 is set as the node weight and the relevance between words calculated in Step 409 is set as the link weight.
[0135]
Step 411) The sentence unit graph j, the sentence unit j, the document ID, and the appearance position of the sentence unit j are combined into one set and added to the sentence unit graph list.
[0136]
Step 412) The sentence unit graph list is output.
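As a non-limiting illustration, the control flow of Steps 401 to 412 may be sketched in Python as follows. The function name, the in-memory dictionary standing in for the analysis target document database 60, the splitting of documents into sentences at ".", "!" or "?", and the to_graph callable that performs Steps 408 to 410 are all assumptions made for this sketch.

```python
import re


def build_sentence_unit_graph_list(condition_words, document_ids, database, to_graph):
    """Collect (graph, sentence_unit, document_id, position) records.

    database: document ID -> document text (stand-in for database 60)
    to_graph: callable converting one sentence unit into a graph (Steps 408-410)
    """
    condition_words = set(condition_words)
    sentence_unit_graph_list = []                                   # Step 402
    for document_id in document_ids:                                # Steps 403-404
        document = database[document_id]
        # Step 405. Assumption: a sentence unit is a sentence ending in ., ! or ?
        sentence_units = [s.strip()
                          for s in re.split(r"(?<=[.!?])\s+", document)
                          if s.strip()]
        for position, unit in enumerate(sentence_units, start=1):   # Step 406
            words = set(re.findall(r"[a-z0-9]+", unit.lower()))
            if not words & condition_words:                         # Step 407
                continue  # no word of the condition graph appears in this unit
            sentence_unit_graph_list.append(                        # Step 411
                (to_graph(unit), unit, document_id, position))
    return sentence_unit_graph_list                                 # Step 412
```

The appearance position is counted over all sentence units of the document, including those that are skipped, so that it remains a serial number within the original document.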
[0137]
【Example】
Embodiments of the present invention will be described below with reference to the drawings.
[0138]
The following description is based on the flowchart of FIG. 10 described above, and it is assumed that the category 1 graph of FIG. 16 is visualized and displayed on the user interface 20 by the analysis execution process of step 100 described above.
[0139]
Step 200) The user selects “Search”, “Patient”, and “Data” from the graph visualized in this way, and presses the “Create summary expression” button.
[0140]
Step 300) A condition graph is generated from these words selected by the user and surrounding words “medical” and “management”. Here, the periphery is 1 hop (m = 1), and the minimum relevance is set to a high relevance (thick links). The condition graph created here is shown in FIG. 13.
[0141]
Step 400) In each document included in category 1, the sentence units including at least one word of the condition graph are extracted. In FIG. 14, two, three, and one sentence units are extracted from document 3, document 5, and document 6, respectively. The number at the beginning of each sentence unit indicates its appearance position (a serial number, with the first sentence unit of the document as 1). Next, each sentence unit is converted into a sentence unit graph: words are extracted from each sentence unit, co-occurrence relationships are extracted within each phrase, and a graph is generated (shown on the right in FIG. 14).
[0142]
Step 500) Similarity calculation between the condition graph of FIG. 13 and each sentence unit graph of FIG. 14 is performed, and sentence units are extracted in descending order of similarity.
[0143]
Step 600) The sentence units extracted in Step 500 are displayed on the user interface 20. A display example is shown in FIG. 9. The result graph visualization window displays the result graph output by the analysis execution apparatus 10, with the words selected by the user indicated by a change in node color. The summary expression display window displays the top three sentence units that best represent the relationships between the words selected by the user, together with their document IDs and appearance positions. In each sentence unit, the words selected by the user are displayed in bold italics, and where a relationship between the selected words is present in the sentence unit, it is indicated by a line.
[0144]
From this summary expression, the user can understand that the relationship between “patient”, “search”, and “data” in the visualized result graph corresponds to the relationship “the patient searches for some data”. In other words, simply by specifying a part of the visualized graph, the user can immediately confirm the location in the original document that best represents the specified part, and can accurately grasp its content.
[0145]
Further, the operations of the flowcharts shown in FIGS. 10, 11, and 12 can be constructed as a program, which can be installed in a computer used as a document analysis apparatus and executed by a control means such as a CPU, or distributed via a network.
[0146]
Further, the constructed program can be stored in a hard disk device connected to a computer used as a document analysis apparatus, or in a portable storage medium such as a flexible disk or a CD-ROM, and can be installed on a computer and executed when the present invention is carried out.
[0147]
The present invention is not limited to the above-described embodiments and examples, and various modifications and applications are possible within the scope of the claims.
[0148]
【Effect of the Invention】
As described above, according to the present invention, simply by selecting a portion of the visualized word graph, the user can confirm the summary expression (the portion of the original document) that best represents that portion. The user can thereby understand what the relationships between the words are, and can accurately and immediately grasp the contents of the document or document set.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of a document analysis apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a similar graph search operation according to the embodiment of the present invention.
FIG. 5 is a diagram for explaining a similar graph classification operation in one embodiment of the present invention.
FIG. 6 is a diagram for explaining a subgraph extraction operation according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a graph composition operation according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining a graph difference extraction operation according to an embodiment of the present invention.
FIG. 9 is a display example of a user interface according to an embodiment of the present invention.
FIG. 10 is a flowchart of the operation of document analysis processing in one embodiment of the present invention.
FIG. 11 is a flowchart of a condition graph generation process according to an embodiment of the present invention.
FIG. 12 is a flowchart of sentence unit graph generation processing according to the embodiment of the present invention.
FIG. 13 is an example of a condition graph generated by a condition graph generation process according to an embodiment of the present invention.
FIG. 14 is an example of a sentence unit graph generated by a sentence unit graph generation process according to an embodiment of the present invention.
FIG. 15 is an example of a graph (GA) created from Patent A by conventional analysis processing.
FIG. 16 is an example of a result graph obtained by a conventional analysis process.
[Explanation of symbols]
10 Analysis execution means, analysis execution device
20 User interface
30 Condition graph generation means, condition graph generation device
40 Sentence unit graph generation means, sentence unit graph generation device
50 Sentence unit selection means, sentence unit selection device
60 Analysis target document database

Claims

In a document analysis method for displaying a summary representation of sentences in a document,
The analysis execution device reads a document from the analysis target document database that stores documents to be analyzed, creates, for the document, a result graph in which words are nodes and relations between words are links, and outputs the created result graph to the user interface,
The condition graph generation device acquires a selected word list including one or a plurality of words selected by the user from the result graph displayed by the user interface,
and generates a condition graph in which the nodes of the result graph that include words in the selected word list are used as the nodes of the condition graph, and the links existing between these nodes on the result graph are used as its links,
The sentence unit graph generation device acquires, based on the document ID list acquired from the analysis execution device, each document indicated by a document ID from the analysis target document database, extracts from each acquired document the sentence units that include at least one word included in the condition graph, and generates a sentence unit graph for each extracted sentence unit,
The sentence unit selection device calculates a similarity between the condition graph and the sentence unit graph, selects a predetermined number of sentence units having a high similarity, and outputs the selected sentence unit to the user interface.
The document analysis method, wherein the user interface displays a selected sentence unit as a summary expression.

In the condition graph generation device, when generating the condition graph,
The number of hops and the minimum degree of association are set,
In the result graph, in which the degree of association between words is set on the links of the graph, the words held by the nodes that can be reached from each selected word within the specified number of hops, using only links whose degree of association is equal to or greater than the minimum degree of association, are acquired as peripheral words and made into a peripheral word list,
The selected word list and the peripheral word list are combined into a conditional word list,
The document analysis method according to claim 1, wherein the condition graph is generated with the nodes on the result graph having each word in the conditional word list as the nodes of the condition graph and the links existing between these nodes on the result graph as the links of the condition graph.

In the sentence unit graph generation device, when generating the sentence unit graph,
Using the document ID to obtain a document from the analysis target document database;
Dividing the document into sentence units consisting of paragraphs, sentences, clauses, continuous character strings, or continuous word strings;
The document analysis method according to claim 1, wherein the sentence unit graph is generated only for each sentence unit including at least one word included in the condition graph.

A document analysis device that displays a summary representation of a sentence unit in a document,
An analysis target document database for storing documents to be analyzed;
An analysis execution unit that reads a document from the analysis target document database, creates, for the document, a result graph in which words are nodes and relations between words are links, and outputs the created result graph to a user interface;
A condition graph generation means for acquiring a selected word list consisting of one or more words selected by the user from the result graph displayed by the user interface, and generating a condition graph in which the nodes of the result graph that include words in the selected word list are used as the nodes of the condition graph and the links existing between these nodes on the result graph are used as its links;
A sentence unit graph generation means for acquiring, based on the document ID list acquired from the analysis execution unit, each document indicated by a document ID from the analysis target document database, extracting from each acquired document the sentence units that include at least one word included in the condition graph, and generating a sentence unit graph for each extracted sentence unit;
A sentence unit selection means for calculating a similarity between the condition graph and each sentence unit graph, selecting a predetermined number of sentence units having a high similarity, and outputting them to the user interface;
A document analysis apparatus comprising:

The condition graph generation means includes:
Means for setting the number of hops and the minimum degree of association;
Means for acquiring, in the result graph in which the degree of association between words is set on the links of the graph, the words held by the nodes that can be reached from each selected word within the specified number of hops, using only links whose degree of association is equal to or greater than the minimum degree of association, as peripheral words, and making them into a peripheral word list;
The document analysis apparatus according to claim 4, further comprising: means for combining the selected word list and the peripheral word list into a conditional word list, and generating the condition graph with the nodes on the result graph having each word of the conditional word list as the nodes of the condition graph and the links existing between these nodes on the result graph as the links of the condition graph.

The sentence unit graph generation means includes:
Means for obtaining a document from the analysis target document database using the document ID;
Means for dividing the document into sentence units composed of paragraphs, sentences, clauses, continuous character strings, or continuous word strings;
The document analysis apparatus according to claim 4, further comprising: means for generating the sentence unit graph only for each sentence unit including at least one word included in the condition graph.

A program for causing a computer to function as each means constituting the document analysis apparatus according to any one of claims 4 to 6.

A computer-readable storage medium storing the program according to claim 7.