JP2004013745A

JP2004013745A - Device and method for extracting document dependence

Info

Publication number: JP2004013745A
Application number: JP2002169236A
Authority: JP
Inventors: Takeshi Nagamine; 永峯　猛志; Akio Yamashita; 山下　明男; Katsunori Yoshiji; 芳地　克典
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-06-10
Filing date: 2002-06-10
Publication date: 2004-01-15

Abstract

PROBLEM TO BE SOLVED: To provide a system that extracts dependencies among a plurality of documents and supports document understanding, on the premise that a dependency can be considered to exist between documents when a word defined in one document is used in a different document.

SOLUTION: A definition attribute assigning unit 11 assigns definition attributes to each document of a document group. For instance, morphological analysis is performed to divide the document into words, and a "definition attribute" is given to a word or the like that corresponds to "A" extracted by a language pattern such as "A is B". The definition attribute assigning unit 11 passes the extracted words having the definition attribute, together with the original documents, to a reference attribute assigning unit 12. The reference attribute assigning unit 12 examines whether each of the six original documents contains a word that has the same character string as a received word having the definition attribute but has no definition attribute itself. When such a word exists, a reference attribute is given to it. A dependency link generating unit 13 creates a link from each word having the reference attribute to the word of the same character string having the definition attribute and quantitatively evaluates the links to generate the degree of dependency between the documents, which a dependency link storage unit 14 stores in a dependency link database 15.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、文書間の依存関係を抽出する技術に関する。
【０００２】
【背景の技術】
ある文書中に定義されている単語を、別の文書で使用している場合その文書間に依存関係があると考えることができる。複数の文書間の依存関係を抽出し文書理解を支援することが望まれる。
【０００３】
すなわち、人間がある文書を読む場合、その文書を読むための前提となる文書がある場合がある。たとえば、ある装置のマニュアルのある個所を読もうとしても、あらかじめその装置に関する部位名称等の用語を知らなければならない。
【０００４】
また、あるプロジェクトのある資料を読む場合に、そのプロジェクトに精通している場合は問題ないが、余り詳しくない場合、そのプロジェクト内で定義されている用語などが使われていると、その資料以外にも目を通す必要が出てくる。
【０００５】
また、大学などの授業では学生が取得したい単位の前にあらかじめ学んだほうがよい単位などがある場合がある。つまり、ある授業で使用する教材文書を読む前に、別の授業の教材文書を読んだほうがよい場合がある。
【０００６】
このような場合、そのプロジェクトで使用されている全資料や、大学で使用している全教材の文書間の依存関係を調べて、ある文書を読む前に、読むべき文書を推薦できれば文書読解を支援できる。
【０００７】
この発明は以上の要望に対処してなされたものである。
【０００８】
なお、この発明と関連する先行文献としては以下のものがある。
（１）特開平７−３２５８２７号公報：「ハイパーテキスト自動生成装置」
（２）特開平５−２２５２４７号公報：「文書間構造表示方法」
（３）特開２０００−２５９６５７公報：「用語定義の検索／収集装置」
【０００９】
（１）では、文書に含まれる単語同士のマッチングまたはシソーラスを使って同義語に展開し、単語の文字ストリング同士のマッチングに基づいてリンク（ハイパーリンク）を生成したり、リンクもとの単語から、その単語が多く出現する節のタイトルへリンクを生成することを開示している。単語間のリンクであるので文書間の依存関係を抽出できない。
【００１０】
（２）では、２つの文書に含まれる共通の単語の数をもとに文書間の関連度を求めその値をもとにグラフに色付けやリンクの太さの設定を行って文書間の関連の強さをあらわしている。同じ単語が含まれる場合、２つの文書が関連していることはいえそうであるが、どのような関連かは分からないので読むか否か判断するには難しい。
【００１１】
（３）では与えられた文書から用語とその定義部を抽出し、データベースに登録することにより検索できるようにしてある。しかし、その用語が出てきた背景や、その用語に関する例などが、実際にその用語が定義されている資料には記述されている場合が多く、ユーザに定義だけ与えるよりも、定義されている資料やそのページを与えたほうがよい場合がある。
【００１２】
【発明が解決する課題】
この発明は、以上の事情を考慮してなされたものであり、文書間の依存関係を抽出する技術を提供することを目的としている。
【００１３】
【課題を解決するための手段】
この発明によれば、上述の目的を達成するために、例えば、与えられた文書群に含まれるそれぞれの文書を構成する各単語について、定義されているか、参照（使用）されているかの属性を「定義属性」、「参照属性」として与え、参照属性を持つ単語から、定義属性を持つ同じ文字ストリングの単語へリンクを張りそれらをもとに依存度を計算する。
【００１４】
すなわち、この発明の一側面に、上述の目的を達成するために、文書依存関係抽出装置に：それぞれ１塊の文書として他の文書と区別して扱うことができる複数の文書単位を記憶する文書記憶手段と；上記文書記憶手段に記憶された各文書単位において、単語が定義を伴って出現したこと（定義属性であること）を判別する第１の判別手段と；上記文書記憶手段に記憶された各文書単位において、上記第１の判別手段により定義を伴って出現したと判別された単語が定義を伴うことなく出現したこと（参照属性であること）を判別する第２の判別手段と；定義を伴って出現した単語と、当該単語に対応し定義を伴うことなく出現した単語との対応関係に基づいて、上記定義を伴って出現した単語が判別された文書単位と、上記当該単語に対応し定義を伴うことなく出現した単語が判別された文書単位との依存関係を決定する依存関係決定手段とを設けるようにしている。
【００１５】
この構成においては、異なる文書位置において現れる同一の文字ストリングの単語間に定義−参照関係がある蓋然性があるかどうかを判別し、これに基づいて文書単位間の依存関係を簡易に決定することができる。
【００１６】
なお、シソーラス等を用いて同義語との間の「定義属性」−「参照属性」の関係を考慮して依存度を測定してもよい。「単語」は、ひろく、文章の構成要素を指し、複合語も含まれる。
【００１７】
なお、この発明は装置またはシステムとして実現できるのみでなく、方法としても実現可能である。また、そのような発明の一部をソフトウェアとして構成することができることはもちろんである。またそのようなソフトウェアをコンピュータに実行させるために用いるソフトウェア製品もこの発明の技術的な範囲に含まれることも当然である。
【００１８】
この発明の上述の側面およびこの発明の他の側面は特許請求の範囲に記載され、以下実施例を用いて詳細に説明される。
【００１９】
【発明の実施の形態】
以下、この発明の実施例について説明する。
【００２０】
［文書依存関係抽出手法］
まず、実施例で用いる文書依存関係抽出手法の原理的な説明を行う。
【００２１】
この実施例では、与えられた文書群に含まれるそれぞれの文書を構成する各単語について、定義されているか、参照（使用）されているかの属性を、「定義属性」、「参照属性」として与え、参照属性を持つ単語から、定義属性を持つ同じ文字ストリングの単語へリンクを張りそれらのもとに依存度を計算する。
【００２２】
定義属性の付与方法は、「〜とは〜である」等のように言語パタンを使用して定義されている単語を抽出し、それらに「定義属性」与えることができる。
【００２３】
また文書のレイアウト情報を用いて定義属性を付与してもよい。たとえば、文書や章のタイトルが抽出できる場合で、それが単語や複合語のみ場合は、その文書や節はその単語について説明をしている可能性が高いので定義されているとみなしそれらに「定義属性」与えてもよい。また、「項目名　文章」などのように、項目名があって、その説明文が続くようようなレイアウトがある場合はその項目名が説明されているとして定義属性を与えてもよい。
【００２４】
参照属性の付与方法は、上記の定義属性が付与された単語と同じ文字ストリングの単語で定義属性が付与されていない単語に「参照属性」を付与すればよい。参照属性を持つ単語で、定義属性を持つ同じ文字ストリングがある場合、参照属性を持つ単語から定義属性を持つ同じ文字ストリングの単語へリンクは張る。
【００２５】
ある文書の単語から別の文書の単語への上記リンクが貼られる場合、リンクの数等に基づいて文書間の依存度を表わす。ある文書間に依存度がある場合、依存度を属性値として持つ依存リンクを文書間にはる。そして上記の文書間の依存度をビジュアルに見せ文書の理解を支援する。
【００２６】
依存リンクには依存度が付与されており、ある文書を読む際に、その文書に依存する、ある依存度以上の文書のみを集めたり、依存度の高い順にランクしユーザへ提示することができる。
【００２７】
文書Ｄａから文書Ｄｂへの依存度Ｄｅｐ（Ｄａ，Ｄｂ）は以下の式で計算される。
【数１】
Ｄｅｐ（Ｄａ，Ｄｂ）＝Σｗ（Ｋａｂ［ｎ］）　（ａ≠ｂ）
【００２８】
ただし、Ｋａｂ：文書Ｄｂ中で定義属性を持つ単語で文書Ｄａ中において参照属性を持つ同じ文字ストリングの単語の集合。
Ｋａｂ［ｎ］：単語集合Ｋａｂのｎで示される単語（１≦ｎ≦Ｋａｂに含まれる単語の数）。
ｗ（Ｋａｂ［ｎ］）：単語Ｋａｂ［ｎ］の重み。
【００２９】
ｗ（Ｋａｂ［ｎ］）は単語Ｋａｂ［ｎ］が文書Ｄａに参照属性を伴って出現する回数と、その単語が文書Ｄｂ中で定義属性を伴って出現する回数から計算される。
【数２】
ｗ（Ｋａｂ［ｎ］）＝ｒｅｆ＿ｔｆ（Ｄａ，Ｋａｂ［ｎ］）＊ｒｅｆ＿ｗ＋ｄｅｆ＿ｔｆ（Ｄｂ，Ｋａｂ［ｎ］）＊ｄｅｆ＿ｗ
【００３０】
ｒｅｆ＿ｔｆ（Ｄａ，Ｋａｂ［ｎ］）は文書Ｄａ中に参照属性を持つＫａｂ［ｎ］の出現回数である。
ｄｅｆ＿ｔｆ（Ｄｂ，Ｋａｂ［ｎ］）は文書Ｄｂ中に定義属性を持つＫａｂ［ｎ］の出現回数である。
ｒｅｆ＿ｗ、ｄｅｆ＿ｗは重みで変更可能である。たとえば、単語の一般性を加味するためにｉｄｆ（ｉｎｖｅｒｔｅｄ　ｄｏｃｕｍｅｎｔ　ｆｒｅｑｕｅｎｃｙ）を与えられた文書群や、辞書やニュース記事等からあらかじめ計算しておいてもよい。下記のように重みを付加することにより、一般的な用語と思われる単語の重みを下げることができる。
【数３】
ｒｅｆ＿ｗ（Ｋａｂ［ｎ］）＝［単語Ｋａｂ［ｎ］の新聞記事１年分から得たｉｄｆ］
【００３１】
［文書登録装置］
つぎにこの実施例で用いる文書登録装置１００について説明する。この文書登録装置１００は、複数の文書を受け付けてそれら文書間の依存度を生成して登録するものである。文書登録装置１００は、例えば、スタンドアローンのパーソナルコンピュータで実現することもでき、またネットワーク上に配置されたサーバにより実現することもできる。
【００３２】
図１は、文書登録装置１００の構成を示しており、この図において、文書登録装置１００は、文書群受付部１０、定義属性付与部１１、参照属性付与部１２、依存リンク生成部１３、依存リンク保存部１４、および依存リンクデータベース１５を含んで構成されている。
【００３３】
文書群受付部１０は、ユーザが指定した文書群を受け取る。指定方法としてはあるディレクトリ以下に保存されているすべての文書などというものであるが、これに限定されない。文書群受付部１０が、図２に示すような７つの文書を受け取ったと仮定する。
【００３４】
文書群受付部１０は、受け取った文書群を定義属性付与部１１に渡す。定義属性付与部１１はそれぞれの文書に対して定義属性の付与を行う。
【００３５】
まず形態素解析を行い単語に分割する。「〜とは−である」等や「−のことを〜と呼びます」等の言語パタンで抽出された「〜」に相当する単語または複合語、またはタイトルや項目名として使用されている単語や複合語を定義されているとみなし「定義属性」を与える。
（１）文書１からは、タイトルとなっている「プロトコル」、また「〜とは−を定義したもの」の「〜」の部分にあたる「プロトコル」を抽出する。
（２）文書２からは、タイトルとなっている「ネットワーク」、「〜は−からなる」の「〜」にあたる「ネットワーク」、また「−として〜がある」の「〜」の部分にあたる「ＯＳＩ参照モデル」を抽出する。
（３）文書３からは、「〜とは−の一つで」の「〜」の部分にあたる「インターネット」を抽出する。
（４）文書４からは、タイトルとなっている「パケット通信」を抽出する。
（５）文書５からは、タイトルとなっている「パケット」を抽出する。
（６）文書６からは、タイトルとなっている「ＴＣＰ／ＩＰ」を抽出する。
（７）文書７からは、タイトルとなっている「メディア論」、「〜とは−である。」の「〜」の部分にあたる「メディア」を抽出する。
【００３６】
定義属性付与部１１は抽出した定義属性を持つ単語と、与えられた元の文書とを、参照属性付与部１２に渡す。参照属性付与部１２に渡される情報は図３に示すようなものである。なお、図３で「＊ｎ」は個数（ｎ）を表わす。
【００３７】
参照属性付与部１２は受け取った定義属性を持つ単語と同じ文字ストリングで定義属性を持たない単語が元の７つの文書にあるか否かを調べる。あった場合はその単語に参照属性を付与する。参照属性付与部１２の判別結果はつぎのようなものである
（１）文書１からはなし。
（２）文書２からは「プロトコル」が抽出される。
（３）文書３からはタイトルに含まれる「インターネット」、「ネットワーク」、「ＴＣＰ／ＩＰ」、「パケット通信」、「メディア」を抽出する。
（４）文書４からは「パケット」を抽出する。
（５）文書５からは「ＴＣＰ／ＩＰ」を抽出する。
（６）文書６からは「インターネット」、「プロトコル」、「ＯＳＩ参照モデル」、「パケット」が抽出される。
（７）文書７からはなし。
【００３８】
参照属性付与部１２は、抽出された上記の単語に参照属性を付与し、依存リンク生成部１３へ渡す。単語の参照属性は図４に示すようなものである。
【００３９】
依存リンク生成部１３は、受け取った定義属性と参照属性とからリンクを生成する。リンクの生成方法は参照属性を持つ単語から、定義属性をもつ同じ文字ストリングの単語へリンクを貼る。リンク先に自分自身が含まれるファイルとなる場合は無視する。もし、複数文書に定義属性をもつ同じ単語がある場合、それぞれにリンクを貼る。この例では図５に示すようにリンクが張られる。図５ではリンクを「−＞」で表わす。
【００４０】
リンクを生成したら次に、文書間の依存度を計算する。ここでは、ｒｅｆ＿ｗとｄｅｆ＿ｗはそれぞれ１とする。値が０となる場合は無視する。
（１）文書２から文書１へ依存度（Σｗ（Ｋａｂ［ｎ］））＝（１＊１＋２＊１）＝３
（２）文書３から文書２へ依存度＝（１＊１＋１＊１）＝２
（３）文書３から文書４へ依存度＝（１＊１＋１＊１）＝２
（４）文書３から文書６へ依存度＝（１＊１＋１＊１）＝２
（５）文書３から文書７へ依存度＝（１＊１＋１＊１）＝２
（６）文書４から文書５へ依存度＝（１＊１＋１＊１）＝２
（７）文書５から文書６へ依存度＝（２＊１＋１＊１）＝３
（８）文書６から文書１へ依存度＝（２＊１＋１＊１）＝３
（９）文書６から文書２へ依存度＝（１＊１＋１＊１）＝２
（１０）文書６から文書３へ依存度＝（１＊１＋１＊１）＝２
（１１）文書６から文書５へ依存度＝（３＊１＋１＊１）＝４
【００４１】
各文書間の依存度を依存リンク保存部１４へ渡す。依存リンク保存部１４は各文書間の依存度を依存リンクデータベース１５へ保存する。依存リンクデータベース１５に渡される依存度の情報は図６に示すようなものである。
【００４２】
以上のようにして文書登録装置１００により文書間の依存度が抽出・記憶される。
【００４３】
［依存関係提示装置］
つぎに図１の文書登録装置１００により抽出・記憶された依存度を用いて文書間の依存関係を提示する依存関係提示装置２００について説明する。この依存関係提示装置２００もスタンドアローンのパーソナルコンピュータやサーバにより構成される。文書登録装置１００と依存関係提示装置２００が１つの装置・システムを構成していてもよい。
【００４４】
図７は、この実施例の依存関係提示装置２００の構成を示しており、この図において、依存関係提示装置２００は、文書名受付部２０、依存リンク検索部２１、文書関係提示部２２、依存リンクデータベース１５等を含んで構成される。依存リンクデータベース１５は図１の依存リンクデータベースである。
【００４５】
ユーザは自分が読まなければならない（学習しなければならない）文書の文書名を文書名受付部２０に指示する。ここでは文書６を指示したとする。文書名受付部２０は依存リンク検索部２１に文書名を渡す。依存リンク検索部２１は文書６が依存している文書を依存リンクデータベース１５から検索し、検索結果を文書関係提示部２２へ渡す。文書関係提示部２２は、その結果を依存度の高い順にソートしてユーザへ提示する。例えば図８に示すように提示する。または、図９に示すように、文書６にリンクされる他の文書との関係を依存度によってリンクの線を太くするなどしてビジュアルに見せてもよい。
【００４６】
上述の文書登録装置１００および依存関係提示装置２００は、例えば、図１０に示すようにネットワーク３００上に配置されたサーバ装置４００で構成することができる。サーバ装置４００は、ウェブサーバ、アプリケーションサーバ等で構成することができる。クライアント装置５００からの要求により文書間の依存度を依存リンクデータベース１５に登録し、また提示要求に応じて依存関係を表示できる。また、学習支援やマニュアル文書等の閲覧用のアプリケーションプログラムのプロセスが依存リンクデータベース１５の情報を利用するようにしてもよい。サーバ装置３００のインストールには文書登録装置１００や依存関係提示装置２００に対応したプログラムを記録した記録媒体４０１を用いる。このプログラムはネットワークを介して外部から送られたものでもよい。
【００４７】
以上のように、この実施例によれば、ユーザがある文書を読む場合にその文書を読むための前提となる文書を提示することによりユーザの文書の読解支援を行うことができる。
【００４８】
なお、この発明は上述の実施例に限定されるものではなくその趣旨を逸脱しない範囲で種々変更が可能である。例えば、リンクを制限するために、文書群にグループ名を付与しそのグループのみにリンクを限定したり、個人設定を設け、個人が読んだ文書は、既に理解しているものとして依存度を下げることが考えられる。それはユーザによって手動で設定する場合やその文書を開いたか否かで自動で設定できる。
【００４９】
【発明の効果】
以上説明したように、この発明によれば、文書間の依存度を抽出して文書理解等の支援を簡易に行うことができる。
【図面の簡単な説明】
【図１】この発明の実施例の文書登録装置１００の構成例を示すブロック図である。
【図２】図１の文書登録装置１００の動作を説明する図である。
【図３】図１の文書登録装置１００の動作を説明する図である。
【図４】図１の文書登録装置１００の動作を説明する図である。
【図５】図１の文書登録装置１００の動作を説明する図である。
【図６】図１の文書登録装置１００の動作を説明する図である。
【図７】上述実施例の依存関係提示装置２００の構成例を示すブロック図である。
【図８】図７の依存関係提示装置２００の動作を説明する図である。
【図９】図７の依存関係提示装置２００の動作を説明する図である。
【図１０】上述文書登録装置１００および依存関係提示装置２００のサーバ装置により実装例を説明する図である。
【符号の説明】
１０　　　文書群受付部
１１　　　定義属性付与部
１２　　　参照属性付与部
１３　　　依存リンク生成部
１４　　　依存リンク保存部
１５　　　依存リンクデータベース
２０　　　文書名受付部
２１　　　依存リンク検索部
２２　　　文書関係提示部
１００　　　文書登録装置
２００　　　依存関係提示装置
３００　　　ネットワーク
４００　　　サーバ装置
４０１　　　記録媒体
５００　　　クライアント装置

[0001]
[Technical field of the invention]
The present invention relates to a technique for extracting a dependency between documents.
[0002]
[Background technology]
When words defined in one document are used in another document, it can be considered that there is a dependency between the documents. It is desired to extract dependencies between a plurality of documents to support document understanding.
[0003]
That is, when a human reads a certain document, there may be a document serving as a premise for reading the document. For example, in order to read a certain part of a manual of a certain device, it is necessary to know terms such as a part name related to the device in advance.
[0004]
Also, when reading a certain document of a project, there is no problem if the reader is familiar with the project. A reader who is less familiar with it, however, needs to look through other materials as well whenever terms defined within the project are used.
[0005]
Also, in classes at universities and the like, there may be cases where there are credits that should be learned in advance before the credits that the student wants to acquire. In other words, it may be better to read a teaching material document of another class before reading a teaching material document used in one class.
[0006]
In such cases, if the dependencies among all the materials used in the project, or among all the teaching materials used at the university, can be examined and the documents that should be read before a given document can be recommended, document comprehension can be supported.
[0007]
The present invention has been made in response to the above needs.
[0008]
Prior art documents related to the present invention include the following.
(1) Japanese Unexamined Patent Publication No. H7-325827: "Automatic hypertext generation device"
(2) Japanese Unexamined Patent Publication No. H5-225247: "Method of displaying structure between documents"
(3) Japanese Unexamined Patent Publication No. 2000-259657: "Term definition search/collection device"
[0009]
Document (1) discloses generating links (hyperlinks) based on matching between the words included in a document, or on matching between character strings of words after expanding them into synonyms with a thesaurus, and generating a link from a link-source word to the title of a section in which that word frequently appears. Since these are links between words, the dependency between documents cannot be extracted.
[0010]
In (2), the degree of relevance between documents is determined from the number of common words included in two documents, and the strength of the relation between the documents is expressed by coloring a graph and setting the thickness of its links based on that value. When the same words are included, the two documents are probably related, but the kind of relation is unknown, so it is difficult to judge whether a document should be read.
[0011]
In (3), a term and its definition part are extracted from a given document and registered in a database so that they can be searched. However, the background against which the term arose, examples concerning the term, and so on are often described in the material in which the term is actually defined, so it is sometimes better to give the user the defining material or its page rather than only the definition.
[0012]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and has as its object to provide a technique for extracting a dependency between documents.
[0013]
[Means for Solving the Problems]
According to the present invention, in order to achieve the above-mentioned object, for example, each word constituting each document included in a given document group is given an attribute indicating whether the word is defined or referenced (used), namely a "definition attribute" or a "reference attribute". A link is created from each word having the reference attribute to the word of the same character string having the definition attribute, and the degree of dependency is calculated based on these links.
[0014]
That is, according to one aspect of the present invention, in order to achieve the above-described object, a document dependency extracting apparatus is provided with: document storage means for storing a plurality of document units, each of which can be treated as a single document distinct from other documents; first determination means for determining, in each document unit stored in the document storage means, that a word appears with a definition (has the definition attribute); second determination means for determining, in each document unit stored in the document storage means, that a word determined by the first determination means to have appeared with a definition appears without a definition (has the reference attribute); and dependency determination means for determining, based on the correspondence between a word that appeared with a definition and the corresponding word that appeared without a definition, the dependency between the document unit in which the word appearing with the definition was identified and the document unit in which the corresponding word appearing without a definition was identified.
[0015]
In this configuration, it is determined whether a definition-reference relationship is likely to exist between words of the same character string appearing at different document positions, and the dependency between document units can be determined easily on that basis.
[0016]
The degree of dependency may also be measured by using a thesaurus or the like to take into account "definition attribute" to "reference attribute" relationships between synonyms. The term "word" broadly refers to a constituent element of text and includes compound words.
[0017]
The present invention can be realized not only as a device or a system but also as a method. In addition, it goes without saying that a part of such an invention can be configured as software. Also, it goes without saying that a software product used for causing a computer to execute such software is also included in the technical scope of the present invention.
[0018]
The above aspects of the present invention and other aspects of the present invention are set forth in the following claims, and will be described in detail below with reference to embodiments.
[0019]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described.
[0020]
[Document dependency extraction method]
First, the principle of the document dependency extraction method used in the embodiment will be described.
[0021]
In this embodiment, each word constituting each document included in a given document group is given an attribute indicating whether it is defined or referenced (used), namely a "definition attribute" or a "reference attribute". A link is created from each word having the reference attribute to the word of the same character string having the definition attribute, and the degree of dependency is calculated based on these links.
[0022]
As a method of assigning the definition attribute, words that are being defined can be extracted by using a language pattern such as "A is B", and the "definition attribute" can be given to them.
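The pattern-based extraction described above can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses an English regular expression as a stand-in for the Japanese language pattern and skips morphological analysis entirely. The names `DEFINITION_PATTERN` and `extract_defined_terms` are hypothetical and do not come from the publication.

```python
import re

# Hypothetical English stand-in for the "A is B" definition pattern.
# Group 1 captures the term in the "A" slot at the start of a sentence.
DEFINITION_PATTERN = re.compile(
    r"^(?:An?\s+|The\s+)?([A-Za-z][\w/ -]*?)\s+is\s+(?:an?|the)\s+",
    re.IGNORECASE,
)


def extract_defined_terms(sentences):
    """Return the terms that fill the 'A' slot of an 'A is B' sentence."""
    terms = []
    for sentence in sentences:
        match = DEFINITION_PATTERN.match(sentence.strip())
        if match:
            terms.append(match.group(1).lower())
    return terms


sentences = [
    "A protocol is a set of rules for communication.",
    "Packets travel across the network.",
]
print(extract_defined_terms(sentences))  # only the first sentence matches
```

A real implementation would need language-specific tokenization and more patterns; the single regular expression here is deliberately narrow.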
[0023]
The definition attribute may also be assigned by using document layout information. For example, when the title of a document or chapter can be extracted and it consists only of a word or compound word, the document or section is likely to be explaining that word, so the word may be regarded as defined and given the "definition attribute". Likewise, when there is a layout such as "item name, text", in which an item name is followed by its explanation, the item name may be regarded as being explained and given the definition attribute.
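A rough sketch of the layout-based rule follows. The thresholds are assumptions made for illustration only: a title of at most three tokens without sentence punctuation is treated as "a word or compound word", and an "item name: text" line is used as one possible form of the item-name layout. `terms_from_layout`, `MAX_TITLE_TOKENS` and `ITEM_LINE` are hypothetical names.

```python
import re

# Assumed heuristic: a heading of at most three tokens with no sentence
# punctuation is treated as a single word or compound word.
MAX_TITLE_TOKENS = 3
# Hypothetical "item name: text" layout, item name limited to 40 characters.
ITEM_LINE = re.compile(r"^([^:]{1,40}):\s+\S")


def terms_from_layout(title, body_lines):
    """Collect terms that the layout suggests are being explained."""
    terms = []
    tokens = title.split()
    if 0 < len(tokens) <= MAX_TITLE_TOKENS and not re.search(r"[.?!,;:]", title):
        terms.append(title.strip().lower())
    for line in body_lines:
        match = ITEM_LINE.match(line)
        if match:
            terms.append(match.group(1).strip().lower())
    return terms


print(terms_from_layout(
    "Packet communication",
    ["Header: control data placed before the payload."],
))
```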
[0024]
The "reference attribute" is assigned to any word that has the same character string as a word to which the definition attribute has been assigned but that does not itself have the definition attribute. When, for a word having the reference attribute, a word of the same character string having the definition attribute exists, a link is created from the word having the reference attribute to the word having the definition attribute.
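The reference attribute rule can be sketched as a counting step. The sketch assumes that each document is already split into word strings and that the occurrences carrying the definition attribute are known as counts per term; both the data layout and the name `reference_counts` are illustrative assumptions.

```python
from collections import Counter


def reference_counts(doc_tokens, defined_counts, all_defined_terms):
    """Count occurrences of defined terms that lack the definition attribute.

    doc_tokens: list of word strings for one document.
    defined_counts: Counter of occurrences in this document that carry the
        definition attribute.
    all_defined_terms: set of every term defined anywhere in the group.
    """
    totals = Counter(t for t in doc_tokens if t in all_defined_terms)
    refs = Counter()
    for term, total in totals.items():
        remaining = total - defined_counts.get(term, 0)
        if remaining > 0:
            refs[term] = remaining
    return refs


tokens = ["packet", "communication", "sends", "each", "packet", "separately"]
print(reference_counts(tokens, Counter(), {"packet", "protocol"}))
```

Occurrences that already carry the definition attribute are subtracted, so only the remaining occurrences count as references.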
[0025]
When such links are created from words of one document to words of another document, the degree of dependency between the documents is expressed based on the number of links and the like. When there is a dependency between documents, a dependency link having the degree of dependency as an attribute value is created between them. The degree of dependency between documents is then shown visually to support understanding of the documents.
[0026]
Because each dependency link carries a degree of dependency, when a certain document is to be read, it is possible to collect only the documents whose dependency relation to that document has at least a certain degree, or to rank the documents in descending order of dependency and present them to the user.
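Filtering by a minimum degree and ranking by degree could look like the sketch below. The document names, the degree values and the function name `select_prerequisites` are hypothetical.

```python
def select_prerequisites(degrees, threshold):
    """Keep documents at or above the threshold, highest degree first."""
    kept = {doc: deg for doc, deg in degrees.items() if deg >= threshold}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Hypothetical degrees of dependency from one document to three others.
degrees = {"manual_overview": 4, "glossary": 2, "release_notes": 1}
print(select_prerequisites(degrees, threshold=2))
```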
[0027]
The degree of dependency Dep(Da, Db) from document Da to document Db is calculated by the following equation.
(Equation 1)
Dep(Da, Db) = Σ w(Kab[n])   (a ≠ b)
[0028]
Here, Kab is the set of words that have the definition attribute in document Db and appear with the same character string and the reference attribute in document Da.
Kab[n]: the n-th word of the word set Kab (1 ≦ n ≦ number of words in Kab).
w(Kab[n]): the weight of the word Kab[n].
[0029]
w(Kab[n]) is calculated from the number of times the word Kab[n] appears with the reference attribute in document Da and the number of times it appears with the definition attribute in document Db.
(Equation 2)
w(Kab[n]) = ref_tf(Da, Kab[n]) * ref_w + def_tf(Db, Kab[n]) * def_w
[0030]
ref_tf(Da, Kab[n]) is the number of occurrences of Kab[n] with the reference attribute in document Da.
def_tf(Db, Kab[n]) is the number of occurrences of Kab[n] with the definition attribute in document Db.
ref_w and def_w are weights and can be changed. For example, to take the generality of a word into account, idf (inverse document frequency) may be calculated in advance from the given document group, a dictionary, news articles, or the like. Assigning weights as shown below lowers the weight of words that are likely to be general terms.
(Equation 3)
ref_w(Kab[n]) = [idf of the word Kab[n] obtained from one year of newspaper articles]
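Equations 1 to 3 can be sketched in a few lines. The function name `dep`, the dictionary-based inputs and the idf value used in the usage example are assumptions for illustration; only the formula itself comes from the text. The weights may be plain numbers or per-term lookups, which covers the idf variant of Equation 3.

```python
def dep(ref_counts_a, def_counts_b, ref_w=1.0, def_w=1.0):
    """Dep(Da, Db): sum of w(k) over terms referenced in Da and defined in Db.

    ref_counts_a: term -> ref_tf(Da, term), positive counts only.
    def_counts_b: term -> def_tf(Db, term), positive counts only.
    ref_w, def_w: a number, or a callable term -> weight (e.g. an idf lookup).
    """
    def weight(w, term):
        return w(term) if callable(w) else w

    shared = set(ref_counts_a) & set(def_counts_b)  # the word set Kab
    return sum(
        ref_counts_a[k] * weight(ref_w, k) + def_counts_b[k] * weight(def_w, k)
        for k in shared
    )


# One reference in Da and two definitions in Db, all weights set to 1.
print(dep({"protocol": 1}, {"protocol": 2}))

# Hypothetical idf table used as a per-term reference weight.
idf = {"protocol": 0.5}
print(dep({"protocol": 1}, {"protocol": 2}, ref_w=lambda term: idf[term]))
```

With unit weights the first call evaluates 1 * 1 + 2 * 1; the second call shows how a low idf reduces the contribution of a common term.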
[0031]
[Document Registration Device]
Next, the document registration device 100 used in this embodiment will be described. The document registration device 100 receives a plurality of documents, generates the degree of dependency between them, and registers it. The document registration device 100 can be realized by, for example, a stand-alone personal computer, or by a server arranged on a network.
[0032]
FIG. 1 shows the configuration of the document registration device 100. In this figure, the document registration device 100 includes a document group receiving unit 10, a definition attribute assigning unit 11, a reference attribute assigning unit 12, a dependency link generation unit 13, a dependency link storage unit 14, and a dependency link database 15.
[0033]
The document group receiving unit 10 receives a document group specified by the user. The group may be specified as, for example, all the documents stored under a certain directory, but the specification method is not limited to this. It is assumed here that the document group receiving unit 10 has received seven documents as shown in FIG. 2.
[0034]
The document group receiving unit 10 passes the received document group to the definition attribute providing unit 11. The definition attribute assigning unit 11 assigns a definition attribute to each document.
[0035]
First, morphological analysis is performed to divide each document into words. A word or compound word corresponding to "A" in a language pattern such as "A is B" or "B is called A", or a word or compound word used as a title or item name, is regarded as being defined and is given the "definition attribute".
(1) From document 1, "protocol", which is the title, and "protocol", which corresponds to "A" in "A is a definition of B", are extracted.
(2) From document 2, "network", which is the title, "network", which corresponds to "A" in "A consists of B", and "OSI reference model", which corresponds to "A" in "as B, there is A", are extracted.
(3) From document 3, "Internet", which corresponds to "A" in "A is one of B", is extracted.
(4) From document 4, "packet communication", which is the title, is extracted.
(5) From document 5, "packet", which is the title, is extracted.
(6) From document 6, "TCP/IP", which is the title, is extracted.
(7) From document 7, "media theory", which is the title, and "media", which corresponds to "A" in "A is B", are extracted.
[0036]
The definition attribute assigning unit 11 passes the extracted words having the definition attribute, together with the given original documents, to the reference attribute assigning unit 12. The information passed to the reference attribute assigning unit 12 is as shown in FIG. 3. In FIG. 3, "*n" represents the number of occurrences (n).
[0037]
The reference attribute assigning unit 12 checks whether the original seven documents contain a word that has the same character string as a received word having the definition attribute but does not itself have the definition attribute. If so, the reference attribute is assigned to that word. The determination results of the reference attribute assigning unit 12 are as follows.
(1) None from document 1.
(2) "Protocol" is extracted from document 2.
(3) "Internet", which appears in the title, "network", "TCP/IP", "packet communication", and "media" are extracted from document 3.
(4) "Packet" is extracted from document 4.
(5) "TCP/IP" is extracted from document 5.
(6) "Internet", "protocol", "OSI reference model", and "packet" are extracted from document 6.
(7) None from document 7.
[0038]
The reference attribute assigning unit 12 assigns the reference attribute to the words extracted above and passes them to the dependency link generation unit 13. The reference attributes of the words are as shown in FIG. 4.
[0039]
The dependency link generation unit 13 generates links from the received definition attributes and reference attributes. A link is created from each word having the reference attribute to the word of the same character string having the definition attribute. A link whose destination lies in the same file as its source word is ignored. If the same word has the definition attribute in a plurality of documents, a link is created to each of them. In this example, links are created as shown in FIG. 5, where a link is represented by "->".
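The link generation rule can be sketched with the terms listed for the seven example documents. The dictionary layout and the function name `generate_links` are assumptions; occurrence counts are left out, so the sketch only shows which document pairs become linked.

```python
# Terms with the definition attribute per document, as listed in the embodiment.
defined = {
    1: {"protocol"},
    2: {"network", "OSI reference model"},
    3: {"Internet"},
    4: {"packet communication"},
    5: {"packet"},
    6: {"TCP/IP"},
    7: {"media theory", "media"},
}
# Terms found by the reference attribute assigning unit per document.
referenced = {
    1: set(),
    2: {"protocol"},
    3: {"Internet", "network", "TCP/IP", "packet communication", "media"},
    4: {"packet"},
    5: {"TCP/IP"},
    6: {"Internet", "protocol", "OSI reference model", "packet"},
    7: set(),
}


def generate_links(defined, referenced):
    """Return (source_doc, term, target_doc) links from references to definitions."""
    links = []
    for src, terms in referenced.items():
        for term in sorted(terms):
            for dst, dst_terms in defined.items():
                # Links into the source document itself are ignored.
                if term in dst_terms and dst != src:
                    links.append((src, term, dst))
    return links


links = generate_links(defined, referenced)
doc_pairs = sorted({(src, dst) for src, _, dst in links})
print(doc_pairs)
```

The "Internet" reference in document 3 produces no link because its only definition is in document 3 itself.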
[0040]
After the links are generated, the degree of dependency between documents is calculated. Here, ref_w and def_w are each set to 1. Document pairs whose value is 0 are ignored.
(1) Dependency from document 2 to document 1 (Σw(Kab[n])) = (1 * 1 + 2 * 1) = 3
(2) Dependency from document 3 to document 2 = (1 * 1 + 1 * 1) = 2
(3) Dependency from document 3 to document 4 = (1 * 1 + 1 * 1) = 2
(4) Dependency from document 3 to document 6 = (1 * 1 + 1 * 1) = 2
(5) Dependency from document 3 to document 7 = (1 * 1 + 1 * 1) = 2
(6) Dependency from document 4 to document 5 = (1 * 1 + 1 * 1) = 2
(7) Dependency from document 5 to document 6 = (2 * 1 + 1 * 1) = 3
(8) Dependency from document 6 to document 1 = (2 * 1 + 1 * 1) = 3
(9) Dependency from document 6 to document 2 = (1 * 1 + 1 * 1) = 2
(10) Dependency from document 6 to document 3 = (1 * 1 + 1 * 1) = 2
(11) Dependency from document 6 to document 5 = (3 * 1 + 1 * 1) = 4
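The arithmetic in the list above can be re-checked mechanically. The sketch reads each parenthesized expression as ref_tf * ref_w + def_tf * def_w for a single shared term, which is an interpretation of the list rather than something the text states explicitly; the occurrence counts are copied from the list, not recomputed from the figures.

```python
# (source, target): (ref_tf, def_tf), the counts that appear in the list above.
counts = {
    (2, 1): (1, 2), (3, 2): (1, 1), (3, 4): (1, 1), (3, 6): (1, 1),
    (3, 7): (1, 1), (4, 5): (1, 1), (5, 6): (2, 1), (6, 1): (2, 1),
    (6, 2): (1, 1), (6, 3): (1, 1), (6, 5): (3, 1),
}
REF_W = DEF_W = 1  # both weights are set to 1 in this example

degrees = {
    pair: ref_tf * REF_W + def_tf * DEF_W
    for pair, (ref_tf, def_tf) in counts.items()
}
for (src, dst), degree in sorted(degrees.items()):
    print(f"document {src} -> document {dst}: {degree}")
```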
[0041]
The degree of dependency between each pair of documents is passed to the dependency link storage unit 14, which stores it in the dependency link database 15. The dependency information passed to the dependency link database 15 is as shown in FIG. 6.
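One way to picture the storage step is a single table keyed by source and target document. The publication does not specify a storage format, so the use of SQLite, the table name `dependency_link` and its columns are all assumptions made for this sketch.

```python
import sqlite3

# In-memory database standing in for the dependency link database 15.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency_link ("
    " source_doc TEXT, target_doc TEXT, degree REAL,"
    " PRIMARY KEY (source_doc, target_doc))"
)


def save_links(conn, degrees):
    """Store one row per (source, target) pair, replacing older values."""
    rows = [(src, dst, deg) for (src, dst), deg in degrees.items()]
    conn.executemany(
        "INSERT OR REPLACE INTO dependency_link VALUES (?, ?, ?)", rows
    )
    conn.commit()


save_links(conn, {("document 6", "document 5"): 4, ("document 6", "document 1"): 3})
print(conn.execute("SELECT COUNT(*) FROM dependency_link").fetchone()[0])
```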
[0042]
As described above, the dependency between documents is extracted and stored by the document registration device 100.
[0043]
[Dependency presentation device]
Next, a dependency presentation device 200, which presents the dependencies between documents by using the degrees of dependency extracted and stored by the document registration device 100 of FIG. 1, will be described. The dependency presentation device 200 can likewise be configured as a stand-alone personal computer or a server. The document registration device 100 and the dependency presentation device 200 may together constitute a single device or system.
[0044]
FIG. 7 shows the configuration of the dependency presentation device 200 of this embodiment. In this figure, the dependency presentation device 200 includes a document name receiving unit 20, a dependency link search unit 21, a document relationship presentation unit 22, the dependency link database 15, and the like. The dependency link database 15 is the dependency link database of FIG. 1.
[0045]
The user specifies to the document name receiving unit 20 the name of the document that the user has to read (study). Here, it is assumed that document 6 is specified. The document name receiving unit 20 passes the document name to the dependency link search unit 21. The dependency link search unit 21 searches the dependency link database 15 for the documents on which document 6 depends, and passes the search results to the document relationship presentation unit 22. The document relationship presentation unit 22 sorts the results in descending order of the degree of dependency and presents them to the user, for example as shown in FIG. 8. Alternatively, as shown in FIG. 9, the relationships between document 6 and the documents linked to it may be shown visually, for example by drawing the link lines thicker according to the degree of dependency.
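The lookup and sorting for document 6 can be sketched as below. A plain dictionary stands in for the dependency link database 15, the degree values are taken from the example calculation in the registration section, and the alphabetical tie-break between equal degrees is an assumption, as is the function name `prerequisites`.

```python
# Dictionary standing in for the dependency link database 15.
dependency_db = {
    ("document 6", "document 1"): 3,
    ("document 6", "document 2"): 2,
    ("document 6", "document 3"): 2,
    ("document 6", "document 5"): 4,
    ("document 5", "document 6"): 3,
}


def prerequisites(db, document):
    """Documents that `document` depends on, highest degree first."""
    found = [(dst, deg) for (src, dst), deg in db.items() if src == document]
    # Ties are broken by document name (an assumption of this sketch).
    return sorted(found, key=lambda item: (-item[1], item[0]))


for doc, degree in prerequisites(dependency_db, "document 6"):
    print(doc, degree)
```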
[0046]
The above-described document registration device 100 and dependency presentation device 200 can be implemented by, for example, a server device 400 arranged on a network 300 as shown in FIG. 10. The server device 400 can be configured as a web server, an application server, or the like. In response to a request from a client device 500, the degree of dependency between documents can be registered in the dependency link database 15, and dependencies can be displayed in response to a presentation request. In addition, a process of an application program for learning support or for browsing manual documents and the like may use the information in the dependency link database 15. The server device 400 is installed by using a recording medium 401 on which programs corresponding to the document registration device 100 and the dependency presentation device 200 are recorded. These programs may also be sent from outside via a network.
[0047]
As described above, according to this embodiment, when a user reads a certain document, the user's reading of that document can be supported by presenting the documents that are prerequisites for reading it.
[0048]
The present invention is not limited to the above-described embodiment, and various changes can be made without departing from its gist. For example, to restrict links, a group name may be assigned to a group of documents so that links are limited to that group, or personal settings may be provided so that the degree of dependency on documents an individual has already read is lowered on the assumption that they are already understood. Whether a document has been read can be set manually by the user or automatically according to whether the document has been opened.
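The two variations mentioned above, restricting links to a group and lowering the degree for documents already read, could be sketched as follows. The publication does not say by how much the degree is lowered, so the `read_factor` of 0.5, the group mapping and the function name `adjust` are purely illustrative assumptions.

```python
def adjust(degrees, doc_group, read_docs, allowed_group=None, read_factor=0.5):
    """Apply group restriction and a reduction for already-read documents.

    degrees: (source, target) -> degree of dependency.
    doc_group: document -> group name.
    read_docs: set of documents the user has already read.
    read_factor: assumed multiplier applied when the target was already read.
    """
    adjusted = {}
    for (src, dst), deg in degrees.items():
        if allowed_group is not None and (
            doc_group.get(src) != allowed_group
            or doc_group.get(dst) != allowed_group
        ):
            continue  # the link leaves the allowed group
        adjusted[(src, dst)] = deg * read_factor if dst in read_docs else deg
    return adjusted


degrees = {("d6", "d5"): 4, ("d6", "d1"): 3, ("d3", "d7"): 2}
groups = {"d1": "network", "d5": "network", "d6": "network",
          "d3": "network", "d7": "media"}
print(adjust(degrees, groups, read_docs={"d1"}, allowed_group="network"))
```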
[0049]
[Effects of the invention]
As described above, according to the present invention, it is possible to easily support the understanding of a document by extracting the dependency between documents.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of a document registration device 100 according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the operation of the document registration device 100 of FIG. 1.
FIG. 3 is a diagram illustrating the operation of the document registration device 100 of FIG. 1.
FIG. 4 is a diagram illustrating the operation of the document registration device 100 of FIG. 1.
FIG. 5 is a diagram illustrating the operation of the document registration device 100 of FIG. 1.
FIG. 6 is a diagram illustrating the operation of the document registration device 100 of FIG. 1.
FIG. 7 is a block diagram illustrating a configuration example of the dependency presentation device 200 of the above embodiment.
FIG. 8 is a diagram illustrating the operation of the dependency presentation device 200 of FIG. 7.
FIG. 9 is a diagram illustrating the operation of the dependency presentation device 200 of FIG. 7.
FIG. 10 is a diagram illustrating an implementation example of the document registration device 100 and the dependency presentation device 200 using a server device.
[Explanation of symbols]
10 Document group receiving unit
11 Definition attribute assigning unit
12 Reference attribute assigning unit
13 Dependency link generation unit
14 Dependency link storage unit
15 Dependency link database
20 Document name receiving unit
21 Dependency link search unit
22 Document relationship presentation unit
100 Document registration device
200 Dependency presentation device
300 Network
400 Server device
401 Recording medium
500 Client device

Claims

1. A document dependency extracting apparatus comprising:
document storage means for storing a plurality of document units, each of which can be treated as a single document distinct from other documents;
first determination means for determining that a word appears with a definition in each document unit stored in the document storage means;
second determination means for determining, in each document unit stored in the document storage means, that a word determined by the first determination means to have appeared with a definition appears without a definition; and
dependency determination means for determining, based on the correspondence between a word that appeared with a definition and the corresponding word that appeared without a definition, a dependency between the document unit in which the word appearing with the definition was identified and the document unit in which the corresponding word appearing without a definition was identified.

2. The document dependency extracting apparatus according to claim 1, wherein the first determining means determines that the word appears with a definition based on a sentence pattern.

3. The document dependency extracting apparatus according to claim 2, wherein the pattern of the sentence is a pattern of the form "A is B".

4. The document dependency extracting apparatus according to claim 1, wherein the first determination unit determines that a word appears with a definition based on a predetermined layout in the document unit.

5. The document dependency extracting apparatus according to claim 4, wherein the predetermined layout is a layout in which a title of a document or a chapter is a word or a compound word.

6. The document dependency extracting apparatus according to claim 4, wherein the predetermined layout is a layout having an item name and a description following the item name, such as "item name, text".

Dependency storing means for storing a dependency between document units extracted by the document dependency extracting apparatus according to any one of claims 1 to 6,
Means for specifying a document unit that the user is using for access;
Means for extracting a dependency related to the specified document unit by referring to the dependency storage means based on the specified document unit;
Means for displaying the taken-out dependency degree.

A document storing step of storing a plurality of document units each of which can be treated as one lump document separately from other documents;
A first determination step of determining that a word has appeared with a definition in each document unit stored in the document storage step;
A second determination step of determining, in each document unit stored in the document storage step, that a word determined to have appeared with a definition in the first determination step has appeared without a definition;
A dependency determining step of determining, based on the correspondence between a word that appeared with a definition and the corresponding word that appeared without a definition, a dependency between the document unit in which the word appearing with the definition was identified and the document unit in which the corresponding word appearing without a definition was identified.

A document storing step of storing a plurality of document units each of which can be treated as one lump document separately from other documents;
A first determination step of determining that a word has appeared with a definition in each document unit stored in the document storage step;
A second determination step of determining, in each document unit stored in the document storage step, that a word determined to have appeared with a definition in the first determination step has appeared without a definition;
A dependency determining step of determining, based on the correspondence between a word that appeared with a definition and the corresponding word that appeared without a definition, a dependency between the document unit in which the word appearing with the definition was identified and the document unit in which the corresponding word appearing without a definition was identified.