JP3894428B2 - Information extraction method, information retrieval method, and information extraction computer program

Info

Publication number: JP3894428B2
Application number: JP2002042820A
Authority: JP
Inventors: 毅也藤井
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2002-02-20
Filing date: 2002-02-20
Publication date: 2007-03-22
Anticipated expiration: 2022-02-20
Also published as: JP2003242166A

Description

【０００１】
【発明の属する技術分野】
本発明は情報抽出方法、情報検索方法及び情報抽出コンピュータプログラムに係り、特に情報をノードとノード間のリンクとで構成したハイパードキュメントシステムから情報を抽出する情報抽出方法、その装置を使用した情報検索方法及び情報抽出コンピュータプログラムに関する。
【０００２】
【従来の技術】
ハイパードキュメントシステムでは、取り扱われるテキスト情報を幾つかの小さな単位（ノード）に分割し、それらを関連付けて整理している。このような関連付けを示す情報を、リンクと呼ぶ。
例えば、インターネット上では、ＷＷＷ(World Wide Web)により、ハイパードキュメントシステムが構築されている。
ＷＷＷの情報は、ＨＴＭＬ(Hyper Text Markup Language)で記述されている。このＨＴＭＬは、ノード間のリンクに意味的制約がない。
このノード間のリンクに意味的制約を持たないシステムには、ドキュメントオーサ（作者）が意のままにコンテンツとリンク構造を決めることが出来るという利点がある。
【０００３】
そして、このようなハイパードキュメントシステムにより、ドキュメントリーダ（読者）は、ドキュメントオーサの構築したリンク構造をたどり、そのドキュメントオーサが提供する全ての情報に対してアクセス出来る。
【０００４】
しかしドキュメントリーダが、自身が欲した情報が含まれているノードを発見したとしても、リンク間には意味的な制約がないため、リンク先のノードに関連情報が含まれているかを確認するためには、実際にリンクを辿ってチェックしなければならない。
ところが、ハイパードキュメントシステムの情報量は膨大である。そのため、ドキュメントリーダが必要な情報を見付け出すには、情報検索を支援するシステムが必要である。
【０００５】
そのような従来の情報検索を支援するシステム（特開平１１−３３４７号公報の情報抽出装置）について、図と共に以下に説明する。
図１０は従来の情報抽出装置の一例のブロック構成を示したものである。ここで、ドキュメントオーサが、ある情報を３つのノード７０１〜７０３に分割して、ドキュメントを作成した場合を考える。
この例では、ノード７０１から２つのノード７０２，７０３に対してリンクが張られている。そして、情報抽出装置７１０に対して、ノード７０１が起点ノードとして入力される。
【０００６】
起点ノード特徴抽出手段７１１は、起点ノード７０１が入力されると、この起点ノード７０１の内容を解析し、起点ノード７０１の特徴を抽出する。
抽出した特徴は起点ノード特徴プロファイル７１１ａとして類似性判定手段７１４に供給される。ここで、ノードの特徴に関する情報とは、そのノードの内容を特徴付ける単語とその重要度を示す値の対の集合とする。
例えば、起点ノードに出現する各単語に関する出現頻度、出現位置及び品詞の情報に基づいて重み付けすることにより、起点ノード特徴プロファイル７１１ａを作成する。
【０００７】
２次ノード取得手段７１２は、入力された起点ノード７０１からリンクが張られたノード７０２，７０３を２次ノードとして取得する。取得した各２次ノードは、２次ノード特徴抽出手段７１３に供給される。
【０００８】
この２次ノード特徴抽出手段７１３は、２次ノード取得手段７１２が取得した２次ノードの内容を解析し、各２次ノードの特徴を抽出する。その抽出した特徴は、２次ノード特徴プロファイル７１３ａとして類似性判定手段７１４に供給される。
２次ノード特徴抽出手段７１３も起点ノード特徴抽出手段７１１と同様に、２次ノードに出現する各単語に関する出現頻度、出現位置及び品詞の情報に基づいて重み付けを行い、２次ノード特徴プロファイル７１３ａを作成する。
【００１０】
これにより、起点ノードに対して、その起点ノードに類似する２次ノードを合成した合成ノード７０４が得られる。
なお、この情報抽出装置７１０に対する起点ノードの入力は、例えば、従来のハイパードキュメントの情報検索において予めノードをランダムにスキャンして得られたノードを、起点ノードとして入力する。
この場合、情報抽出装置７１０から出力される合成ノード７０４が検索対象となる。
【００１１】
このような構成の従来の情報抽出装置７１０によって行われる処理を、以下に説明する。まず特徴抽出の処理手順について説明する。
【００１２】
図１１は、特徴抽出の処理の流れを示すフローチャートである。
このフローチャートでは、起点ノードを受け取った起点ノード特徴抽出手段７１１の処理として説明するが、２次ノードを受け取った２次ノード特徴抽出手段７１３が行う処理も同様の処理であり、以下、各処理をステップ番号（ステップＳ８０１〜Ｓ８０６）に従って説明する。
【００１３】
〔ノード入力ステップＳ８０１〕
起点ノード７０１が与えられ、その情報ソースが起点ノード特徴抽出手段７１１に入力される。
〔タグの除去ステップＳ８０２〕
情報ソースから、ハイパードキュメントシステム記述言語（例えば、ＨＴＭＬ）で定義されたタグを除去する。
〔単語抽出ステップＳ８０３〕
既知の形態素解析技術を用いて、残されたテキストから単語を抽出する。
【００１４】
〔重要単語の抽出ステップＳ８０４〕
ステップＳ８０３で得られた単語の集合から重要単語だけを抽出する。
ここで、重要単語とは情報ソースの内容を特徴付けている単語であり、名詞だけを重要単語とするといった方法により重要単語の抽出を行うことが出来る。
【００１５】
〔重要単語の重み付けステップＳ８０５〕
ステップＳ８０４で得られた重要単語に対して、出現頻度や出現位置を考慮して、重み付けを行う。
すなわち、出現頻度の高い単語ほど重要度を高くし、また、出現位置が文書の先頭に近い程その単語の重要度を高くする。
〔特徴プロファイル作成ステップＳ８０６〕
最後に、重要単語とその重みとの組からなるリストを作成し、これを起点ノード特徴プロファイル７１１ａとする。
【００１６】
このようにして得られた、起点ノード７０１の特徴プロファイル７１１ａ（単数）は、類似性判定手段７１４に供給される。
また、起点ノード７０１は、２次ノード取得手段７１２にも供給されており、２次ノード取得手段７１２は、受け取ったノード７０１の情報ソースに含まれるリンク情報を検索し、そのリンク先のノードを２次ノード７０２として取得する。
例えば、起点ノードがＨＴＭＬで作成されていれば、アンカータグ（＜Ａ＞．．．＜／Ａ＞）で囲まれた領域内のＵＲＬ(Uniform Resource Locator)を抽出し、そのＵＲＬで指定された文書（２次ノード）を取得する。
【００１７】
２次ノード取得手段７１２が取得した２次ノードの集合は、２次ノード特徴抽出手段７１３に供給される。
そして、２次ノード特徴抽出手段７１３によって、図１１に示したフローチャートと同様の処理が実行され、各２次ノード７０２，７０３に対する２次ノード特徴プロファイル７１３ａ（複数）が作成される。
【００１８】
その２次ノードの特徴プロファイル７１３ａは、類似性判定手段７１４に供給される。
この類似性判定手段７１４には、上記した起点ノード特徴プロファイル７１１ａと複数の２次ノード特徴プロファイル７１３ａとが供給される。
【００１９】
つぎに、図１０に示した類似性判定手段７１４によって実行される類似性処理の流れの具体例を説明する。
図１２は類似性判定手段７１４の処理手順を示したフローチャートである。以下の類似性処理を順次説明する。
【００２０】
〔初期化ステップＳ９１１〕
変数ｎに対して、ｎ＝１という初期化を行う。また、２次ノード取得手段７１２が取得した２次ノードの数を変数ｍに代入する。ここで、２次ノード特徴プロファイル７１３ａは、１〜ｍの順番に並べられる。
【００２１】
〔起点ノードの特徴プロファイル取得ステップＳ９１２〕
起点ノードの特徴プロファイル７１１ａを取得する。
〔ｎはｍより大きいかステップＳ９１３〕
ｎとｍの大小を比較して、ｎ＞ｍであれば（Ｓ９１３ＹＥＳ）、処理を終了し、ｎ＞ｍでなければ（Ｓ９１３ＮＯ）、ステップＳ９１４に進む。
【００２２】
〔ｎ番目の２次ノードの特徴プロファイル取得ステップＳ９１４〕
ｎ番目の２次ノードの特徴プロファイル７１３ａを取得する。
〔特徴プロファイルの類似度計算ステップＳ９１５〕
ステップＳ９１２で得られた起点ノード特徴プロファイル７１１ａとステップＳ９１４で得られた２次ノード特徴プロファイル７１３ａとの類似度を計算する（類似性判定手段７１４）。この類似度の計算には、既知のベクトル内積演算手法を用いる。
【００２３】
〔類似度は閾値より大きいかステップＳ９１６〕
ステップＳ９１５で得られた類似度の値と閾値を比較して、
類似度＞閾値であれば（Ｓ９１６ＹＥＳ）、ステップＳ９１７に進み、
類似度＞閾値でなければ（Ｓ９１６ＮＯ）、ステップＳ９１８に進む。
ここで、閾値は予め設定された値であり、その大小で類似性の許容範囲を調整する。
類似度をベクトル内積演算手法で計算した場合には、閾値としては、例えば０．１程度の値を設定する。
【００２４】
〔ｎ番目の２次ノードを合成候補とするステップＳ９１７〕
類似度が閾値より大きければ、ｎ番目の２次ノードを起点ノードへ合成するノードの候補として記憶する。
〔ｎの値に１を加算するステップＳ９１８〕
ｎの値に１を加算してｎ＋１とし、ステップＳ９１３に進む。
【００２５】
これにより、順番が１〜ｍの各２次ノード特徴プロファイル７１３ａについて、起点ノード特徴プロファイル７１１ａとの間の類似性の有無が判定される。
そして、ステップＳ９１７において、合成するノードの候補として記憶された２次ノードの集合と、起点ノードとがノード合成手段７１５に供給される。
最後に、ノード合成手段７１５が、起点ノードに、類似性有りと判定された全ての２次ノードを合成し、合成ノード７０４とする。
【００２６】
このようにして得られた合成ノード７０４を情報検索の対象とすれば、例えば「概念Ａ」に類似する情報を検索した場合に、ノード７０１単独では「概念Ａ」と非類似であっても、ノード７０１と２次ノードとを合成した合成ノードと「概念Ａ」とが類似していれば、検索結果としてノード７０１を得ることが出来る。ノード７０１を得たユーザは、そのノード７０１からリンクをたどり、目的の「概念Ａ」に類似する情報の全てにアクセス出来る。
【００２７】
なお、起点ノード特徴抽出手段７１１若しくは２次ノード特徴抽出手段７１３が重要単語の重み付けをする際に、ＨＴＭＬなどのタグにより強調されている文字を、重要度の高い単語とすることも出来る。それには、図１１の処理の順番を入れ替え、ステップＳ８０２の処理を、ステップＳ８０５とステップＳ８０６との間で行う必要がある。
【００２８】
さらに、ステップＳ８０３の単語抽出処理においては、文字を強調するためのタグと、そのタグの中に記載されている文字とは、分離せずに抽出する。
文字を強調するタグに囲まれた領域から複数の単語を抽出する際には、それぞれの単語に対して、タグの情報を付加しておく。これにより、重要単語の重み付けをする際に、どの単語が強調表示されていたのかを識別することが出来る。
【００２９】
以上に説明したように、この従来例は、ハイパーテキストの各ノードから、情報の類似度を比較判定出来る特徴プロファイルを生成し、ドキュメントリーダの検索要請に合わせて、情報間の類似性を判定して、ノードを合成して合成ノードを作成し、合成ノード７０４を出力することで、意味的な纏まりを有する情報の単位で抽出しようとしたものである。
【００３０】
【発明が解決しようとする課題】
従来においては、検索時に類似性有りと判定された全ての２次ノードを起点ノードに合成して合成ノードとして情報抽出を行う方法では、たとえ時系列に沿って変化するハイパーテキストを複数個保存するような改良が加えられたとしても、記憶容量及び情報抽出時の演算量の増加を招いて、読むのにも時間が掛かり、効率が悪い。
【００３１】
また、従来においては、ハイパーテキスト全体は静的な構成を仮定していた。しかし、インターネットなどに置かれたハイパーテキストは静的なドキュメントシステムではなく、各ノードは複数のドキュメントオーサによって自在に変更され、またノードを指定するリンクも自在に追加、削除されるものである。
特に、ハイパーテキストに容易に埋め込むことの出来るリンクは、他のノードを名指しての誹謗中傷や、逆に偏見や差別を記述したノードに対して指摘や訂正を求める際に、重要な役割を持つ。
【００３２】
これらに該当するリンクは、リンク元もしくはリンク先のドキュメントオーサの恣意的な変更によって、該当部分が削除もしくは訂正され、リンクの意味がなくなることが多い。
このような例では、静的なハイパーテキストを前提とした情報抽出装置は効果がない。よって、上記の合成ノードによって結果を得る方法では、ドキュメントオーサによるリンク部分の削除もしくは訂正により関連性が失われ、結果的に話題の時系列が破壊されてしまう。
【００３３】
具体的な一例を挙げて説明する。
あるノードＢのドキュメント管理者Ｂが、あるノードＡのドキュメント管理者Ａを誹謗中傷する文章を悪意をもって作成し、それをハイパーテキストとして公開したとしても、その誹謗中傷文章に対する非難及び訂正要望がドキュメント管理者Ａからドキュメント管理者Ｂへ届いた時点で、ドキュメント管理者Ｂは文章の一部を改竄して何事もなかったかのように振舞ったり、巧妙に改竄することでドキュメント管理者Ａを謂れなき非難を行う者としておとしめることが可能である。
【００３４】
このような例では、従来のようなハイパーテキストを個々のノードに関して最新のものを１つだけ保存するタイプの情報抽出装置（検索エンジン）では、取得時刻や更新時刻は保存していても、時系列変化を伴うハイパーテキストは保存されていないため、以前に溯って話題の変遷を追うことは不可能である。
【００３５】
そこで本発明は上記の課題を解決するためになされたものであり、特に、ノード分析ステップ、最適テキスト片選択ステップ、データベースステップを有し、最適テキスト片選択ステップによりリンクに関連する２次ノードのテキストの最適な一部分のみを切り出してデータベースステップに抽出保存し、そのデータベースステップよりそれらを検索するようにして、特定のキーワードのハイパーテキストを継続的に監視するような場合も、取得時刻毎にこのリンクに関連するテキスト付きのテキスト全文を保存するものよりも、記憶領域を大幅に節約することが出来、よって、その時系列変化を伴うデータであってもその内容も短時間で理解出来るようにすることを目的とするものである。
【００３６】
【課題を解決するための手段】
上記課題を解決するために、
請求項１に記載された発明は、
テキストとこのテキスト中に記されているリンクとよりなるノードを複数有し、前記リンクにはそのリンク先の２次ノードを指し示す２次ノードパス名が記されて前記各ノードの繋がりを示すように構成されたハイパーテキストからコンピュータが情報を抽出する情報抽出方法において、
検索キーワードに対応するノードを指し示すノードパス名が入力されると、前記ノードのテキスト及びリンクを解析し、前記２次ノードパス名と前記テキスト中のリンクの近傍に記載されているテキスト部分を切り出したリンク近傍テキストとを組とする入力テーブルを出力するノード分析ステップ１１と、
前記２次ノードにアクセスし、前記２次ノードのテキストを取得する２次ノード取得ステップ１４と、
前記２次ノードのテキストを所定の大きさのテキスト片群に分割するテキスト分割ステップ１５と、
前記リンク近傍テキストに対する前記分割した各テキスト片それぞれの類似度を計算し、前記各テキスト片と前記類似度とを組とする類似度テーブルを出力する類似度計算ステップ１６と、
前記テキスト片群の中から最も類似している類似度のテキスト片を選択する最適テキスト片選択ステップ１７と、
前記入力テーブルを順次解析し、各々の前記２次ノードについて前記最適テキスト片選択ステップより選択される前記各２次ノードの最適テキスト片を、現在時刻、前記ノードパス名、前記２次ノードパス名、及び前記リンク近傍テキストと共に夫々抽出してデータベース化するデータベースステップ１３と
を有することを特徴とする情報抽出方法を提供し、
請求項２に記載された発明は、
請求項１に記載された情報抽出方法において、
前記類似度計算ステップ１６は、テキストを形態素群に分割し、前記形態素群をベクトルと見なし、各文章に対応するベクトルのコサイン・シータを類似度と見なすことで、前記類似度を０以上かつ１以下の範囲とし、
前記最適テキスト片選択ステップ１７は、前記類似度テーブルより、類似度が所定の範囲にあるテキスト片のうち最大の類似度を有するテキスト片を選択するようにしたことを特徴とする情報抽出方法を提供し、
請求項３に記載された発明は、
テキストとこのテキスト中に記されているリンクとよりなるノードを複数有し、前記リンクにはそのリンク先の２次ノードを指し示す２次ノードパス名が記されて前記各ノードの繋がりを示すように構成されたハイパーテキストからコンピュータが情報を検索する情報検索方法において、
検索キーワードに対応するノードを指し示すノードパス名が入力されると、前記ノードのテキスト及びリンクを解析し、前記２次ノードパス名と前記テキスト中のリンクの近傍に記載されているテキスト部分を切り出したリンク近傍テキストとを組とする入力テーブルを出力するノード分析手段１１と、
前記２次ノードにアクセスし、前記２次ノードのテキストを取得する２次ノード取得手段１４と、
前記２次ノードのテキストを所定の大きさのテキスト片群に分割するテキスト分割手段１５と、
前記リンク近傍テキストに対する前記分割した各テキスト片それぞれの類似度を計算し、前記各テキスト片と前記類似度とを組とする類似度テーブルを出力する類似度計算手段１６と、
前記テキスト片群の中から最も類似している類似度のテキスト片を選択する最適テキスト片選択手段１７と、
前記入力テーブルを順次解析し、各々の前記２次ノードについて前記最適テキスト片選択手段より選択される前記各２次ノードの最適テキスト片を、現在時刻、前記ノードパス名、前記２次ノードパス名、及び前記リンク近傍テキストと共に夫々抽出してデータベース化するデータベース手段１３と
を備えた情報抽出装置に対して、検索するノードパス名を入力し、前記検索するノードパス名と前記ノードパス名とが一致する前記データベース手段１３より選択された前記データエントリ群を出力するようにして前記コンピュータが情報検索を行うことを特徴とした情報検索方法を提供し、
請求項４に記載された発明は、
請求項３に記載された情報検索方法において、
前記類似度計算手段１６は、テキストを形態素群に分割し、前記形態素群をベクトルと見なし、各文章に対応するベクトルのコサイン・シータを類似度と見なすことで、前記類似度を０以上かつ１以下の範囲とし、
前記最適テキスト片選択手段１７は、前記類似度テーブルより、類似度が所定の範囲にあるテキスト片のうち最大の類似度を有するテキスト片を選択するようにしたことを特徴とする情報検索方法を提供し、
請求項５に記載された発明は、
テキストとこのテキスト中に記されているリンクとよりなるノードを複数有し、前記リンクにはそのリンク先の２次ノードを指し示す２次ノードパス名が記されて前記各ノードの繋がりを示すように構成されたハイパーテキストから情報を抽出する情報抽出コンピュータプログラムにおいて、
検索キーワードに対応するノードを指し示すノードパス名が入力されると、前記ノードのテキスト及びリンクを解析し、前記２次ノードパス名と前記テキスト中のリンクの近傍に記載されているテキスト部分を切り出したリンク近傍テキストとを組とする入力テーブルを出力するノード分析ステップ１１と、
前記２次ノードにアクセスし、前記２次ノードのテキストを取得する２次ノード取得ステップ１４と、
前記２次ノードのテキストを所定の大きさのテキスト片群に分割するテキスト分割ステップ１５と、
前記リンク近傍テキストに対する前記分割した各テキスト片それぞれの類似度を計算し、前記各テキスト片と前記類似度とを組とする類似度テーブルを出力する類似度計算ステップ１６と、
前記テキスト片群の中から最も類似している類似度のテキスト片を選択する最適テキスト片選択ステップ１７と、
前記入力テーブルを順次解析し、各々の前記２次ノードについて前記最適テキスト片選択ステップより選択される前記各２次ノードの最適テキスト片を、現在時刻、前記ノードパス名、前記２次ノードパス名、及び前記リンク近傍テキストと共に夫々抽出してデータベース化するデータベースステップ１３と
をコンピュータに実行させることを特徴とする情報抽出コンピュータプログラムを提供し、
請求項６に記載された発明は、
請求項５に記載された情報抽出コンピュータプログラムにおいて、
前記類似度計算ステップ１６は、テキストを形態素群に分割し、前記形態素群をベクトルと見なし、各文章に対応するベクトルのコサイン・シータを類似度と見なすことで、前記類似度を０以上かつ１以下の範囲とし、
前記最適テキスト片選択ステップ１７は、前記類似度テーブルより、類似度が所定の範囲にあるテキスト片のうち最大の類似度を有するテキスト片を選択するようにしたことを特徴とする情報抽出コンピュータプログラムを提供するものである。
【００３７】
【発明の実施の形態】
情報抽出装置の実施の形態につき、好ましい実施例により、以下に図と共に説明する。
本発明の情報抽出装置は、取り扱われるテキスト情報を幾つかの小さな単位（ノード）に分割し、それらを関連付けて整理し、このような関連付けを示す情報をリンクと呼ぶハイパーテキストを有するハイパードキュメントシステムに適用されるものである。
【００３８】
本発明の情報抽出方法が適用される情報抽出装置（請求項１に記載の発明）の一実施例のブロック構成を図１に示す。
情報抽出装置の一実施例の各ブロック１１〜１７について、以下に説明する。起点ノード分析手段１１は、特定の検索キーワード（単語、語句等）の（起点）ノードパス名が入力されると、ノードパス名が指し示す（起点）ノードに含まれるハイパーテキストを取得する。
【００３９】
ここでいうハイパーテキストとは、ＨＴＭＬに類するマークアップ言語で記述されたテキストファイルであり、プレーンテキストとリンクを表すアンカータグが混在したテキストである。
アンカータグには、リンク先のノードを指し示すための２次ノードパス名が埋め込まれている。
【００４０】
この起点ノード分析手段１１は、ハイパーテキストに含まれるリンクを解析し、２次ノードパス名と、アンカータグ周辺のプレーンテキストを予め定められた切り出し法則に従って「リンク近傍テキスト」として取り出す。
例えば切り出し法則としては、日本語の場合は句点（。）が３回現れるまで取り出し、また文法に依存せずに切り出す場合は特定バイト数分カウントアップして取り出す等の法則が採用される。
【００４１】
図２に示したように２次ノードパス名ｎとそれに対応するリンク近傍テキストｎとは、一つの組として「入力テーブル」を生成する。
よってハイパーテキストにｎ個のアンカータグが含まれていた場合は、入力テーブルの総行数はｎ行となる。
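The link analysis described above (anchor tags, secondary node path names, and link vicinity text, リンク近傍テキスト) can be illustrated with a minimal Python sketch. It assumes the hypertext is HTML, uses the standard library `html.parser`, and reads the period-based cut-out rule as "start at the anchor and stop after the third 。", which is only one possible interpretation. The names `AnchorCollector`, `cut_until_periods`, and `build_input_table` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects plain text and the plain-text offset of every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # plain-text fragments in document order
        self.length = 0        # running length of the plain text seen so far
        self.anchors = []      # (href, offset of the anchor in the plain text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchors.append((href, self.length))

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def cut_until_periods(text, start, periods=3, mark="。"):
    """Return text from `start` up to and including the `periods`-th mark."""
    pos = start
    for _ in range(periods):
        nxt = text.find(mark, pos)
        if nxt == -1:
            return text[start:]      # fewer marks than requested: take the rest
        pos = nxt + 1
    return text[start:pos]


def build_input_table(hypertext):
    """One row per anchor: (secondary node path name, link vicinity text)."""
    collector = AnchorCollector()
    collector.feed(hypertext)
    collector.close()
    plain = "".join(collector.text_parts)
    return [(href, cut_until_periods(plain, offset))
            for href, offset in collector.anchors]
```

A hypertext with n anchors yields a table of n rows, matching the row count stated above. A byte-count rule could be swapped in for `cut_until_periods` without changing the table shape.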
【００４２】
データベース登録手段１２は、図３に示したようにｎ行の入力テーブルを入力として、後述するｍ行のデータエントリを生成する。
【００４３】
つぎに、このデータベース登録手段１２のフローチャートを図４に示す。
データベース登録手段１２の各ステップ（ステップＳ１０１〜Ｓ１１０）について説明する。
【００４４】
まず、データエントリ内の行操作を行うために変数ｍを、また入力テーブルの行操作を行うために変数ｉを、それぞれ初期化する(ステップＳ１０１)。
入力テーブルの全行を走査したら、終了するための分岐を持つ(ステップＳ１０２)。これはｎ回の繰り返し処理を含むことを意味する。
【００４５】
ｉ行目の入力テーブルを読み込み(ステップＳ１０３)、
２次ノードパス名を読み込み、これを２次ノード取得手段１４へ供給し起動を行う(ステップＳ１０４)。
【００４６】
つぎに、ｉ行目のリンク近傍テキストを類似度計算手段１６へ供給する(ステップＳ１０５)。
以上を終了すると、後述するサービスルーチンを経由して最適テキスト片ｔが生成されるので、最適テキスト片ｔの取得を行う(ステップＳ１０６)。
【００４７】
最適テキスト片ｔが空テキストかどうかの判定を行い(ステップＳ１０７)、空テキストの場合はデータエントリの書き込みをスキップする。
空テキストでない場合は、データエントリｍ行目を書き込み(ステップＳ１０８)、ｍを＋１する(ステップＳ１０９)。
【００４８】
いずれの場合も、行操作を行う為の変数ｉを＋１して(ステップＳ１１０)、
分岐 (ステップＳ１０２) に戻る。
以上がデータベース登録手段１２の制御の詳細である。
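The registration flow of steps S101 to S110 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: `register_entries`, `fetch_secondary`, and `select_best_piece` are hypothetical names, the two callables stand in for the secondary node acquisition and the optimal text piece selection service routines, and a data entry is modeled as a plain dictionary.

```python
from datetime import datetime, timezone


def register_entries(origin_path, input_table, fetch_secondary,
                     select_best_piece, now=None):
    """Build data entries from an input table (sketch of steps S101-S110)."""
    now = now or datetime.now(timezone.utc)
    entries = []                                           # S101: no entries yet
    for secondary_path, vicinity_text in input_table:      # S102/S103/S110: each row i
        secondary_text = fetch_secondary(secondary_path)   # S104: fetch the secondary node
        best = select_best_piece(vicinity_text, secondary_text)  # S105/S106
        if not best:                                        # S107: empty text, skip the write
            continue
        entries.append({                                    # S108/S109: write entry m, m += 1
            "time": now,
            "origin": origin_path,
            "secondary": secondary_path,
            "vicinity": vicinity_text,
            "piece": best,
        })
    return entries
```

Because rows whose optimal text piece is empty are skipped, the number of entries m can be smaller than the number of input rows n.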
【００４９】
つぎに、データエントリ内のデータについて説明する。
データベース登録手段１２は、現在時刻１〜ｍにデータベース登録手段１２に入力テーブルが入力された現在時刻をセットする（図３参照）。
また、起点ノードパス名１〜ｍは、入力テーブルに対応する起点ノードパス名をそのままセットする（図３参照）。
【００５０】
また、２次ノードパス名１〜ｍには、入力テーブルの各行に対応した２次ノードパス名をセットする（図３参照）。
また、リンク近傍テキスト１〜ｍにも、入力テーブルの各行に対応したリンク近傍テキストをセットする（図３参照）。
【００５１】
また、最適テキスト片１〜ｍには、２次ノードパス名が指し示すハイパーテキストに対して最適テキスト片選択手段１７が処理を行い、その結果生成されたテキストをセットする（図３参照）。
【００５２】
データベース手段１３は、データベース登録手段１２から出力されたデータエントリを格納し、かつ容易に検索可能にするためのデータベースとしての基本機能を備える。
少なくとも、起点ノードパス名をキーとして、０個以上のデータエントリを出力可能である（図３参照）。
【００５３】
つぎに、データベース登録手段１２に関するサービスルーチン（２次ノード取得手段１４〜最適テキスト片選択手段１７）について、以下に説明する。
２次ノード取得手段１４は、データベース登録手段１２から出力された２次ノードパス名を入力とし、２次ノードパス名が指し示すノードからハイパーテキストを取得し出力する。
例えば、起点ノードがＨＴＭＬで作成されていれば、アンカータグ（＜Ａ＞．．．＜／Ａ＞）で囲まれた領域内のＵＲＬ(Uniform Resource Locator)を抽出し、そのＵＲＬで指定された文書（２次ノード）を取得し出力する。
【００５４】
テキスト分割手段１５は、ハイパーテキストからリンク等のマークアップテキストを取り除きプレーンテキストに変換した後に、全体をｑ個のテキスト片に分割する。
【００５５】
分割のアルゴリズムには、
マークアップの段落タグを基準に分割する方法、
予め指定されたバイト数毎に分割する方法、
プレーンテキストが日本語で記述されている場合は句点（。）３つカウント毎に分割する方法などが考えられる。
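The three splitting strategies listed above can be sketched in Python. The function names are hypothetical, the markup handling is a crude regular-expression approximation rather than a full HTML parser, and the byte-based variant assumes UTF-8 and never cuts inside a character, which the text does not specify.

```python
import re


def split_by_paragraph_tags(hypertext):
    """Split on <p> / </p> tags, then strip any remaining markup."""
    chunks = re.split(r"(?i)</?p\b[^>]*>", hypertext)
    pieces = [re.sub(r"<[^>]+>", "", chunk).strip() for chunk in chunks]
    return [piece for piece in pieces if piece]


def split_by_bytes(plain, size, encoding="utf-8"):
    """Greedy split: start a new piece when the next character would exceed `size` bytes."""
    pieces, current, used = [], [], 0
    for ch in plain:
        n = len(ch.encode(encoding))
        if current and used + n > size:
            pieces.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += n
    if current:
        pieces.append("".join(current))
    return pieces


def split_by_periods(plain, periods=3, mark="。"):
    """Close a piece after every `periods`-th occurrence of the period mark."""
    pieces, start, count = [], 0, 0
    for i, ch in enumerate(plain):
        if ch == mark:
            count += 1
            if count == periods:
                pieces.append(plain[start:i + 1])
                start, count = i + 1, 0
    if start < len(plain):
        pieces.append(plain[start:])   # trailing text with fewer marks
    return pieces
```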
【００５６】
類似度計算手段１６は、データベース登録手段１２から１個のリンク近傍テキストと、テキスト分割手段１５から分割したｑ個のテキスト片を入力とし、テキスト片と類似度とを組とするｑ組の類似度テーブルを出力する（図５参照）。
類似度１〜ｑは、それぞれテキスト片１〜ｑをリンク近傍テキストとの組み合わせについて類似度を計算したものである。
【００５７】
最適テキスト片選択手段１７は、全部でｑ個の類似度を比較することにより、ｑ個のテキスト片の中からただ１つの最適テキスト片を選択して出力する。
最適テキスト片選択手段１７のフローチャートを図６に示す。
【００５８】
最適テキスト片選択手段１７の各ステップ(Ｓ２０１〜Ｓ２０７)について以下に順次説明する。
まず、類似度テーブルを走査するための行カウンタ変数ｉ（＝１）、
最適テキスト片を保持する変数ｔ（＝空テキスト）、
最大類似度ｒ（＝０）の初期化を行う(ステップＳ２０１)。
【００５９】
類似度テーブルの総行数ｑと行カウンタ変数ｉを比較し、ｉが大きければテーブルの終端まで走査したと見なして終了する(ステップＳ２０２)。
終了時には最適テキスト片ｔをデータベース登録手段１２へ書き込む(ステップＳ２０３)。
【００６０】
終了せずに、テーブルの走査が継続している場合には、類似度テーブルからｉ行目を読み込み、行の内容をそれぞれ現在テキスト片t(i)と現在類似度r(i)にセットする(ステップＳ２０４)。
【００６１】
現在類似度r(i)と最大類似度rを比較する(ステップＳ２０５)。
現在類似度r(i)の方が最大類似度rよりも大きい場合のみ、
最大類似度r = r(i)、最適テキスト片ｔ＝t(i)を実行する(ステップＳ２０６)。
行を進めるために、ｉを＋１する(ステップＳ２０７)。
【００６２】
以上が最適テキスト片選択手段１７の制御の詳細である。
これは、適したテキスト片が一つもないか、類似度テーブルの行数が０の場合は、空テキストを出力し、それ以外の場合は、ｑ個の類似度のうち、最も値が大きいものを出力することを意味する。
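The scan of steps S201 to S207 is a maximum search over the similarity table. A minimal Python sketch, assuming the table is a list of `(text_piece, similarity)` pairs and using the hypothetical name `select_best_piece`:

```python
def select_best_piece(similarity_table):
    """Return the text piece with the largest similarity, or "" if none exceeds 0."""
    best_text, best_sim = "", 0.0              # S201: t = empty text, r = 0
    for piece, sim in similarity_table:        # S202/S204/S207: scan rows 1..q
        if sim > best_sim:                     # S205: strictly greater than r
            best_text, best_sim = piece, sim   # S206: r = r(i), t = t(i)
    return best_text                           # S203: hand t back to the caller
```

Because the comparison is strict, ties keep the earliest row, and a table whose similarities are all 0 yields the empty text.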
【００６３】
つぎに、データベースに格納されたデータエントリを検索するブロックについて説明する。
関連ノード検索手段１８のフローチャートを図７に示す。
各ステップ(Ｓ４０１〜Ｓ４０８)について以下に説明する。
【００６４】
まず、関連ノード検索手段１８は、ドキュメントリーダに指定の所定日時の検索時刻の入力を促す(ステップＳ４０１)。
つぎに、検索ノードパス名の入力を促す(ステップＳ４０２)。
【００６５】
つぎに、検索ノードパス名をキーとし、データベース手段１３に対して検索を行う(ステップＳ４０３)。
データエントリのうち主キーとなるのは起点ノードパス名であり、サブキーとなるのが指定の所定日時の検索時刻である。
【００６６】
この時のデータベースクエリー（キーワードの組み合わせ）は以下のものである。
（起点ノードパス名＝検索ノードパス名）
かつ（現在時刻＞所定日時の検索時刻）
いうなれば、利用者が検索対象としているノードの、所定日時の検索時刻より新しいデータエントリのデータが選択される。
【００６７】
つぎに、得られたデータエントリを現在時刻をキーとして時刻順ソートを行う(ステップＳ４０４)。
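The query and sort of steps S403 and S404 can be sketched with the Python standard library `sqlite3` module. The table name `entries`, its column names, and the functions `open_db` and `search_related` are assumptions made for illustration. Timestamps are stored as ISO 8601 strings in a single time zone so that string comparison matches chronological order.

```python
import sqlite3


def open_db():
    """Create an in-memory database holding one table of data entries."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE entries ("
        "captured_at TEXT, origin TEXT, secondary TEXT, vicinity TEXT, piece TEXT)"
    )
    return db


def search_related(db, search_path, since):
    """Entries whose origin equals `search_path` and that are newer than `since`,
    ordered from oldest to newest."""
    cursor = db.execute(
        "SELECT captured_at, origin, vicinity, secondary, piece FROM entries "
        "WHERE origin = ? AND captured_at > ? "
        "ORDER BY captured_at ASC",
        (search_path, since),
    )
    return cursor.fetchall()
```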
【００６８】
これにより古いものから新しいものへデータエントリの整列が行われる。
データエントリに含まれる、
現在時刻、起点ノードパス名、リンク近傍テキスト、２次ノードパス名、及び最適テキスト片、の５つのデータは、ドキュメントリーダに対して提示するためにレンダリング(rendering)される(ステップＳ４０５)。
【００６９】
レンダリングの際には、２次ノードパス名部分を新たな対話入力によって選択出来るように、ボタン・リンクアンカー等の表示を行う。
ドキュメントリーダは、自身の興味にしたがって、この２次ノードパス名を選択することが出来る(ステップＳ４０６)。
【００７０】
２次ノードパス名が選択された場合は(ステップＳ４０７)、
選択された２次ノードパス名を次回の検索時の検索ノードパス名としてセットし、データベース検索を繰り返すことも出来る(ステップＳ４０８)。
【００７１】
なお、本実施例において使用される関連ノード検索手段１８では、ステップＳ４０２において検索ノードパス名を入力としたが、形態素インデックス（形態素と、形態素が含まれる検索ノードパス名のリストを保持するデータベース）を用意しておき、検索ノードパス名を直接入力する代わりに利用者が適合したキーワードを入力することで、形態素インデックスから検索ノードパス名を得るようにしてもよい。
以上が請求項１記載の情報抽出方法である。
【００７２】
つぎに、本発明の情報抽出方法が適用される情報抽出装置（請求項２に記載）の一実施例のブロック構成について説明する。
請求項１記載の情報抽出装置と比較して、類似度計算手段１６と、最適テキスト片選択手段１７に違いがある。
【００７３】
類似度計算手段１６は、類似度計算のアルゴリズムに、ベクトル空間モデルを利用する。
ベクトル空間モデルとは、比較対象のテキスト全てからユニークな形態素集合を求め、各テキストを行方向に形態素の有無を列記した行ベクトルと見なし、ベクトルのコサイン・シータを求め、これを類似度とする手法である。
【００７４】
仮に、リンク近傍テキストと全てのテキスト片１〜ｑに含まれるユニークな形態素がｐ個であるとすると、リンク近傍テキストに対応するベクトルＸrと、s番目のテキスト片に対応するベクトルＸs(１≦s≦ｑ)は、それぞれの形態素の存在(１)、非存在(０)を表す変量Ｘを用いて、以下の数１、数２のように表すことが出来る。
【００７５】
【数１】

【数２】

【００７６】
このとき、２つのテキストの類似度であるコサイン・シータ（cosθ）は、ベクトルの内積の公式から、
以下の、数３のように求める。
【００７７】
【数３】

よって、類似度は、cosθの特性により０以上かつ１以下の範囲の値を取り得る。
類似度は、１の場合は完全一致、０の場合は完全独立を意味する。
ここで、上記の式、数３の右辺の計算方法について説明する。
仮にＸr＝Ａ＝[ 1 0 1 ]、Ｘs＝Ｂ＝[ 1 0 0 ]という行ベクトルであるとすると、
この内積Ｘr・Ｘs´は、
A11 × B11 + A12 × B12 + A13 × B13 = 1 × 1 + 0 × 0 + 1 × 0 = 1
となる。
【００７８】
一方、Ｐ及びＱは各ベクトルの大きさ（ノルム）であり、ベクトルの各要素を二乗して加算した合計の平方根を取ったものであるので、
Ｐは sqrt(1×1 + 0×0 + 1×1) = １．４１４、
Ｑは sqrt(1×1 + 0×0 + 0×0) = １
と表すことが出来る。
【００７９】
よって、上記のcosθにあたる値は、１／(１．４１４×１) ＝ ０．７０７となり、同時に角度（θ＝４５度）も求まる。
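The formulas labeled 数１ to 数３ do not survive in this text. Based only on the surrounding definitions (binary presence variables, p unique morphemes, the inner product Xr・Xs´, and the norms P and Q), a plausible reconstruction is given below. The lowercase element symbols $x_{rk}$ and $x_{sk}$ are an assumed notation, not the patent's own.

```latex
\begin{align}
X_r &= \begin{bmatrix} x_{r1} & x_{r2} & \cdots & x_{rp} \end{bmatrix},
  \qquad x_{rk} \in \{0, 1\} \\
X_s &= \begin{bmatrix} x_{s1} & x_{s2} & \cdots & x_{sp} \end{bmatrix},
  \qquad x_{sk} \in \{0, 1\},\quad 1 \le s \le q \\
\cos\theta &= \frac{X_r \cdot X_s'}{P\,Q},
  \qquad P = \sqrt{\sum_{k=1}^{p} x_{rk}^{2}},
  \quad Q = \sqrt{\sum_{k=1}^{p} x_{sk}^{2}}
\end{align}
```

For the worked example, with $A = [\,1\ 0\ 1\,]$ and $B = [\,1\ 0\ 0\,]$:

```latex
\cos\theta = \frac{1\cdot 1 + 0\cdot 0 + 1\cdot 0}{\sqrt{1^2 + 0^2 + 1^2}\,\sqrt{1^2 + 0^2 + 0^2}}
           = \frac{1}{\sqrt{2}\cdot 1} \approx 0.707,
\qquad \theta = 45^\circ
```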
【００８０】
以上は、非常に低い次元で横の変量が少ない場合の説明であり、現実には、インターネット検索エンジンのような応用では、標本としてのハイパーテキストは数十万乃至数十億個に達する。
また、それら全てから抽出した形態素の個数、正確には形態素辞書の登録分だけの次元があることになる（日本語の場合では、数万次元）。
【００８１】
また、一般的に類似度は、検索結果の膨大さに対処し、的確な結果を利用者に提示するため、指標として用いられる。
例えば１回検索すると一番検索に適したページが１件目、つぎに似ているものが２件目、といったようにランキングを求める用途に使用する場合が多いので、結果として標本ｎ個、変量ｍ個からなる（ｍ×ｎ）標本行列から、（ｎ×ｎ）の類似度行列を求めることが多い。
【００８２】
類似度行列は、縦・横が標本の番号で、中の要素には類似度が入っている対称行列である。例えば標本１２０番のテキストと、標本５６５番のテキストの類似度を知りたいのなら（１２０,５６５）の要素を取り出す。
対称行列なので（５６５,１２０）の要素でも同じことである。
【００８３】
類似度に関する理論的な部分は以上であるが、実際の実装ではいろいろな工夫がなされる。（数万×数億）の標本行列を１台のコンピュータで計算すると時間がかかるので、互いに同一形態素を含む標本同士をまとめ、グループ間では共通の形態素を全く含まないように分割し、行列の独立性を利用して、各々のグループを複数台のコンピュータに仕分けして演算すると効率がよい。
【００８４】
また、より高い類似度を求めるために、形態素のあるなし１または０で変量を表すのをやめ、形態素の分布を調べて、助詞・助動詞などのどの文章にも頻出する形態素の重み付けを下げたり、逆に特定の文章にしか出現しない形態素の重み付けを上げたりする。
この重み付けとして代表的なものがＴＦ・ＩＤＦ法である。
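The TF・IDF weighting mentioned above can be sketched in plain Python. This shows one common variant (raw term frequency multiplied by log(N / document frequency)); the text does not say which variant is intended, and `tf_idf_vectors` is a hypothetical name. Each document is assumed to be already split into a list of morphemes.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of morpheme lists. Returns (vocabulary, list of weight vectors)."""
    vocabulary = sorted({m for doc in docs for m in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each morpheme occurs.
    df = Counter(m for doc in docs for m in set(doc))
    idf = {m: math.log(n_docs / df[m]) for m in vocabulary}
    vectors = []
    for doc in docs:
        tf = Counter(doc)   # Counter returns 0 for morphemes absent from doc
        vectors.append([tf[m] * idf[m] for m in vocabulary])
    return vocabulary, vectors
```

A morpheme that occurs in every document, such as a particle, gets weight log(1) = 0, while a morpheme confined to a few documents is weighted up.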
【００８５】
前記ベクトル空間モデルを図９を用いて具体的に説明する。
ここでは、リンク近傍テキストと、ｑ＝３として３つに分割したテキスト片１〜３を考え、リンク近傍テキストと最も似たテキスト片をベクトル空間モデルを用いて判定するものとする。
【００８６】
仮に、リンク近傍テキストに形態素「ミカン」、「温州」が含まれ、テキスト片１には形態素「ミカン」が、テキスト片２には形態素「温州」が、テキスト片３には形態素「ミカン」、「温州」が含まれるものとする。
この時、ユニークな形態素数ｎは「ミカン」、「温州」のｎ＝２であり、全ベクトルは２次元で表すことが出来る。
【００８７】
リンク近傍テキストのベクトルをＸr, テキスト片のベクトルＸs(１≦ｓ≦３)とし、「ミカン」軸をＸ軸、「温州」軸をＹ軸とすると、図９のような２次元グラフに表すことが出来る。
まず、ベクトルＸrとＸ1について比較する。両者は１つだけ同じ形態素「ミカン」を含み、両者のベクトル角はθ1＝４５度であるから、cosθ1は約０．７となる。
【００８８】
つぎに、ベクトルＸrとＸ2について比較する。両者は１つだけ同じ形態素「温州」を含み、両者のベクトル角はθ2＝４５度であるから、cosθ2は約０．７となる。
つぎに、ベクトルＸrとＸ3について比較する。両者は完全に一致した形態素「温州」、「ミカン」を含み、両者のベクトル角はθ3＝０度であり、cosθ3は１（完全一致・類似度極大）となる。
【００８９】
以上の各cosθ1〜cosθ3の値を比較すると、cosθ3が最も大きく、すなわち、リンク近傍テキストに最も類似している類似度のテキスト片は、テキスト片３であることを示している。
【００９０】
一方、ベクトルＸ1とＸ2に着目すると、両者は１つも同じ形態素を含まないベクトルである。両者のベクトル角はθ4＝９０度であり、cosθ4は０（完全独立・類似度極小）となる。
このように、ベクトル空間モデルは各ベクトル間の傾きから類似度を算出する方式である。
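The ミカン／温州 example can be checked numerically with a short Python sketch. It assumes each text has already been reduced to its list of morphemes; `binary_vector` and `cosine` are hypothetical helper names.

```python
import math


def binary_vector(morphemes, vocabulary):
    """1 where the morpheme is present in the text, 0 where it is absent."""
    present = set(morphemes)
    return [1 if m in present else 0 for m in vocabulary]


def cosine(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is the zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


vocabulary = ["ミカン", "温州"]                     # X axis, Y axis
x_r = binary_vector(["ミカン", "温州"], vocabulary)  # link vicinity text
pieces = [["ミカン"], ["温州"], ["ミカン", "温州"]]   # text pieces 1 to 3
vectors = [binary_vector(p, vocabulary) for p in pieces]
similarities = [cosine(x_r, v) for v in vectors]
best_index = max(range(len(pieces)), key=lambda i: similarities[i])
```

Here `similarities` is approximately 0.707, 0.707, and 1.0, so `best_index` is 2 (text piece 3), and `cosine(vectors[0], vectors[1])` is 0, matching the 90 degree case.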
【００９１】
類似度を角度ではなくcosθを利用する理由は、cosθは全てのベクトルが正の変量しか持たない場合、類似度は０〜１の範囲に必ず収まり、類似度を互いに比較可能なように数値化する方式として好適であるからである。
【００９２】
ベクトル空間モデルによる類似度計算においては、変量Ｘに形態素の出現頻度によって重み付けを行うＴＦ・ＩＤＦ法、あるいは形態素の品詞をチェックして文脈寄与度の高い名詞・固有名詞のみでベクトルを構成する絞り込み法を組み合わせて使うことが多い。
【００９３】
つぎに請求項２に記載の最適テキスト片選択手段１７について説明する。
最適テキスト片選択手段１７は、予め定められた２つの閾値ａ１と閾値ａ２を有する。
これは前記類似度計算手段１６でcosθによって０以上かつ１以下に正規化された類似度を、以下の３領域に分割するために用いられる。
【００９４】
登録棄却領域：閾値０ ≦ 類似度＜閾値ａ１
登録可能領域：閾値ａ１ ≦ 類似度＜閾値ａ２
候補除外領域：閾値ａ２ ≦ 類似度 ≦ 閾値１
【００９５】
登録棄却領域とは、類似度が著しく低く、リンク近傍テキストに関連するテキスト片として採用しても意味がない領域を意味する。
また、候補除外領域は、類似度が著しく高く、リンク近傍テキストそのものがテキスト片として含まれている可能性が高い領域を意味する。
【００９６】
候補除外領域は、２つのテキスト間の全文引用を候補として抽出することを避けること、あるいは構造化された複数のハイパーテキストによく含まれるナビゲーションバー等の定型的なリンク近傍のテキスト片を排除するために設けている。
【００９７】
これらに対して登録可能領域は、リンク近傍テキストに関連の深いテキストとして適当な領域であることを意味する。
ここで閾値ａ１、ａ２の一実施例を示す。
例えば、総形態素数ｐ＝１００であり、リンク近傍テキスト及び各テキスト片に含まれる形態素の平均数が１０個であると仮定した場合、
閾値ａ１= ０．４４８、閾値ａ２＝０．８９４５が適当である。
【００９８】
ｑ個の類似度全てが登録棄却領域または候補除外領域に属している場合、最適テキスト片選択手段１７は空テキストを出力する。
それ以外の場合は、ｑ個の類似度のうち、最も値が大きいものを採用し、対応するテキスト片を最適テキスト片として出力する。
【００９９】
請求項２に記載の最適テキスト片選択手段１７のフローチャートを図８に示す。各ステップ(ステップＳ３０１〜Ｓ３０８)について以下に説明する。
図８中の(ステップＳ３０１〜Ｓ３０７)までの各要素は、図６中の(ステップＳ２０１〜Ｓ２０７)までにそれぞれ対応するものであり、その説明を省略する。
【０１００】
現在類似度r(i)を閾値ａ１及び閾値ａ２と比較し(ステップＳ３０８)、
登録可能領域である場合には(ステップＳ３０８Ｙｅｓ)、
ステップＳ３０５に進む処理を行い、
登録可能領域外である場合には(ステップＳ３０８Ｎｏ)、
最大類似度rとの比較処理をスキップして、ステップＳ３０７に進む処理を行う。
以上が請求項２記載の情報抽出装置である。
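The threshold-gated selection of Fig. 8 (steps S301 to S308) can be sketched in Python. The function name `select_best_piece_in_band` is hypothetical, the similarity table is assumed to be a list of `(text_piece, similarity)` pairs, and the default thresholds are simply the example values a1 = 0.448 and a2 = 0.8945 quoted above.

```python
def select_best_piece_in_band(similarity_table, a1=0.448, a2=0.8945):
    """Best text piece among rows with a1 <= similarity < a2, or "" if none."""
    best_text, best_sim = "", 0.0               # S301: t = empty text, r = 0
    for piece, sim in similarity_table:         # S302/S304/S307: scan rows 1..q
        if not (a1 <= sim < a2):                # S308: outside the registrable region
            continue                            # skip the comparison with r
        if sim > best_sim:                      # S305
            best_text, best_sim = piece, sim    # S306
    return best_text                            # S303
```

Rows in the rejection region (below a1) and in the candidate exclusion region (a2 and above, for example a navigation bar copied verbatim) never reach the maximum comparison.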
【０１０１】
つぎに本発明における情報抽出コンピュータプログラム（請求項５及び請求項６に記載の発明）について説明する。
本発明の情報抽出はその情報抽出処理を行うコンピュータプログラムまたはそのコンピュータプログラムを記録した記録媒体のコンピュータプログラムをコンピュータで読み取ることによって実現することが出来る。
【０１０２】
図１の情報抽出装置のブロック図において、各構成手段をコンピュータプログラムでコンピュータに実行させる各ステップに書き換えると、請求項５及び請求項６に記載の各ステップからなる、コンピュータにより実行可能なコンピュータプログラムを提供することが出来る（図２乃至図９参照）。
【０１０３】
本発明の情報抽出コンピュータプログラム（請求項５に記載の発明）は、図１乃至図７に示されるように、情報の単位であるノードとノード間のリンクで構成されたハイパーテキストの中から特定の検索キーワードに関する情報を抽出する情報抽出コンピュータプログラムにおいて、
前記特定の検索キーワードにより起点ノードを指し示す起点ノードパス名が入力されると、前記起点ノードに含まれるテキストとリンクとを解析し、前記２次ノードを指し示す前記２次ノードパス名とリンク近傍テキストとの組を入力テーブルとして出力する起点ノード分析ステップ１１と、
前記２次ノードにアクセスし、前記２次ノードに含まれるテキストを取得する２次ノード取得ステップ１４と、
前記テキストを所定の大きさのテキスト片群に分割するテキスト分割ステップ１５と、
前記起点ノードから派生したリンク近傍テキストと、前記２次ノードパス名から派生したテキスト片群それぞれの類似度を計算し、前記テキスト片と前記類似度とを組とした類似度テーブルを出力する類似度計算ステップ１６と、
前記類似度テーブルを解析し、前記テキスト片群の中から前記リンク近傍テキストに相応しい最適テキスト片を選択する最適テキスト片選択ステップ１７と、前記入力テーブルを順次解析し、各々の２次ノードについて前記最適テキスト片選択手段１７を呼び出し、得られた前記最適テキスト片を現在時刻と共にデータベース登録するデータベース登録ステップ１２と、
前記現在時刻、前記起点ノードパス名、前記２次ノードパス名、前記リンク近傍テキスト、及び前記最適テキスト片をデータエントリとしてデータベース化するデータベースステップ１３と
をコンピュータに実行させる情報抽出コンピュータプログラムを提供するものである。
【０１０４】
また、本発明の情報抽出コンピュータプログラム（請求項６に記載の発明）は、図１、図４、図８、及び図９に示されるように、上記の情報抽出コンピュータプログラムにおいて、
前記類似度計算ステップ１６は、テキストを形態素群に分割し、前記形態素群をベクトルと見なし、各文章に対応するベクトルのコサイン・シータを類似度と見なすことで、前記類似度を０以上かつ１以下の範囲で求め、
前記最適テキスト片選択ステップ１７は、前記類似度テーブル中の前記類似度を比較検討し、所定の範囲の値（閾値ａ１以上かつ閾値ａ２未満）にあるテキスト片のうち最大の類似度を有するテキスト片を選択するようにした情報抽出コンピュータプログラムを提供するものである。
【０１０５】
このような情報抽出コンピュータプログラムは、一般のコンピュータやＰＤＡ(Personal Digital Assistant)で実行してもよいし、そのプログラムをＬＳＩに組み込んで実行してもよい。
【０１０６】
また更に、ＤＶＤ-ＲＯＭのようなその情報抽出コンピュータプログラムを記録した記録媒体により供給してもよいし、インターネットを介してネットワークに供給してもよい。
【０１０７】
以上の本発明の実施例の構成の効果を纏めると、本実施例によると、現在時刻と起点ノードパス名と２次ノードパス名とリンク近傍テキストと最適テキスト片をデータエントリとしてデータベース化するデータベース手段と、検索時刻と検索ノードパス名が入力されると、前記検索時刻以降かつ検索ノードパス名と起点ノードパス名が一致する前記データエントリ群を得て出力する関連ノード検索手段とを有しているので、監視対象のハイパーテキストに新たに追加されていくリンクをデータベースに保存することが出来、ドキュメントリーダは任意のタイミングでリンクの存在、リンクが追加された時刻、リンク近傍のテキスト、リンク先のハイパーテキストを容易に検索することが出来る。
【０１０８】
本実施例によると、最適テキスト片選択手段を有しリンクに関連するテキストの一部分のみを切り出してデータベースに保存するため、特定のハイパーテキストを継続的に監視する場合、取得時刻毎にハイパーテキスト全文を保存する方式よりも記憶領域を節約することが出来る。
【０１０９】
本実施例によると、監視対象のハイパーテキストが他のノードへのリンクを伴う誹謗中傷等の攻撃を行い得る要注意監視対象である場合、ドキュメントリーダは定期的に本発明を実施することで、監視対象のハイパーテキスト全文を読むことなく、迅速にリンクの意図を汲み取り、誹謗中傷等を発見することが出来る。
【０１１０】
本実施例によると、現在時刻と起点ノードパス名と２次ノードパス名とリンク近傍テキストと最適テキスト片をデータエントリとしてデータベース化するデータベース手段を有し、検索時刻と検索ノードパス名が入力されると、前記検索時刻以降かつ検索ノードパス名と起点ノードパス名が一致する前記データエントリ群を得て出力する関連ノード検索手段を有しているので、ドキュメントリーダは定期的に本発明を実施することで、誹謗中傷等のリンクが何時追加され、何時削除されたかの具体的な証拠を時刻と共にデータベースに保存することが出来る。
【０１１１】
本実施例では、最適テキスト片選択手段を有しているので、２つのハイパーテキスト間で極端に類似しているテキストをデータベースに登録しないため、ハイパーテキスト中のナビゲーションバー等の定型的なテキストを予め排除し、ドキュメントリーダにとり有益な変化のあるテキストのみを提供し、ドキュメントリーダがリンクを辿る必要性を低下させ、労力を軽減させることが出来る。
【０１１２】
本実施例では、現在時刻と起点ノードパス名と２次ノードパス名とリンク近傍テキストと最適テキスト片をデータエントリとしてデータベース化するデータベース手段を有し、検索時刻と検索ノードパス名が入力されると、前記検索時刻以降かつ検索ノードパス名と起点ノードパス名が一致する前記データエントリ群を得て出力する関連ノード検索手段を有しているので、ある時刻に起点ノードが誹謗中傷を行ったとして、ドキュメントリーダが２次ノードへの被害の伝播を確認する際に、ドキュメントリーダが予め指定した検索時刻以降に出現したリンクのみを抜き出すことが出来、さらに２次ノードのリンクを表示する際にも、所定日時の検索時刻以降に出現したリンクのみを抜き出すため、２次ノード側から起点ノードに対して誹謗中傷に対する抗議などを行った証拠を見付け易く、ドキュメントリーダがリンクを辿る必要性を低下させ、労力を軽減させることが出来る。
【０１１３】
【発明の効果】
以上に説明したように請求項１または請求項５に記載された発明によると、ノード分析ステップ、最適テキスト片選択ステップ、データベースステップを有し、最適テキスト片選択ステップによりリンクに関連する２次ノードのテキストの最適な一部分のみを切り出してデータベースステップに保存するため、特定のキーワードのハイパーテキストを継続的に監視するような場合も、取得時刻毎にこのリンクに関連するテキスト付きのテキスト全文を保存するものよりも、本発明は記憶領域を節約することが出来、そのデータの内容も短時間で理解することが出来る。
【０１１４】
また、請求項１または請求項５に記載された発明によると、監視対象の特定の検索キーワードに関するハイパーテキストに新たに追加されていくリンクをデータベース手段に保存するようにすれば、利用者は任意のタイミングでリンクの存在、リンクが追加された時刻、リンク近傍のテキスト、リンク先のハイパーテキストを検索することが出来る。
【０１１５】
また、請求項２または請求項６に記載された発明によると、２つのハイパーテキスト間で極端に類似しているテキストをデータベース登録手段に登録しないため、ハイパーテキスト中のナビゲーションバー等の定型的なテキストを予め排除し、利用者にとり有益な変化のあるテキストのみを提供し、利用者がリンクを辿る必要性を低下させ、労力を軽減させることが出来る。
【０１１６】
また、請求項３に記載された情報検索方法の発明によると、監視対象の特定の検索キーワードに関するハイパーテキストに新たに追加されていくリンクをデータベース手段に保存するようにすれば、利用者は任意のタイミングでリンクの存在、リンクが追加された時刻、リンク近傍のテキスト、リンク先のハイパーテキストを検索することが出来る。
【０１１７】
また、請求項３に記載された情報検索方法の発明によると、検索ノードに対応した２次ノードの最適な一部のテキスト部分のみの情報を検索することが出来、より速く理解することが出来る。
【０１１８】
また、請求項５に記載された発明によると、情報抽出装置が有している処理内容はコンピュータプログラムに記述されており、このプログラムをコンピュータで実行することにより、上記処理がコンピュータで実現出来る。
【図面の簡単な説明】
【図１】本発明における請求項１記載の情報抽出方法が適用される情報抽出装置の一実施例のブロック構成を示した図である。
【図２】請求項１に記載の発明が適用される情報抽出装置を構成する起点ノード分析手段の一実施例が出力する入力テーブルのデータ構造を示したものである。
【図３】請求項１に記載の発明が適用される情報抽出装置を構成するデータベース登録手段の一実施例が出力するデータエントリのデータ構造を示したものである。
【図４】請求項１に記載の発明が適用される情報抽出装置を構成するデータベース登録手段の一実施例の動作をフローチャートとして示したものである。
【図５】請求項１に記載の発明が適用される情報抽出装置を構成する類似度計算手段の一実施例が出力する類似度テーブルのデータ構造を示したものである。
【図６】請求項１に記載の発明が適用される情報抽出装置を構成する最適テキスト片選択手段の一実施例の動作をフローチャートとして示したものである。
【図７】請求項１に記載の発明が適用される情報抽出装置を構成する関連ノード検索手段の一実施例の動作をフローチャートとして示したものである。
【図８】請求項２に記載の発明が適用される情報抽出装置を構成する最適テキスト片選択手段の一実施例の動作をフローチャートとして示したものである。
【図９】請求項２に記載の発明が適用される情報抽出装置を構成する最適テキスト片選択手段の一実施例に用いられるベクトル空間モデルを説明した図である。
【図１０】従来の情報抽出装置の一例のブロック構成を示した図である。
【図１１】従来の情報抽出装置の一例の特徴抽出処理の流れのフローチャートを示した図である。
【図１２】従来の情報抽出装置の一例の類似性判定手段の処理手順のフローチャートを示した図である。
【符号の説明】
１０情報抽出装置
１１起点ノード分析手段（ノード分析ステップ）
１２データベース登録ステップ（データベース登録手段）
１３データベースステップ（データベース手段）
１４２次ノード取得ステップ（２次ノード取得手段）
１５テキスト分割ステップ（テキスト分割手段）
１６類似度計算ステップ（類似度計算手段）
１７最適テキスト片選択ステップ（最適テキスト片選択手段）
１８関連ノード検索ステップ（関連ノード検索手段）
ａ１，ａ２閾値
ｒ最大類似度
ｔ最適テキスト片
Ｘ1〜Ｘ3,Ｘr ベクトル
θ1〜θ4 ベクトル角
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an information extraction method, an information retrieval method, and an information extraction computer program, and more particularly to an information extraction method for extracting information from a hyperdocument system in which information is organized as nodes and links between nodes, to an information retrieval method using such an apparatus, and to an information extraction computer program.
[0002]
[Prior art]
In the hyper document system, the text information to be handled is divided into several small units (nodes), and these are associated and organized. Information indicating such association is called a link.
For example, on the Internet, a hyper document system is constructed by WWW (World Wide Web).
WWW information is described in HTML (Hyper Text Markup Language). This HTML has no semantic restrictions on links between nodes.
The system having no semantic restriction on the link between the nodes has an advantage that the document author (author) can determine the content and the link structure at will.
[0003]
By such a hyper document system, the document reader (reader) can follow the link structure constructed by the document author and access all information provided by the document author.
[0004]
However, even when the document reader finds a node containing the desired information, the links carry no semantic restriction, so the reader must actually follow each link to check whether the link destination node contains related information.
However, the amount of information in the hyper document system is enormous. Therefore, in order for the document reader to find necessary information, a system that supports information retrieval is required.
[0005]
Such a conventional system for supporting information retrieval (information extracting apparatus disclosed in Japanese Patent Laid-Open No. 11-3347) will be described below with reference to the drawings.
FIG. 10 shows a block configuration of an example of a conventional information extraction apparatus. Here, consider a case where the document author creates a document by dividing certain information into three nodes 701 to 703.
In this example, the node 701 links to the two nodes 702 and 703. The node 701 is then input to the information extraction device 710 as the origin node.
[0006]
When the origin node 701 is input, the origin node feature extraction unit 711 analyzes the contents of the origin node 701 and extracts its features.
The extracted features are supplied to the similarity determination unit 714 as an origin node feature profile 711a. Here, the feature information of a node is a set of pairs, each consisting of a word that characterizes the contents of the node and a value indicating that word's importance.
For example, the origin node feature profile 711a is created by weighting based on the appearance frequency, appearance position, and part-of-speech information regarding each word appearing at the origin node.
[0007]
The secondary node acquisition unit 712 acquires nodes 702 and 703 linked from the input origin node 701 as secondary nodes. Each acquired secondary node is supplied to secondary node feature extraction means 713.
[0008]
The secondary node feature extraction unit 713 analyzes the content of the secondary node acquired by the secondary node acquisition unit 712 and extracts the feature of each secondary node. The extracted feature is supplied to the similarity determination unit 714 as a secondary node feature profile 713a.
Like the origin node feature extraction unit 711, the secondary node feature extraction unit 713 weights each word appearing in the secondary node based on its appearance frequency, appearance position, and part of speech, and creates the secondary node feature profile 713a.
[0010]
As a result, a combined node 704 is obtained by merging into the origin node the secondary nodes that are similar to it.
The origin node supplied to the information extraction device 710 is, for example, a node obtained in advance by randomly scanning nodes, as in conventional hyperdocument information retrieval.
In this case, the combined node 704 output from the information extraction device 710 becomes the search target.
[0011]
Processing performed by the conventional information extraction apparatus 710 having the above configuration will be described below. First, the procedure of feature extraction will be described.
[0012]
FIG. 11 is a flowchart showing the flow of feature extraction processing.
In this flowchart, the processing is described for the origin node feature extraction unit 711 that has received the origin node; the secondary node feature extraction unit 713 that has received a secondary node performs the same processing. Each step is described below by step number (steps S801 to S806).
[0013]
[Node input step S801]
An origin node 701 is given, and its information source is input to the origin node feature extraction unit 711.
[Tag Removal Step S802]
Tags defined in a hyper document system description language (eg, HTML) are removed from the information source.
[Word Extraction Step S803]
Words are extracted from the remaining text using a known morphological analysis technique.
[0014]
[Important Word Extraction Step S804]
Only important words are extracted from the set of words obtained in step S803.
Here, an important word is a word that characterizes the contents of the information source; important words can be extracted, for example, by treating only nouns as important words.
[0015]
[Important word weighting step S805]
The important words obtained in step S804 are weighted in consideration of their appearance frequency and appearance position.
That is, the higher a word's appearance frequency, the higher its importance, and the closer its appearance position is to the beginning of the document, the higher its importance.
[Feature Profile Creation Step S806]
Finally, a list of pairs, each consisting of an important word and its weight, is created and used as the origin node feature profile 711a.
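Steps S801 to S806 can be illustrated with a rough Python sketch. Several parts are stand-ins labeled as assumptions: a regular-expression tokenizer replaces morphological analysis, a small stop-word set replaces the noun-only filter, and the weighting formula (one point per occurrence plus a bonus that shrinks toward the end of the document) is invented for illustration. The name `feature_profile` is hypothetical.

```python
import re
from collections import defaultdict

# Stand-in for a part-of-speech filter that would keep only nouns.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def feature_profile(source):
    """Return {important word: weight} for one node's information source (S801)."""
    text = re.sub(r"<[^>]+>", " ", source)            # S802: remove tags
    words = re.findall(r"[a-z0-9]+", text.lower())    # S803: crude tokenizer
    total = len(words)
    weights = defaultdict(float)
    for position, word in enumerate(words):
        if word in STOP_WORDS:                         # S804: keep important words only
            continue
        # S805: frequency (1 per occurrence) plus a position bonus near the top
        weights[word] += 1.0 + (total - position) / total
    return dict(weights)                               # S806: the feature profile
```

The same function would serve for secondary nodes, since the text states that both feature extraction units perform the same processing.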
[0016]
The single feature profile 711a of the origin node 701 obtained in this way is supplied to the similarity determination unit 714.
The origin node 701 is also supplied to the secondary node acquisition unit 712. The secondary node acquisition unit 712 searches the information source of the received node 701 for link information and acquires the link destination node as a secondary node 702.
For example, if the origin node is written in HTML, the URL (Uniform Resource Locator) in the region enclosed by anchor tags (<A> ... </A>) is extracted, and the document (secondary node) specified by that URL is acquired.
[0017]
The set of secondary nodes acquired by the secondary node acquisition unit 712 is supplied to the secondary node feature extraction unit 713.
Then, the secondary node feature extraction unit 713 executes the same processing as the flowchart shown in FIG. 11, and secondary node feature profiles 713a (plural) for the secondary nodes 702 and 703 are created.
[0018]
The feature profiles 713a of the secondary nodes are supplied to the similarity determination unit 714.
The similarity determination means 714 is supplied with the above-described origin node feature profile 711a and a plurality of secondary node feature profiles 713a.
[0019]
Next, a specific example of the flow of similarity processing executed by the similarity determination unit 714 shown in FIG. 10 will be described.
FIG. 12 is a flowchart showing a processing procedure of the similarity determination unit 714. The following similarity processing will be described sequentially.
[0020]
[Initialization Step S911]
The variable n is initialized as n = 1. Further, the number of secondary nodes acquired by the secondary node acquisition unit 712 is substituted into the variable m. Here, the secondary node feature profiles 713a are arranged in the order of 1 to m.
[0021]
[Get Feature Profile of Origin Node Step S912]
The feature profile 711a of the origin node is acquired.
[Is n larger than m? Step S913]
The values of n and m are compared. If n > m (S913 YES), the process ends; otherwise (S913 NO), the process proceeds to step S914.
[0022]
[Get Feature Profile of nth Secondary Node Step S914]
The feature profile 713a of the nth secondary node is acquired.
[Feature Profile Similarity Calculation Step S915]
The similarity between the origin node feature profile 711a obtained in step S912 and the secondary node feature profile 713a obtained in step S914 is calculated (similarity determination means 714). For the calculation of the similarity, a known vector inner product calculation method is used.
[0023]
[Is the similarity greater than the threshold? Step S916]
The similarity obtained in step S915 is compared with the threshold.
If similarity > threshold (YES in S916), the processing proceeds to step S917.
If the similarity is not greater than the threshold (NO in S916), the processing proceeds to step S918.
The threshold is a preset value, and its magnitude determines the allowable range of similarity.
When the similarity is calculated by the vector inner product method, a value of about 0.1, for example, is set as the threshold.
[0024]
[Select the nth secondary node as a synthesis candidate Step S917]
If the similarity is greater than the threshold, the nth secondary node is stored as a candidate node to be combined with the origin node.
[Add 1 to the value of n Step S918]
1 is added to the value of n, and the processing returns to step S913.
[0025]
In this way, each of the secondary node feature profiles 713a numbered 1 to m is judged as to whether it is similar to the origin node feature profile 711a.
The set of secondary nodes stored in step S917 as candidate nodes to be combined is supplied, together with the origin node, to the node synthesis unit 715.
Finally, the node synthesis unit 715 combines the origin node with all the secondary nodes judged to be similar to it to form a combined node 704.
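The loop of steps S911 to S918 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the cited device: it assumes each feature profile is a dictionary mapping a word to its importance weight, and the function names are hypothetical.

```python
def inner_product(profile_a, profile_b):
    """Inner product of two sparse profiles (word -> weight)."""
    return sum(w * profile_b.get(word, 0.0) for word, w in profile_a.items())


def select_synthesis_candidates(origin_profile, secondary_profiles, threshold=0.1):
    """Return the numbers (1..m) of the secondary nodes judged similar.

    The default threshold of 0.1 follows the example value given in the text.
    """
    candidates = []
    # S913 / S918: visit the profiles numbered 1 to m in order.
    for n, profile in enumerate(secondary_profiles, start=1):
        similarity = inner_product(origin_profile, profile)  # S915
        if similarity > threshold:                            # S916
            candidates.append(n)                              # S917
    return candidates
```

The returned numbers identify the secondary nodes that would be handed to the node synthesis unit together with the origin node.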
[0026]
If the combined node 704 obtained in this way is made the target of information search, then, for example, when searching for information similar to “concept A”, the node 701 can be obtained as a search result even if the node 701 alone is dissimilar to “concept A”, provided that the combined node formed from the node 701 and the secondary nodes is similar to “concept A”. The user who obtains the node 701 can then follow the links from the node 701 and access all of the information similar to the target “concept A”.
[0027]
In addition, when the origin node feature extraction unit 711 or the secondary node feature extraction unit 713 weights important words, characters emphasized by markup tags, such as those of HTML, can be treated as words of high importance. To do so, the order of the processes in FIG. 11 must be changed so that the process of step S802 is performed between step S805 and step S806.
[0028]
Furthermore, in the word extraction process of step S803, a tag that emphasizes characters and the characters enclosed by the tag are extracted together without being separated.
When a plurality of words are extracted from a region enclosed by emphasis tags, the tag information is attached to each word. This makes it possible to identify which words are emphasized when weighting important words.
[0029]
As described above, this conventional example generates, from each node of the hypertext, a feature profile with which the similarity of information can be compared and judged, and judges the similarity between pieces of information in response to the document reader's search request. Nodes are thereby combined into a combined node, and the combined node 704 is output, so that information is extracted in semantically coherent units.
[0030]
[Problems to be solved by the invention]
With the conventional method, which extracts information as a combined node by combining the origin node with all the secondary nodes judged to be similar to it at search time, suppose an improvement were added so that a plurality of hypertexts that change over time are stored. Even then, the storage capacity and the amount of calculation at information extraction would increase, reading would take time, and efficiency would be poor.
[0031]
Conventionally, the entire hypertext has been assumed to have a static configuration. However, hypertext placed on the Internet or the like is not a static document system. Each node is freely changed by a plurality of document authors, and links that specify nodes are freely added and deleted.
In particular, links, which are easy to embed in hypertext, play an important role when a node slanders another node by name or, conversely, when a node points out prejudiced or discriminatory statements in another node and requests their correction.
[0032]
Such links are often deleted or corrected through arbitrary changes by the document author of the link source or link destination, and in many cases the meaning of the link is lost.
In such cases, an information extraction device that assumes static hypertext is not effective. With the method that obtains results through the combined node described above, the chronological relationship of the topic is lost once a document author deletes or corrects the link portion.
[0033]
A specific example will be described.
Suppose that document manager B of a node B writes text slandering document manager A of a node A and publishes it as hypertext. When criticism of the slanderous text and a request for correction from document manager A reach document manager B, document manager B can tamper with part of the text and act as if nothing had happened, or can use the tampering to make document manager A appear to be a person making false accusations.
[0034]
In such a case, a conventional information extraction device (search engine) that stores only the latest hypertext for each node cannot track how the topic has changed, even if it records acquisition and update times, because the hypertext is not preserved along with its time-series changes.
[0035]
The present invention has been made to solve the above problems and has, in particular, a node analysis step, an optimum text fragment selection step, and a database step. Only the optimum part of each text is cut out, extracted, and stored in the database, and searches are run against that database. Even when the hypertext for a specific keyword is monitored continuously, this saves a large amount of storage compared with storing the full text of every text related by links. The object of the invention is thereby to allow even data that changes over time to be grasped quickly.
[0036]
[Means for Solving the Problems]
To solve the above problems,
the invention described in claim 1 provides
an information extraction method in which a computer extracts information from hypertext composed of a plurality of nodes, each consisting of text and links described in the text, with a secondary node path name indicating the link-destination secondary node written in each link so as to indicate the connection between nodes, the method comprising:
a node analysis step 11 of, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of that node and outputting an input table in which each secondary node path name is paired with link neighborhood text cut out from the text in the vicinity of the link;
a secondary node acquisition step 14 of accessing the secondary node and acquiring the text of the secondary node;
a text splitting step 15 of splitting the text of the secondary node into text fragments of a predetermined size;
a similarity calculation step 16 of calculating the similarity of each of the split text fragments to the link neighborhood text and outputting a similarity table in which each text fragment is paired with its similarity;
an optimum text fragment selection step 17 of selecting, from the group of text fragments, the text fragment with the highest similarity; and
a database step 13 of analyzing the input table sequentially, extracting for each secondary node the optimum text fragment selected in the optimum text fragment selection step, and building a database from it together with the current time, the node path name, the secondary node path name, and the link neighborhood text.
The invention described in claim 2 provides
the information extraction method described in claim 1, wherein
the similarity calculation step 16 divides each text into a group of morphemes, regards the morpheme group as a vector, and takes the cosine (cos θ) of the angle between the vectors corresponding to the respective texts as the similarity, so that the similarity lies in the range from 0 to 1 inclusive, and
the optimum text fragment selection step 17 selects, from the similarity table, the text fragment with the maximum similarity among the text fragments whose similarity falls within a predetermined range.
The invention described in claim 3 provides
an information retrieval method in which a computer retrieves information from hypertext composed of a plurality of nodes, each consisting of text and links described in the text, with a secondary node path name indicating the link-destination secondary node written in each link so as to indicate the connection between nodes, the method using an information extraction apparatus comprising:
node analysis means 11 for, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of that node and outputting an input table in which each secondary node path name is paired with link neighborhood text cut out from the text in the vicinity of the link;
secondary node acquisition means 14 for accessing the secondary node and acquiring the text of the secondary node;
text dividing means 15 for dividing the text of the secondary node into text fragments of a predetermined size;
similarity calculation means 16 for calculating the similarity of each of the divided text fragments to the link neighborhood text and outputting a similarity table in which each text fragment is paired with its similarity;
optimum text fragment selection means 17 for selecting, from the group of text fragments, the text fragment with the highest similarity; and
database means 13 for analyzing the input table sequentially, extracting for each secondary node the optimum text fragment selected by the optimum text fragment selection means, and building a database from it together with the current time, the node path name, the secondary node path name, and the link neighborhood text,
wherein a node path name to be searched is input to the information extraction apparatus, and the computer performs the information search by outputting the group of data entries, selected from the database means 13, whose node path name matches the node path name to be searched.
The invention described in claim 4 provides
the information retrieval method described in claim 3, wherein
the similarity calculation means 16 divides each text into a group of morphemes, regards the morpheme group as a vector, and takes the cosine (cos θ) of the angle between the vectors corresponding to the respective texts as the similarity, so that the similarity lies in the range from 0 to 1 inclusive, and
the optimum text fragment selection means 17 selects, from the similarity table, the text fragment with the maximum similarity among the text fragments whose similarity falls within a predetermined range.
The invention described in claim 5 provides
an information extraction computer program that extracts information from hypertext composed of a plurality of nodes, each consisting of text and links described in the text, with a secondary node path name indicating the link-destination secondary node written in each link so as to indicate the connection between nodes, the program causing a computer to execute:
a node analysis step 11 of, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of that node and outputting an input table in which each secondary node path name is paired with link neighborhood text cut out from the text in the vicinity of the link;
a secondary node acquisition step 14 of accessing the secondary node and acquiring the text of the secondary node;
a text splitting step 15 of splitting the text of the secondary node into text fragments of a predetermined size;
a similarity calculation step 16 of calculating the similarity of each of the split text fragments to the link neighborhood text and outputting a similarity table in which each text fragment is paired with its similarity;
an optimum text fragment selection step 17 of selecting, from the group of text fragments, the text fragment with the highest similarity; and
a database step 13 of analyzing the input table sequentially, extracting for each secondary node the optimum text fragment selected in the optimum text fragment selection step, and building a database from it together with the current time, the node path name, the secondary node path name, and the link neighborhood text.
The invention described in claim 6 provides
the information extraction computer program according to claim 5, wherein
the similarity calculation step 16 divides each text into a group of morphemes, regards the morpheme group as a vector, and takes the cosine (cos θ) of the angle between the vectors corresponding to the respective texts as the similarity, so that the similarity lies in the range from 0 to 1 inclusive, and
the optimum text fragment selection step 17 selects, from the similarity table, the text fragment with the maximum similarity among the text fragments whose similarity falls within a predetermined range.
[0037]
DETAILED DESCRIPTION OF THE INVENTION
A preferred embodiment of the information extraction apparatus will be described below with reference to the drawings.
The information extraction apparatus according to the present invention applies to hyperdocument systems that divide the text information they handle into several small units (nodes) and organize the units by associating them with one another, the information indicating such an association being called a link.
[0038]
FIG. 1 shows a block configuration of an embodiment of an information extraction apparatus (the invention according to claim 1) to which the information extraction method of the present invention is applied.
Blocks 11 to 17 of this embodiment of the information extraction apparatus are described below. When the (origin) node path name for a specific search keyword (a word, phrase, etc.) is input, the origin node analysis means 11 acquires the hypertext contained in the (origin) node indicated by that node path name.
[0039]
The hypertext here is a text file described in a markup language similar to HTML, and is a text in which plain text and an anchor tag representing a link are mixed.
A secondary node path name for indicating a link destination node is embedded in the anchor tag.
[0040]
The origin node analysis means 11 analyzes the links included in the hypertext and extracts each secondary node path name, together with the plain text around its anchor tag, as “link neighborhood text” according to a predetermined clipping rule.
For example, for Japanese text the clipping rule may be to cut out text until the full stop (.) has appeared three times; when clipping independently of grammar, a rule such as cutting out a specific number of bytes is adopted.
[0041]
As shown in FIG. 2, each secondary node path name and its corresponding link neighborhood text form one row of an “input table”.
Therefore, when the hypertext contains n anchor tags, the input table has n rows in total.
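The construction of the input table can be sketched as below. This is a minimal sketch under several assumptions that are not in the original: anchors are found with a simple regular expression that expects a double-quoted `href`, the clipping rule is a fixed number of characters (not bytes) on each side of the anchor, and the names `build_input_table` and `window` are hypothetical.

```python
import re

# Assumption: anchors look like <a ... href="PATH" ...>label</a>.
ANCHOR = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                    re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove markup so that only plain text remains."""
    return TAG.sub("", text)


def build_input_table(hypertext, window=40):
    """Return one (secondary node path name, link neighborhood text) row per anchor."""
    table = []
    for match in ANCHOR.finditer(hypertext):
        before = strip_tags(hypertext[:match.start()])[-window:]
        label = strip_tags(match.group(2))
        after = strip_tags(hypertext[match.end():])[:window]
        table.append((match.group(1), (before + label + after).strip()))
    return table
```

A hypertext with n anchor tags yields a table of n rows, matching the row count stated above.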
[0042]
As shown in FIG. 3, the database registration means 12 receives the n-row input table and generates m rows of data entries, which are described later.
[0043]
Next, a flowchart of the database registration means 12 is shown in FIG.
Each step (steps S101 to S110) of the database registration unit 12 will be described.
[0044]
First, a variable m used for row operations on the data entries and a variable i used for row operations on the input table are initialized (step S101).
A branch ends the processing once all rows of the input table have been scanned (step S102), so the loop runs n times.
[0045]
The i-th row of the input table is read (step S103).
The secondary node path name in that row is supplied to the secondary node acquisition means 14, which is then activated (step S104).
[0046]
Next, the link neighborhood text of the i-th row is supplied to the similarity calculation means 16 (step S105).
Once this is done, the optimum text fragment t is generated by a service routine described later and is acquired (step S106).
[0047]
It is determined whether or not the optimum text fragment t is empty text (step S107). If it is empty text, the writing of the data entry is skipped.
If it is not empty text, the data entry m-th line is written (step S108), and m is incremented by 1 (step S109).
[0048]
In either case, the variable i for the row operation is incremented by 1 (step S110),
and the processing returns to the branch (step S102).
The details of the control of the database registration unit 12 have been described above.
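The control flow of steps S101 to S110 can be sketched as follows. This is a hedged sketch, not the patented implementation: the service routines are passed in as plain functions (`fetch`, `split`, `similarity`, `select` are hypothetical parameter names), and each data entry is represented as a dictionary whose key names are assumptions.

```python
from datetime import datetime


def register_entries(origin_path, input_table, fetch, split, similarity, select,
                     now=None):
    """Build data entries from an input table of (secondary path, link text) rows."""
    if now is None:
        now = datetime.now()
    entries = []                                    # S101: no data entry rows yet
    for secondary_path, link_text in input_table:   # S102, S103, S110
        pieces = split(fetch(secondary_path))       # S104: acquire and divide the text
        table = [(piece, similarity(link_text, piece)) for piece in pieces]  # S105
        best = select(table)                        # S106: optimum text fragment t
        if best == "":                              # S107: empty text, skip the write
            continue
        entries.append({                            # S108, S109
            "time": now,
            "origin_path": origin_path,
            "secondary_path": secondary_path,
            "link_text": link_text,
            "optimum_fragment": best,
        })
    return entries
```

Because rows with an empty optimum text fragment are skipped, the number of entries m can be smaller than the number of input rows n.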
[0049]
Next, data in the data entry will be described.
The database registration means 12 sets the time at which the input table was input to the database registration means 12 as the current times 1 to m (see FIG. 3).
The origin node path name corresponding to the input table is set as is for the origin node path names 1 to m (see FIG. 3).
[0050]
The secondary node path names from the corresponding rows of the input table are set as the secondary node path names 1 to m (see FIG. 3).
Likewise, the link neighborhood texts from the corresponding rows of the input table are set as the link neighborhood texts 1 to m (see FIG. 3).
[0051]
For the optimum text fragments 1 to m, the hypertext indicated by each secondary node path name is processed by the optimum text fragment selecting means 17, and the resulting text is set (see FIG. 3).
[0052]
The database means 13 provides the basic functions of a database: it stores the data entries output from the database registration means 12 and makes them easy to search.
Zero or more data entries can be output using the origin node path name as a key (see FIG. 3).
[0053]
Next, a service routine (secondary node acquisition unit 14 to optimum text piece selection unit 17) related to the database registration unit 12 will be described below.
The secondary node acquisition means 14 receives the secondary node path name output from the database registration means 12, acquires the hypertext from the node indicated by that path name, and outputs it.
For example, if the origin node is written in HTML, the URL (Uniform Resource Locator) in the region enclosed by the anchor tags (<A> ... </A>) is extracted, and the document (secondary node) specified by that URL is acquired and output.
[0054]
The text dividing means 15 removes markup such as links from the hypertext to convert it into plain text, and then divides the whole into q text pieces.
[0055]
Possible division algorithms include:
a method of dividing based on the paragraph tags of the markup,
a method of dividing every predetermined number of bytes, and,
if the plain text is written in Japanese, a method of dividing at every third full stop (.).
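The three division methods listed above might be sketched as follows. The function names, the default sizes, the UTF-8 encoding, and the choice of "。" as the Japanese full stop are all assumptions made for illustration; the byte-based splitter also keeps multi-byte characters whole, which the original does not specify.

```python
import re


def split_by_paragraph_tags(hypertext):
    """Divide at <p> / </p> tags, then strip any remaining markup."""
    parts = re.split(r"</?p\b[^>]*>", hypertext, flags=re.IGNORECASE)
    pieces = [re.sub(r"<[^>]+>", "", part).strip() for part in parts]
    return [piece for piece in pieces if piece]


def split_by_bytes(plain_text, max_bytes=200, encoding="utf-8"):
    """Divide so that each piece encodes to at most max_bytes bytes."""
    pieces, current, size = [], [], 0
    for ch in plain_text:
        n = len(ch.encode(encoding))
        if current and size + n > max_bytes:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        pieces.append("".join(current))
    return pieces


def split_by_sentences(plain_text, sentences_per_piece=3, mark="。"):
    """Divide after every third occurrence of the sentence-ending mark."""
    m = re.escape(mark)
    found = re.findall(f"[^{m}]*{m}|[^{m}]+$", plain_text)
    sentences = [s for s in found if s.strip()]
    return ["".join(sentences[i:i + sentences_per_piece])
            for i in range(0, len(sentences), sentences_per_piece)]
```

Whichever method is used, the number of pieces returned corresponds to q in the description.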
[0056]
The similarity calculation means 16 receives one link neighborhood text from the database registration means 12 and the q text pieces produced by the text dividing means 15, and outputs a similarity table of q rows, each pairing a text piece with its similarity (see FIG. 5).
Similarities 1 to q are obtained by calculating the similarity of each of the text pieces 1 to q to the link neighborhood text.
[0057]
The optimum text fragment selecting means 17 compares all q similarities and selects and outputs exactly one optimum text fragment from the q text fragments.
A flowchart of the optimum text piece selecting means 17 is shown in FIG.
[0058]
Each step (S201 to S207) of the optimum text piece selecting means 17 is described in order below.
First, the following are initialized (step S201):
a row counter variable i (= 1) for scanning the similarity table,
a variable t (= empty text) holding the optimum text fragment, and
the maximum similarity r (= 0).
[0059]
The total number of rows q in the similarity table is compared with the row counter variable i; if i is larger than q, the table is considered to have been scanned to the end, and the processing ends (step S202).
On ending, the optimum text fragment t is written to the database registration means 12 (step S203).
[0060]
If scanning of the table is not yet finished, the i-th row is read from the similarity table, and its contents are set as the current text fragment t(i) and the current similarity r(i) (step S204).
[0061]
The current similarity r(i) is compared with the maximum similarity r (step S205).
Only if the current similarity r(i) is greater than the maximum similarity r
are the assignments maximum similarity r = r(i) and optimum text fragment t = t(i) executed (step S206).
To advance to the next row, i is incremented by 1 (step S207).
[0062]
The details of the control of the optimum text piece selecting means 17 have been described above.
If there is no suitable text fragment, or if the similarity table has zero rows, empty text is output; otherwise, the text fragment with the highest of the q similarities is output.
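Steps S201 to S207 amount to a single pass that keeps the running maximum. A minimal sketch, assuming the similarity table is a list of (text fragment, similarity) pairs and using a hypothetical function name:

```python
def select_optimum_fragment(similarity_table):
    """Return the fragment with the highest similarity, or "" if none qualifies."""
    best_text, best_similarity = "", 0.0           # S201: t = empty text, r = 0
    for text, similarity in similarity_table:      # S202, S204, S207
        if similarity > best_similarity:           # S205
            best_text, best_similarity = text, similarity  # S206
    return best_text                               # S203
```

Because the maximum starts at 0 and the comparison is strict, an empty table or a table whose similarities are all 0 yields empty text, which matches the behavior described above.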
[0063]
Next, a block for searching for a data entry stored in the database will be described.
A flowchart of the related node search means 18 is shown in FIG.
Each step (S401 to S408) will be described below.
[0064]
First, the related node search means 18 prompts the document reader to input a search time, that is, a specified date and time (step S401).
Next, it prompts for input of a search node path name (step S402).
[0065]
Next, the database means 13 is searched using the search node path name as a key (step S403).
Within the data entries, the primary key is the origin node path name, and the subkey is the search time (the specified date and time).
[0066]
The database query (combination of conditions) at this time is as follows.
(origin node path name = search node path name)
AND (current time > search time)
In other words, the data entries of the node the user is searching for that are newer than the search time are selected.
[0067]
Next, the obtained data entries are sorted in chronological order using the current time as a key (step S404).
[0068]
The data entries are thereby ordered from oldest to newest.
The five items contained in each data entry,
namely the current time, the origin node path name, the link neighborhood text, the secondary node path name, and the optimum text fragment, are rendered for presentation to the document reader (step S405).
[0069]
When rendering, buttons, link anchors, or the like are displayed so that the secondary node path name portion can be selected as new interactive input.
The document reader can select a secondary node path name according to his or her own interest (step S406).
[0070]
When a secondary node path name is selected (step S407),
it can be set as the search node path name for the next search, and the database search can be repeated (step S408).
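The query and sort of steps S403 and S404 can be illustrated with an in-memory list standing in for the database means 13. This is only a sketch: the dictionary keys (`time`, `origin_path`, and so on) and the function name are assumptions, and a real system would presumably issue the equivalent query to its database.

```python
from datetime import datetime


def search_related(entries, search_path, search_time):
    """Select entries for search_path newer than search_time, oldest first."""
    hits = [entry for entry in entries
            if entry["origin_path"] == search_path    # origin = search node
            and entry["time"] > search_time]          # current time > search time
    return sorted(hits, key=lambda entry: entry["time"])  # S404: old to new


entries = [
    {"time": datetime(2002, 2, 3), "origin_path": "a.html",
     "secondary_path": "c.html", "optimum_fragment": "later text"},
    {"time": datetime(2002, 2, 1), "origin_path": "a.html",
     "secondary_path": "b.html", "optimum_fragment": "earlier text"},
    {"time": datetime(2002, 2, 2), "origin_path": "x.html",
     "secondary_path": "y.html", "optimum_fragment": "other node"},
]
results = search_related(entries, "a.html", datetime(2002, 1, 1))
```

Selecting a `secondary_path` from one of the results and passing it back in as `search_path` corresponds to the repeated search of steps S406 to S408.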
[0071]
In the related node search means 18 of the present embodiment, the search node path name is input in step S402. Alternatively, a morpheme index (a database holding a list of morphemes and the search node path names that contain each morpheme) may be prepared, so that the user inputs a suitable keyword instead of the search node path name itself and the search node path name is obtained from the morpheme index.
The above is the information extraction method according to claim 1.
[0072]
Next, an embodiment of an information extraction apparatus to which the information extraction method of the present invention according to claim 2 is applied will be described.
It differs from the information extraction apparatus according to claim 1 in the similarity calculation means 16 and the optimum text piece selection means 17.
[0073]
The similarity calculation means 16 uses a vector space model as a similarity calculation algorithm.
The vector space model is a technique that obtains the set of unique morphemes from all the texts to be compared, regards each text as a row vector listing the presence or absence of each morpheme along the row direction, obtains the cosine (cos θ) of the angle between the vectors, and uses this as the similarity.
[0074]
Suppose that the link neighborhood text and all the text pieces 1 to q together contain p unique morphemes. The vector Xr corresponding to the link neighborhood text and the vector Xs (1 ≦ s ≦ q) corresponding to the s-th text piece can then be expressed as Expressions 1 and 2 below, using variables X that represent the presence (1) or absence (0) of each morpheme.
[0075]
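[Expression 1]
$X_r = [X_{r1}, X_{r2}, \ldots, X_{rp}]$
[Expression 2]
$X_s = [X_{s1}, X_{s2}, \ldots, X_{sp}], \quad X_{rk}, X_{sk} \in \{0, 1\}$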
[0076]
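At this time, cos θ, the similarity between the two texts, is obtained from the vector inner product formula as Equation 3 below.
[0077]
[Equation 3]
$\cos\theta = \dfrac{X_r \cdot X_s'}{P \times Q}, \quad P = \sqrt{\sum_{k=1}^{p} X_{rk}^2}, \quad Q = \sqrt{\sum_{k=1}^{p} X_{sk}^2}$
Here $X_s'$ denotes the transpose of the row vector $X_s$, and $X_{rk}$ and $X_{sk}$ denote the k-th elements of $X_r$ and $X_s$.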
Therefore, the similarity can take a value in the range of 0 or more and 1 or less depending on the characteristic of cos θ.
When the degree of similarity is 1, it means perfect match, and when it is 0, it means complete independence.
The calculation of the right side of Equation 3 is illustrated below.
Taking the row vectors A = [1 0 1] and B = [1 0 0] as Xr and Xs,
the inner product Xr · Xs′ is
A11 × B11 + A12 × B12 + A13 × B13 = 1 × 1 + 0 × 0 + 1 × 0 = 1.
[0078]
P and Q are the norms of the vectors, each obtained by squaring every element of the vector, summing the squares, and taking the square root:
P is sqrt (1 × 1 + 0 × 0 + 1 × 1) = 1.414,
Q is sqrt (1 × 1 + 0 × 0 + 0 × 0) = 1.
[0079]
Therefore, the value of cos θ is 1 / (1.414 × 1) = 0.707, from which the angle θ = 45 degrees is obtained at the same time.
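The arithmetic of this worked example can be checked with a few lines of code. The sketch below uses plain Python lists for the vectors A and B; the function name `cosine_similarity` and the choice to return 0 for a zero vector are assumptions, since the original does not discuss that case.

```python
import math


def cosine_similarity(xr, xs):
    """cos(theta) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(xr, xs))      # inner product Xr . Xs'
    p = math.sqrt(sum(a * a for a in xr))         # norm P
    q = math.sqrt(sum(b * b for b in xs))         # norm Q
    if p == 0 or q == 0:
        return 0.0                                # assumption: no morphemes, no similarity
    return dot / (p * q)


A = [1, 0, 1]
B = [1, 0, 0]
cos_theta = cosine_similarity(A, B)
angle = math.degrees(math.acos(cos_theta))
print(round(cos_theta, 3), round(angle, 1))  # 0.707 45.0
```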
[0080]
The above explanation uses vectors of very low dimension. In reality, in an application such as an Internet search engine, the number of hypertexts serving as samples ranges from hundreds of thousands to billions.
In addition, the number of dimensions equals the number of morphemes extracted from all of them or, more precisely, the number of entries registered in the morpheme dictionary (tens of thousands of dimensions in the case of Japanese).
[0081]
In general, the similarity is used as an index in order to deal with the enormous amount of search results and present accurate results to the user.
For example, a single search presents the page that best matches the query first, the next most similar page second, and so on. In many cases, m (n × n) sample matrices are obtained, and (n × n) similarity matrices are obtained.
[0082]
The similarity matrix is a symmetric matrix in which the vertical and horizontal numbers are the sample numbers, and the elements in the middle are similarities. For example, if it is desired to know the similarity between the text of sample 120 and the text of sample 565, the element (120, 565) is taken out.
Since it is a symmetric matrix, the same applies to the elements (565, 120).
[0083]
The theory of similarity is as described above, and various refinements have been devised. In terms of implementation, computing a sample matrix of (tens of thousands × hundreds of millions) on a single computer takes time. It is therefore efficient to divide the samples into those that share morphemes and those that share none at all, and to use this independence of the matrix to distribute each group of mutually independent vectors across a plurality of computers.
[0084]
In addition, to obtain a more accurate similarity, the variable may be given a weight instead of simply 1 or 0 for the presence or absence of a morpheme: the distribution of morphemes is examined, the weight of morphemes that appear frequently in any text, such as particles and auxiliary verbs, is reduced, and conversely the weight of morphemes that appear only in specific texts is increased.
A representative weighting method is the TF/IDF method.
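As an illustration of such weighting, the sketch below computes one common TF/IDF variant, raw term count multiplied by log(N / document frequency). The original does not say which variant is intended, so the formula, the function name, and the representation of each text as a list of morphemes are all assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """documents: list of morpheme lists. Returns (vocabulary, weighted vectors)."""
    n_docs = len(documents)
    # Document frequency: in how many texts each morpheme appears.
    df = Counter(m for doc in documents for m in set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append([tf[m] * math.log(n_docs / df[m]) for m in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tf_idf_vectors([
    ["the", "mandarin"],
    ["the", "wenzhou"],
])
```

With this variant, a morpheme that appears in every text (here "the") receives weight 0, while a morpheme that appears in only one text keeps a positive weight, which is the effect described above.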
[0085]
The vector space model will be specifically described with reference to FIG.
Here, the link neighborhood text and the text pieces 1 to 3 divided into three with q = 3 are considered, and the text piece most similar to the link neighborhood text is determined using a vector space model.
[0086]
For example, suppose that the link neighborhood text contains the morphemes “mandarin orange” and “Wenzhou”, text piece 1 contains the morpheme “mandarin orange”, text piece 2 contains the morpheme “Wenzhou”, and text piece 3 contains the morphemes “mandarin orange” and “Wenzhou”.
In this case there are two unique morphemes, “mandarin orange” and “Wenzhou”, so all the vectors can be expressed in two dimensions.
[0087]
Let the link neighborhood text vector be Xr and the text piece vectors be Xs (1 ≦ s ≦ 3), and take the “mandarin orange” axis as the X axis and the “Wenzhou” axis as the Y axis. The vectors can then be drawn as a two-dimensional graph, as shown in the figure.
First, the vectors Xr and X1 are compared. The two share only the one morpheme “mandarin orange”, the angle between them is θ1 = 45 degrees, and cos θ1 is approximately 0.7.
[0088]
Next, the vectors Xr and X2 are compared. The two share only the one morpheme “Wenzhou”, the angle between them is θ2 = 45 degrees, and cos θ2 is approximately 0.7.
Next, the vectors Xr and X3 are compared. Both contain the morphemes “Wenzhou” and “mandarin orange” and so match completely; the angle between them is θ3 = 0 degrees, and cos θ3 is 1 (perfect match, maximum similarity).
[0089]
Comparing the values of cos θ1 to cos θ3 shows that cos θ3 is the largest; that is, the text piece most similar to the link neighborhood text is text piece 3.
[0090]
On the other hand, paying attention to the vectors X1 and X2, both are vectors that do not contain the same morpheme. The vector angle of both is θ4 = 90 degrees, and cos θ4 is 0 (completely independent / minimum similarity).
As described above, the vector space model is a method for calculating the similarity from the inclination between the vectors.
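The two-dimensional example can be reproduced with presence/absence vectors. This is a sketch under the assumption that each text is already reduced to a set of morphemes; the helper names `to_vector` and `cosine` are hypothetical.

```python
import math


def cosine(u, v):
    """cos(theta) between two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def to_vector(morphemes, axes):
    """1 where the morpheme is present, 0 where it is absent."""
    return [1 if axis in morphemes else 0 for axis in axes]


axes = ["mandarin orange", "Wenzhou"]          # X axis, Y axis
link_text = {"mandarin orange", "Wenzhou"}
pieces = [{"mandarin orange"}, {"Wenzhou"}, {"mandarin orange", "Wenzhou"}]

xr = to_vector(link_text, axes)
similarities = [cosine(xr, to_vector(piece, axes)) for piece in pieces]
# Number (1-based) of the text piece most similar to the link neighborhood text.
best_piece = max(range(len(pieces)), key=similarities.__getitem__) + 1
# Text pieces 1 and 2 share no morpheme, so their vectors are orthogonal.
independent = cosine(to_vector(pieces[0], axes), to_vector(pieces[1], axes))
```

Here `best_piece` is 3, the first two similarities are about 0.7, and `independent` is 0, in line with the values of cos θ1 to cos θ4 given in the text.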
[0091]
cos θ is used instead of the angle itself because, when all vector elements are non-negative, cos θ always lies in the range 0 to 1, which makes it well suited to quantifying similarities so that they can be compared with one another.
[0092]
In similarity calculation with the vector space model, the model is often combined with the TF/IDF method, which weights the variable X according to the appearance frequency of each morpheme, or with a method that checks the part of speech of each morpheme and builds the vectors only from nouns and proper nouns, which contribute most to the context.
[0093]
Next, the optimum text piece selecting means 17 described in claim 2 will be described.
The optimum text piece selecting means 17 has two predetermined threshold values a1 and a2.
This is used in order to divide the similarity normalized to 0 or more and 1 or less by cos θ by the similarity calculation means 16 into the following three regions.
[0094]
Registration rejection area: 0 ≦ similarity < threshold a1
Registerable area: threshold a1 ≦ similarity < threshold a2
Candidate exclusion area: threshold a2 ≦ similarity ≦ 1
[0095]
The registration rejection area means an area that has a remarkably low degree of similarity and is meaningless even if it is adopted as a text fragment related to the link neighborhood text.
Further, the candidate exclusion area means an area having a very high similarity and a high possibility that the link neighborhood text itself is included as a text fragment.
[0096]
The candidate exclusion area is provided to avoid extracting full-text citations between two texts as candidates, and to exclude text fragments near boilerplate links, such as navigation bars, that are often included across a plurality of structured hypertexts.
[0097]
On the other hand, the registerable area is the area suitable for a text deeply related to the link neighborhood text.
Here, an example of the threshold values a1 and a2 is shown.
For example, assuming that the total number of morphemes is p = 100 and that the link neighborhood text and each text fragment contain 10 morphemes on average, a threshold value a1 = 0.448 and a threshold value a2 = 0.8945 are appropriate.
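The division into three areas can be sketched as follows, using the example threshold values given above as defaults. The function name `classify_similarity` and the returned labels are hypothetical names introduced for illustration.

```python
A1 = 0.448    # example threshold value a1 from the text
A2 = 0.8945   # example threshold value a2 from the text

def classify_similarity(similarity, a1=A1, a2=A2):
    """Return the area into which a similarity in the range 0 to 1 falls."""
    if similarity < a1:
        return "registration_rejection"   # 0 <= similarity < a1
    if similarity < a2:
        return "registerable"             # a1 <= similarity < a2
    return "candidate_exclusion"          # a2 <= similarity <= 1

areas = [classify_similarity(s) for s in (0.10, 0.448, 0.70, 0.8945, 1.0)]
```

Note that the lower bound of each area is inclusive, so a similarity exactly equal to a1 is registerable and a similarity exactly equal to a2 is excluded.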
[0098]
When all q similarities belong to the registration rejection area or the candidate exclusion area, the optimum text piece selecting means 17 outputs an empty text.
Otherwise, the largest of the similarities that fall in the registerable area is adopted, and the corresponding text piece is output as the optimum text piece.
[0099]
FIG. 8 shows a flowchart of the optimum text piece selecting means 17 according to claim 2. Its steps (steps S301 to S308) are described below.
Steps S301 to S307 in FIG. 8 correspond to steps S201 to S207 in FIG. 6, and their description is omitted.
[0100]
The current similarity r(i) is compared against the threshold value a1 and the threshold value a2 (step S308).
If it falls in the registerable area (Yes at step S308), the process proceeds to step S305.
If it falls outside the registerable area (No at step S308), the comparison with the maximum similarity r is skipped and the process proceeds to step S307.
The above is the information extraction apparatus according to claim 2.
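The selection with the range check of step S308 can be sketched as the following loop. This is an illustrative reading of the description, not the flowchart of FIG. 8 itself; the function name `select_optimum_text_piece`, the representation of the similarity table as a list of pairs, and the sample values are assumptions.

```python
def select_optimum_text_piece(similarity_table, a1=0.448, a2=0.8945):
    """similarity_table: list of (text_piece, similarity) pairs.
    Returns the registerable piece with the largest similarity,
    or an empty text if no piece is registerable."""
    best_piece, best_similarity = "", -1.0   # empty text is the default output
    for piece, similarity in similarity_table:
        # Range check (step S308): pieces outside the registerable area
        # skip the comparison with the maximum similarity r.
        if not (a1 <= similarity < a2):
            continue
        # Comparison with the running maximum similarity r.
        if similarity > best_similarity:
            best_piece, best_similarity = piece, similarity
    return best_piece

# Hypothetical similarity table.
table = [
    ("navigation bar text", 0.97),   # candidate exclusion area
    ("unrelated text", 0.10),        # registration rejection area
    ("related text A", 0.55),
    ("related text B", 0.71),
]
chosen = select_optimum_text_piece(table)
```

Here the piece with similarity 0.97 is never considered, even though it has the largest value overall, so `chosen` is the piece with similarity 0.71.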
[0101]
Next, an information extraction computer program according to the present invention (inventions according to claims 5 and 6) will be described.
The information extraction of the present invention can be realized by having a computer read a computer program that performs the information extraction process, either directly or from a recording medium on which the computer program is recorded.
[0102]
When each component means in the block diagram of the information extraction apparatus is rewritten as a step that a computer is caused to execute, a computer-executable computer program comprising the steps of claim 5 and claim 6 can be provided (see FIGS. 2 to 9).
[0103]
As shown in FIGS. 1 to 7, the information extraction computer program of the present invention (invention according to claim 5) is an information extraction computer program that extracts information on data related to a specific search keyword from hypertext composed of nodes, which are units of information, and links between the nodes. The program causes a computer to execute:
an origin node analysis step 11 of, when an origin node path name indicating the origin node is input for the specific search keyword, analyzing the text and the links included in the origin node, and outputting, as an input table, pairs of a secondary node path name indicating a secondary node and the link neighborhood text;
a secondary node acquisition step 14 of accessing the secondary node and acquiring the text included in the secondary node;
a text dividing step 15 of dividing the text into a group of text pieces of a predetermined size;
a similarity calculation step 16 of calculating the similarity between the link neighborhood text derived from the origin node and each text fragment of the text fragment group derived from the secondary node path name, and outputting a similarity table in which each text fragment is paired with its similarity;
an optimum text fragment selection step 17 of analyzing the similarity table and selecting, from the text fragment group, the optimum text fragment suited to the link neighborhood text;
a database registration step 12 of sequentially analyzing the input table, invoking the optimum text fragment selection step 17 for each secondary node, and registering the obtained optimum text fragment in a database together with the current time; and
a database step 13 of storing, as data entries in the database, the current time, the origin node path name, the secondary node path name, the link neighborhood text, and the optimum text fragment.
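How steps 12 to 17 fit together can be sketched as below. This is a simplified illustration under several assumptions: the input table of step 11 is supplied directly rather than parsed from an origin node, whitespace splitting stands in for morphological analysis, secondary nodes are read from an in-memory dictionary instead of a network, the database is a plain list, and an empty optimum text fragment is still registered. All names (`DataEntry`, `extract`, `fetch`, `piece_size`) are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DataEntry:
    time: datetime
    origin_path: str
    secondary_path: str
    link_neighborhood_text: str
    optimum_text_fragment: str

def tokenize(text):
    # Stand-in for morphological analysis (assumption).
    return text.lower().split()

def cosine(a, b):
    va, vb = Counter(a), Counter(b)
    dot = sum(c * vb[m] for m, c in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def divide_text(text, size):
    # Text pieces of a predetermined size, measured here in tokens.
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def extract(origin_path, input_table, fetch, now,
            piece_size=6, a1=0.448, a2=0.8945):
    database = []
    for secondary_path, link_text in input_table:      # registration step 12
        text = fetch(secondary_path)                    # acquisition step 14
        pieces = divide_text(text, piece_size)          # dividing step 15
        table = [(p, cosine(tokenize(link_text), tokenize(p)))
                 for p in pieces]                       # similarity step 16
        in_range = [(p, s) for p, s in table if a1 <= s < a2]
        best = max(in_range, key=lambda ps: ps[1])[0] if in_range else ""  # step 17
        database.append(                                # database step 13
            DataEntry(now, origin_path, secondary_path, link_text, best))
    return database

# Hypothetical secondary node and input table.
nodes = {
    "/b.html": "home news contact about sitemap login "
               "new camera lens review with sample "
               "photos weather forecast for the coming week",
}
entries = extract("/a.html", [("/b.html", "camera lens review")],
                  nodes.get, datetime(2002, 2, 20))
```

In this example only the second text piece shares morphemes with the link neighborhood text, and its similarity falls in the registerable area, so it is stored as the optimum text fragment.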
[0104]
Further, the information extraction computer program of the present invention (the invention according to claim 6) is the above-described information extraction computer program as shown in FIG. 1, FIG. 4, FIG. 8, and FIG.
The similarity calculation step 16 divides each text into a morpheme group, regards the morpheme group as a vector, and takes the cosine of the angle θ between the vectors corresponding to the respective texts as the similarity, so that the similarity is obtained within the range of 0 to 1.
The optimum text fragment selection step 17 compares the similarities in the similarity table and selects the text fragment having the maximum similarity among the text fragments whose similarity lies within a predetermined range (threshold value a1 or more and less than threshold value a2).
[0105]
Such an information extraction computer program may be executed by a general-purpose computer or a PDA (Personal Digital Assistant), or may be embedded in an LSI and executed there.
[0106]
Furthermore, the information extraction computer program may be supplied on a recording medium such as a DVD-ROM, or may be supplied over a network such as the Internet.
[0107]
The effects of the configuration of the embodiment described above are summarized as follows. This embodiment has database means for storing, as data entries, the current time, the origin node path name, the secondary node path name, the link neighborhood text, and the optimum text fragment, and related node search means that, when a search time and a search node path name are input, obtains and outputs the group of data entries that are later than the search time and whose origin node path name matches the search node path name. Links newly added to the target hypertext can therefore be saved in the database, and the document reader can easily search, at any time, for the existence of a link, the time when the link was added, the text near the link, and the hypertext at the link destination.
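The database means and the related node search means can be sketched with an in-memory SQLite database from the Python standard library. The table layout, the function names, the ISO 8601 time strings, and the choice to treat the search time as an inclusive lower bound are assumptions made for illustration; the specification does not prescribe a storage format.

```python
import sqlite3

def create_database():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        " time TEXT, origin_path TEXT, secondary_path TEXT,"
        " link_neighborhood_text TEXT, optimum_text_fragment TEXT)"
    )
    return conn

def register(conn, entry):
    # entry: (time, origin_path, secondary_path, link_text, fragment)
    conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?)", entry)

def search_related_nodes(conn, search_time, search_node_path):
    """Data entries at or after search_time whose origin node path name
    matches the search node path name, oldest first."""
    cursor = conn.execute(
        "SELECT * FROM entries"
        " WHERE origin_path = ? AND time >= ? ORDER BY time",
        (search_node_path, search_time),
    )
    return cursor.fetchall()

# Hypothetical data entries.
conn = create_database()
register(conn, ("2002-02-01T00:00:00", "/a.html", "/b.html",
                "old link text", "old fragment"))
register(conn, ("2002-02-20T00:00:00", "/a.html", "/c.html",
                "new link text", "new fragment"))
register(conn, ("2002-02-21T00:00:00", "/x.html", "/y.html",
                "other link text", "other fragment"))
rows = search_related_nodes(conn, "2002-02-10T00:00:00", "/a.html")
```

Only the entry for `/c.html` is returned: the entry for `/b.html` predates the search time, and the entry for `/y.html` belongs to a different origin node. Comparing times as strings works here because all values share the same ISO 8601 format.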
[0108]
According to the present embodiment, because the optimum text fragment selection means cuts out only the part of the text related to the link and stores it in the database, storage area can be saved when continuously monitoring a specific hypertext, compared with a method that saves the entire hypertext at each acquisition time.
[0109]
According to this embodiment, when the hypertext to be monitored requires attention because it may carry out attacks such as slander by linking to other nodes, a document reader who runs the present invention periodically can quickly grasp the intent of each link and detect slander and the like without reading the full text of the monitored hypertext.
[0110]
According to the present embodiment, which has database means for storing, as data entries, the current time, the origin node path name, the secondary node path name, the link neighborhood text, and the optimum text fragment, and related node search means that, when a search time and a search node path name are input, obtains and outputs the group of data entries that are later than the search time and whose origin node path name matches the search node path name, a document reader who runs the present invention periodically can store in the database, together with the time, concrete evidence of when links such as slanderous ones were added and deleted.
[0111]
In this embodiment, because the optimum text fragment selection means is provided, text that is extremely similar between two hypertexts is not registered in the database. Boilerplate text such as a navigation bar in the hypertext is thus eliminated in advance and only text with changes useful to the document reader is provided, which reduces the need for the document reader to follow links and saves labor.
[0112]
This embodiment has database means for storing, as data entries, the current time, the origin node path name, the secondary node path name, the link neighborhood text, and the optimum text fragment, and related node search means that, when a search time and a search node path name are input, obtains and outputs the group of data entries that are later than the search time and whose origin node path name matches the search node path name. Suppose, therefore, that the origin node carried out slander at a certain time. When checking how the damage has propagated to the secondary nodes, only the links that appeared after a search time specified in advance by the document reader can be extracted. Likewise, when the links of a secondary node are displayed, only the links that appeared after the search time are extracted, so that evidence that the secondary node has in the meantime, for example, protested against the origin node is easy to find. This reduces the need for the document reader to follow links and saves labor.
[0113]
[Effects of the invention]
As described above, the invention described in claim 1 or claim 5 has a node analysis step, an optimum text fragment selection step, and a database step. The optimum text fragment selection step cuts out only the most appropriate part of the text of the secondary node related to the link, and the database step saves it. Therefore, even when the hypertext for a specific keyword is continuously monitored, the present invention saves storage area and allows the content of the data to be understood in a shorter time, compared with saving the full text containing the link-related text at each acquisition time.
[0114]
According to the invention described in claim 1 or claim 5, if links newly added to the hypertext related to the specific search keyword being monitored are stored in the database means, the user can search, at any desired timing, for the existence of a link, the time when the link was added, the text near the link, and the hypertext at the link destination.
[0115]
Further, according to the invention described in claim 2 or claim 6, text that is extremely similar between two hypertexts is not registered in the database. Boilerplate text such as a navigation bar in the hypertext is thus eliminated in advance and only text with changes useful to the user is provided, which reduces the need for the user to follow links and saves labor.
[0116]
Further, according to the invention of the information search method described in claim 3, if links newly added to the hypertext related to the specific search keyword being monitored are stored in the database means, the user can search, at any desired timing, for the existence of a link, the time when the link was added, the text near the link, and the hypertext at the link destination.
[0117]
Further, according to the information retrieval method of the present invention, it is possible to retrieve only the optimum partial text of the secondary node corresponding to the search node, so that the retrieved information can be understood more quickly.
[0118]
According to the invention described in claim 5, the processing of the information extraction apparatus is described in a computer program, and the above processing can be realized by having a computer execute the program.
[Brief description of the drawings]
FIG. 1 is a diagram showing a block configuration of an embodiment of an information extraction apparatus to which an information extraction method according to claim 1 of the present invention is applied.
FIG. 2 shows a data structure of an input table output by an embodiment of a starting point node analyzing means constituting an information extracting apparatus to which the invention according to claim 1 is applied.
FIG. 3 shows a data structure of a data entry output by an embodiment of database registration means constituting an information extracting apparatus to which the invention according to claim 1 is applied.
FIG. 4 is a flowchart showing the operation of an embodiment of database registration means constituting an information extraction apparatus to which the invention according to claim 1 is applied.
FIG. 5 shows a data structure of a similarity table output by an embodiment of a similarity calculation means constituting an information extraction apparatus to which the invention according to claim 1 is applied.
FIG. 6 is a flowchart showing the operation of an embodiment of the optimum text piece selecting means constituting the information extracting apparatus to which the invention according to claim 1 is applied.
FIG. 7 is a flowchart showing the operation of an embodiment of the related node search means constituting the information extracting apparatus to which the invention according to claim 1 is applied.
FIG. 8 is a flowchart showing the operation of an embodiment of the optimum text piece selecting means constituting the information extracting apparatus to which the invention according to claim 2 is applied.
FIG. 9 is a diagram for explaining a vector space model used in an embodiment of optimum text piece selecting means constituting an information extracting apparatus to which the invention according to claim 2 is applied.
FIG. 10 is a diagram showing a block configuration of an example of a conventional information extraction apparatus.
FIG. 11 is a diagram showing a flowchart of a feature extraction process of an example of a conventional information extraction apparatus.
FIG. 12 is a diagram showing a flowchart of a processing procedure of similarity determination means of an example of a conventional information extraction apparatus.
[Explanation of symbols]
10 Information extraction apparatus
11 Node analysis step (origin node analysis means)
12 Database registration step (database registration means)
13 Database step (database means)
14 Secondary node acquisition step (secondary node acquisition means)
15 Text division step (text division means)
16 Similarity calculation step (similarity calculation means)
17 Optimal text fragment selection step (optimal text fragment selection means)
18 Related node search step (related node search means)
a1, a2 Threshold values
r Maximum similarity
t Optimal text fragment
X1 to X3, Xr Vectors
θ1 to θ4 Vector angles

Claims

An information extraction method in which a computer extracts information from hypertext configured from a plurality of nodes, each consisting of text and links described in the text, each link carrying a secondary node path name that indicates the link-destination secondary node so as to indicate the connection between nodes, the method comprising:
a node analysis step of, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of the node and outputting an input table of pairs, each consisting of the secondary node path name and a link neighborhood text obtained by cutting out the portion of the text described in the vicinity of the link;
a secondary node acquisition step of accessing the secondary node and acquiring the text of the secondary node;
a text dividing step of dividing the text of the secondary node into a group of text pieces of a predetermined size;
a similarity calculation step of calculating the similarity of each of the divided text pieces with respect to the link neighborhood text, and outputting a similarity table in which each text piece is paired with its similarity;
an optimum text piece selecting step of selecting, from the text piece group, the text piece having the highest similarity as the optimum text fragment; and
a database step of sequentially analyzing the input table and storing in a database, for each of the secondary nodes, the optimum text fragment selected by the optimum text piece selecting step, together with the current time, the node path name, the secondary node path name, and the link neighborhood text.

The information extraction method described in Claim 1, wherein
the similarity calculation step divides each text into a morpheme group, regards the morpheme group as a vector, and takes the cosine of the angle θ between the vectors corresponding to the respective texts as the similarity, so that the similarity falls within the range of 0 to 1, and
the optimum text fragment selecting step selects, from the similarity table, the text fragment having the maximum similarity among the text fragments whose similarity lies within a predetermined range.

An information search method in which a computer retrieves information from hypertext configured from a plurality of nodes, each consisting of text and links described in the text, each link carrying a secondary node path name that indicates the link-destination secondary node so as to indicate the connection between nodes, wherein a node path name to be searched is input to an information extraction apparatus comprising:
node analysis means for, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of the node and outputting an input table of pairs, each consisting of the secondary node path name and a link neighborhood text obtained by cutting out the portion of the text described in the vicinity of the link;
secondary node acquisition means for accessing the secondary node and acquiring the text of the secondary node;
text dividing means for dividing the text of the secondary node into a group of text pieces of a predetermined size;
similarity calculation means for calculating the similarity of each of the divided text pieces with respect to the link neighborhood text, and outputting a similarity table in which each text piece is paired with its similarity;
optimum text piece selecting means for selecting, from the text piece group, the text piece having the highest similarity as the optimum text fragment; and
database means for sequentially analyzing the input table and storing in a database, for each of the secondary nodes, the optimum text fragment selected by the optimum text piece selecting means, together with the current time, the node path name, the secondary node path name, and the link neighborhood text,
and the computer performs the information search so as to output, from the database means, the group of data entries whose node path name matches the node path name to be searched.

The information search method described in Claim 3, wherein
the similarity calculation means divides each text into a morpheme group, regards the morpheme group as a vector, and takes the cosine of the angle θ between the vectors corresponding to the respective texts as the similarity, so that the similarity falls within the range of 0 to 1, and
the optimum text fragment selecting means selects, from the similarity table, the text fragment having the maximum similarity among the text fragments whose similarity lies within a predetermined range.

An information extraction computer program that extracts information from hypertext configured from a plurality of nodes, each consisting of text and links described in the text, each link carrying a secondary node path name that indicates the link-destination secondary node so as to indicate the connection between nodes, the program causing a computer to execute:
a node analysis step of, when a node path name indicating a node corresponding to a search keyword is input, analyzing the text and links of the node and outputting an input table of pairs, each consisting of the secondary node path name and a link neighborhood text obtained by cutting out the portion of the text described in the vicinity of the link;
a secondary node acquisition step of accessing the secondary node and acquiring the text of the secondary node;
a text dividing step of dividing the text of the secondary node into a group of text pieces of a predetermined size;
a similarity calculation step of calculating the similarity of each of the divided text pieces with respect to the link neighborhood text, and outputting a similarity table in which each text piece is paired with its similarity;
an optimum text piece selecting step of selecting, from the text piece group, the text piece having the highest similarity as the optimum text fragment; and
a database step of sequentially analyzing the input table and storing in a database, for each of the secondary nodes, the optimum text fragment selected by the optimum text piece selecting step, together with the current time, the node path name, the secondary node path name, and the link neighborhood text.

The information extraction computer program according to claim 5, wherein
the similarity calculation step divides each text into a morpheme group, regards the morpheme group as a vector, and takes the cosine of the angle θ between the vectors corresponding to the respective texts as the similarity, so that the similarity falls within the range of 0 to 1, and
the optimum text fragment selecting step selects, from the similarity table, the text fragment having the maximum similarity among the text fragments whose similarity lies within a predetermined range.