JP3604069B2

JP3604069B2 - Apparatus for calculating relevance between documents, method therefor, and recording medium therefor

Info

Publication number: JP3604069B2
Application number: JP13913399A
Authority: JP
Inventors: 雅且大久保; 正之杉崎; 大二郎森; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-05-19
Filing date: 1999-05-19
Publication date: 2004-12-22
Anticipated expiration: 2019-05-19
Also published as: JP2000331017A

Description

[0001]
[Technical field of the invention]
The present invention relates to an inter-document relevance calculation apparatus and method for calculating the relevance between documents, and more particularly to an inter-document relevance calculation apparatus and method that calculate, according to hyperlinks, the relevance between documents that are referenced by those hyperlinks.
[0002]
[Prior art]
By calculating the degree of relevance between documents, documents related to the document specified by the user can be efficiently searched, and a large number of documents can be classified and used. In order to realize this, a method of calculating the degree of relevance between documents has been conventionally proposed.
[0003]
For example, the literature (G. Salton, “Automatic Text Processing”, Addison Wesley, Reading, Mass., 1989) discloses calculating the relevance between documents based on the frequency of the words contained in each document. That is, when the frequently occurring words of two documents are similar to each other, the documents are considered to be highly relevant to each other; conversely, when the similarity between their frequently occurring words is low, the relevance is considered to be low.
[0004]
However, even when the same concept is described, synonyms may be used or the language itself may differ (for example, Japanese versus English), so the accuracy of inter-document relevance obtained by statistical processing of words is not necessarily high.
[0005]
Furthermore, the relevance between documents is not determined only by the similarity of the words used, but can be defined from various viewpoints.
[0006]
Therefore, when providing a service such as a related document search, it is necessary to find a document set that many people recognize as being related to each other.
[0007]
[Problems to be solved by the invention]
However, in the conventional example, there is a problem in that it is not possible to find the inter-document relevance recognized by many people as described above.
[0008]
For example, more than 300 million pages of WWW documents are published on the Internet (S. Lawrence and C. L. Giles, "Searching the World Wide Web", Science, Vol. 280, No. 5360, p. 98, 1998), and many users create and publish link collections that associate, from their own viewpoint, documents that match their interests and that they access repeatedly. In other words, the documents listed in a link collection can be regarded as being of reasonably good quality and as having a relevance defined from a particular viewpoint. Therefore, by totaling these relationships, a high-quality set of related documents can be created, which in turn makes it possible to provide useful services such as related page search.
[0009]
However, in the above-described conventional example, there is a problem that it is not possible to calculate the degree of relevance between documents in a form closer to a human viewpoint.
[0010]
An object of the present invention is to provide an inter-document relevance calculation apparatus and method that calculate the relevance between documents in a form closer to the human viewpoint by totaling the sets of related documents listed in link collections and the like created by humans.
[0011]
[Means for Solving the Problems]
When calculating the relevance between a plurality of documents referenced by hyperlinks, such as URLs, in a reference source document, the present invention extracts the hyperlinks described in each document, calculates the relevance between each pair of documents linked by the extracted hyperlinks, and totals the calculated relevance.
[0012]
[Embodiments and examples of the invention]
FIG. 1 is a block diagram showing an inter-document relevance calculating apparatus 100 according to an embodiment of the present invention.
[0013]
The inter-document relevance calculation apparatus 100 calculates the relevance between a plurality of documents referenced by hyperlinks, such as URLs, in a reference source document. It includes document selection means 10, an HTML document set memory 11, URL extraction means 20, an increase/decrease rule memory 21, inter-document relevance calculation means 30, inter-document totaling means 40, and an inter-document relevance memory 41.
[0014]
Here, in the above embodiment, the documents whose relevance is to be calculated are assumed to be written in HTML (HyperText Markup Language), and the location (storage location) of each linked document is assumed to be indicated by a URL (Uniform Resource Locator).
[0015]
The document selection means 10 selects a desired document from the HTML documents stored in the HTML document set memory 11.
[0016]
The HTML document set memory 11 is a memory in which a large number of HTML document sets are stored.
[0017]
The URL extracting unit 20 is an example of a hyperlink extracting unit that extracts a hyperlink described in each document. In the above embodiment, the URL extracting unit 20 is a unit that extracts a URL from each document.
[0018]
The increase/decrease rule memory 21 is a memory that stores the data of increase/decrease rules for increasing or decreasing the distance according to the type of tag.
[0019]
The inter-document relevance calculating unit 30 is a unit that calculates the inter-document relevance between each of a plurality of documents linked by the URL extracted by the URL extracting unit 20.
[0020]
The inter-document counting means 40 is a means for counting the inter-document relevance calculated by the inter-document relevance calculation means 30.
[0021]
The inter-document relevance memory 41 is a memory in which inter-document relevance is stored.
[0022]
FIG. 2 is a flowchart showing the operation of the inter-document relevance calculation device 100.
[0023]
First, an HTML document to be processed is selected (S1). The URL described in the selected HTML document and the storage location of the document referred to by the URL are extracted (S2), and the degree of association between the extracted URLs is determined. That is, the degree of relevance between the document referenced by one URL and the document referenced by another URL is determined (S3). Then, the degree of association between the URLs is totaled. That is, the calculated inter-document relevance is counted (S4). Then, the above processing (S1 to S4) is repeated until the above calculation is completed for all HTML documents (S5).
[0024]
The above embodiment can also be understood as an invention of a recording medium. That is, the above embodiment is an example of a computer-readable recording medium on which is recorded a program for causing a computer to execute: a document selection procedure for selecting an HTML document to be processed; an extraction procedure for extracting the hyperlinks displayed in the selected HTML document and the position of each hyperlink in the display description that displays it in the selected HTML document; an inter-document relevance calculation procedure for calculating the inter-document relevance between the document referenced by one extracted hyperlink and the document referenced by another extracted hyperlink; and a totaling procedure for totaling the calculated inter-document relevance.
[0025]
In this case, the recording medium may be an FD, a CD, a DVD, a semiconductor memory, or the like.
[0026]
FIG. 3 is a diagram showing a description example of an HTML document used in the above embodiment.
[0027]
FIG. 4 is a diagram showing an example in which the HTML document shown in FIG. 3 is displayed on a browser.
[0028]
As shown in FIG. 3, an HTML document is a mixture of tags, which start with “<” and end with “>” such as “<HEAD>” and “</HEAD>”, and ordinary text data.
[0029]
A hyperlink to another document is written, for example, as “<A HREF="URL1">document 1</A>” on the twelfth line of FIG. 3. That is, a hyperlink to another document is represented by:
(1) the tag “<A>” indicating a hyperlink, together with “URL1”, which indicates the storage location of the linked document,
(2) the display text “document 1” of the linked document, and
(3) the tag “</A>” indicating the end of the hyperlink description.
[0030]
In step S2 shown in FIG. 2, a storage location (that is, a URL) of another document described as a hyperlink and a display text corresponding to the URL are extracted from the HTML document.
[0031]
In the HTML document shown in FIG. 3, hyperlinks are described on the twelfth, thirteenth, seventeenth, and eighteenth lines. From this HTML document, URL1, URL2, URL11, and URL12 are extracted as storage locations of other documents.
[0032]
On the other hand, the texts referred to by the hyperlinks are document 1, document 2, document 11, and document 12, respectively. The position where these (display) texts are described is determined according to a predetermined position calculation rule. The position calculation rule in the above embodiment is the number of bytes from the first part of the HTML document to the position where the text is described. In the description of FIG. 3, the first line starts from the first byte.
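The byte-count rule above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `extract_links` and the regular expression are assumptions, only double-quoted HREF values are handled, and positions are 1-based because the text states that the first line starts from the first byte.

```python
import re

# Assumed pattern: matches <A HREF="...">display text</A> with a double-quoted URL.
# Real-world HTML would need a proper parser; this is only a sketch.
ANCHOR = re.compile(
    rb'<a\s+[^>]*?href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_links(html: bytes):
    """Return (URL, 1-based byte position of the display text) for each hyperlink."""
    links = []
    for m in ANCHOR.finditer(html):
        url = m.group(1).decode("ascii", "replace")
        # m.start(2) is the 0-based offset of the display text; +1 makes it 1-based.
        links.append((url, m.start(2) + 1))
    return links
```

For a hypothetical two-link document, `extract_links(b'<HTML><A HREF="URL1">doc1</A>\n<A HREF="URL2">doc2</A></HTML>')` yields the two URLs with the byte positions at which their display texts begin.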
[0033]
When counted in this way, the description positions of document 1, document 2, document 11, and document 12 are 129, 158, 215, and 246, respectively, as shown in FIG. 3(1).
[0034]
In the above embodiment, the rule for calculating the description position of the display text of each link is the number of bytes from the first part of the HTML document to the position where the text is described, but a rule different from this one may be used.
[0035]
For example, a rule may be adopted in which the description position of the display text of each link is the number of bytes, excluding tags, from the first part of the HTML document to the position where the text is described. According to this rule, the description positions of document 1, document 2, document 11, and document 12 are 44, 50, 65, and 72, respectively, as shown in FIG. 3(2).
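A minimal sketch of the tag-excluding rule, under stated assumptions: anything between "<" and ">" is treated as a tag, all other bytes (including whitespace and line breaks) are counted, and the hypothetical function `position_without_tags` takes a 1-based raw byte position as input. The source does not say how whitespace is counted, so the figures in FIG. 3(2) are not necessarily reproduced by this sketch.

```python
import re

TAG = re.compile(rb"<[^>]*>")  # assumption: anything between "<" and ">" is a tag

def position_without_tags(html: bytes, raw_pos: int) -> int:
    """Convert a 1-based raw byte position into a 1-based position that skips tag bytes."""
    offset = raw_pos - 1  # 0-based offset of the display text
    # Total length of every tag that ends at or before the display text.
    tag_bytes = sum(m.end() - m.start() for m in TAG.finditer(html) if m.end() <= offset)
    return raw_pos - tag_bytes
```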
[0036]
Alternatively, an increase/decrease rule may be defined that increases or decreases, according to the type of tag, the number of bytes from the first part of the HTML document to the position where the text is described, and the number of bytes up to the position where the text is described may be obtained with this rule taken into account.
[0037]
FIG. 5 is a diagram showing an example of an increase/decrease rule according to the type of tag in the above embodiment.
[0038]
In FIG. 5, the <HR> tag adds 100, the <UL> and <H1> tags add 50, the <H2> tag adds 30, and other tags cause no increase or decrease. As a result, for example, the text “related document collection” on the eighth line of FIG. 3 has a description position of 24 when counted in bytes excluding tags, but when the increase/decrease rule is applied, 50 is added because the text follows <H1>, so its description position becomes 74.
[0039]
When the increase/decrease rule is applied in this way, the description positions of document 1, document 2, document 11, and document 12 are 274, 280, 375, and 382, respectively, as shown in FIG. 3(3).
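The increase/decrease rule can be sketched as below. The rule values are those quoted from FIG. 5; everything else is an assumption made for illustration: the name `adjusted_position`, the choice to apply an adjustment only for opening tags (the source does not say whether closing tags count), and the base position being the tag-excluded byte count, as in the "24 becomes 74" example.

```python
import re

# Tag pattern capturing an optional "/" and the tag name (assumed, simplified).
TAG = re.compile(rb"<(/?)([A-Za-z][A-Za-z0-9]*)?[^>]*>")

# Adjustment values quoted from FIG. 5; tags not listed contribute 0.
RULES = {"HR": 100, "UL": 50, "H1": 50, "H2": 30}

def adjusted_position(html: bytes, raw_pos: int, rules=RULES) -> int:
    """Tag-excluded 1-based position plus the adjustments of preceding opening tags."""
    offset = raw_pos - 1  # 0-based offset of the display text
    tag_bytes = 0
    bonus = 0
    for m in TAG.finditer(html):
        if m.end() > offset:
            break  # this tag and all later ones lie after the display text
        tag_bytes += m.end() - m.start()
        is_closing = m.group(1) == b"/"
        name = (m.group(2) or b"").decode("ascii").upper()
        if not is_closing:  # assumption: only opening tags trigger an adjustment
            bonus += rules.get(name, 0)
    return raw_pos - tag_bytes + bonus
```

With this convention, a display text preceded by one <H1> and one <UL> is pushed 100 positions further away from anything that precedes those tags, which lowers its relevance to links in an earlier section.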
[0040]
In addition, the description position can be expressed as the number of the line on which the text appears when displayed in a browser. In this case, the description positions of document 1, document 2, document 11, and document 12 are 4, 5, 8, and 9, respectively, as shown in FIG. 3(4).
[0041]
In step S3 in FIG. 2, the relevance between the URLs (that is, the relevance between the documents referenced by the URLs) is calculated based on each URL extracted in step S2 and the position of that URL in the display description that displays it. In the above embodiment, the relevance is the reciprocal of the difference between the description positions of the display texts.
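Written as a formula, with POS and REL as defined later for FIG. 8, the rule reads as follows. The absolute value is an assumption added here so that the result does not depend on the order of the two links. The worked values use the description positions of document 1 and document 2 quoted in the text under the byte-count rule (129, 158) and under the tag-excluding rule (44, 50), and show that the choice of position rule changes the resulting relevance.

```latex
\mathrm{REL}[i,j] = \frac{1}{\left|\mathrm{POS}[i] - \mathrm{POS}[j]\right|}
\qquad
\frac{1}{|129 - 158|} = \frac{1}{29} \approx 0.0345,
\qquad
\frac{1}{|44 - 50|} = \frac{1}{6} \approx 0.1667
```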
[0042]
FIG. 6 is a diagram illustrating the degree of association between URLs calculated in the above embodiment.
[0043]
As described above, the inter-document relevance is obtained as the reciprocal of the difference between the description positions of the display texts, and the relevance among URL1, URL2, URL11, and URL12 obtained in this way is calculated as shown in FIG. 6.
[0044]
As a method of calculating the degree of association between URLs, a method based on the description position of the display text may be adopted other than the method of using the reciprocal of the square of the difference between the description positions of the display text.
[0045]
In step S4 in FIG. 2, the relevance between URLs calculated for each HTML document is totaled.
[0046]
By executing steps S1 to S4 for all target HTML documents, the degree of relevance between URLs, that is, the degree of relevance between HTML documents indicated by the URLs can be obtained.
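The totaling in step S4 can be sketched as a sum over per-document results. This is an illustration under assumptions: the claims describe the totaling as summing, the function name `aggregate` is hypothetical, and URL pairs are treated as unordered so that (URL1, URL2) and (URL2, URL1) accumulate into the same entry.

```python
from collections import defaultdict

def aggregate(per_document):
    """Sum per-document relevance values for each unordered URL pair.

    per_document: iterable of dicts mapping (url_a, url_b) -> relevance,
    one dict per reference source HTML document.
    """
    total = defaultdict(float)
    for rel in per_document:
        for (a, b), value in rel.items():
            key = (a, b) if a <= b else (b, a)  # normalise the pair order
            total[key] += value
    return dict(total)
```

Two documents that both list URL1 next to URL2 therefore reinforce each other, which is how pages that many link collections place close together end up with high relevance.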
[0047]
The relevance obtained in this way is high for documents that are described close to each other in many HTML documents, even if the original HTML documents are not directly connected by hyperlinks. Therefore, for WWW documents on the Internet, where many users create collections of related links from various viewpoints, services such as related page search can be provided by totaling these relationships, which significantly improves convenience.
[0048]
FIG. 7 is an operation explanatory diagram of steps S2 and S3 in the above embodiment.
[0049]
FIG. 7(1) is a diagram showing each extracted link and the description position of its display text, and FIG. 7(2) is a diagram showing, as the inter-document relevance, the reciprocal of the difference between the description positions of the texts referred to by each pair of links. FIG. 7 is the same in content as FIG. 6.
[0050]
FIG. 8 is a flowchart illustrating a specific example of calculating the inter-document relevance, which is the reciprocal of the difference between the description positions of the texts referred to by two links in the above embodiment.
[0051]
Let N be the number of input links (N = 4 in the example shown in FIG. 7), LINK[i] be each link (i = 1, 2, 3, 4 in the example shown in FIG. 7), POS[i] be the description position of the display text of each link, and REL[i, j] be the relevance between LINK[i] and LINK[j].
[0052]
In FIG. 8, the link index i is set to 1 (S11). If i has reached N-1 (S12), the result is output; otherwise, the index j of the partner link whose relevance is to be determined is set to n + 1 (S13), the relevance REL[i, j] between LINK[i] and LINK[j] is calculated as the reciprocal of the difference (S14), and j is incremented by 1 (S15). While j is N or less (S16), the above processing (S14, S15) is repeated. When j becomes larger than N (S16), i is incremented by 1 (S17), and the process returns to step S12.
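The loop of FIG. 8 can be sketched as a double loop over link pairs. Several points are assumptions rather than statements from the source: the partner index is taken to start at i + 1 (the text writes "n + 1", and n is not otherwise defined), the outer loop is taken to cover every pair once, the absolute value of the difference is used, and the function name `pairwise_relevance` is hypothetical. Indices are 0-based here, whereas the flowchart counts from 1.

```python
def pairwise_relevance(link, pos):
    """Return REL for every pair (i < j) as the reciprocal of the position difference.

    link: list of URLs (LINK[i]); pos: list of description positions (POS[i]).
    Assumes all positions are distinct, so the difference is never zero.
    """
    n = len(link)
    rel = {}
    for i in range(n - 1):             # S11, S12, S17
        for j in range(i + 1, n):      # S13, S15, S16 (assumed start at i + 1)
            rel[(link[i], link[j])] = 1.0 / abs(pos[i] - pos[j])  # S14
    return rel

# Positions quoted in the text for the byte-count rule.
example = pairwise_relevance(["URL1", "URL2", "URL11", "URL12"], [129, 158, 215, 246])
```

With N = 4 links this produces six pairs; adjacent entries in the same list (URL1 and URL2, or URL11 and URL12) receive the largest values.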
[0053]
In other words, one HTML document is selected from a plurality of HTML documents, the hyperlinks displayed in the selected HTML document are extracted, and the position of each hyperlink in the display description that displays it in the selected HTML document is extracted. The inter-document relevance between the document referenced by the first extracted hyperlink and the document referenced by the second extracted hyperlink is then calculated by the inter-document relevance calculation procedure. As an example of the inter-document relevance calculation means, in the above embodiment the number of input hyperlinks is N, each hyperlink is LINK[i], the description position of the display text of each hyperlink is POS[i], the relevance between LINK[i] and LINK[j] is REL[i, j], and REL[i, j] is calculated as the reciprocal of the difference.
[0054]
That is, the reciprocal of the difference between the position of the first hyperlink in the display description that displays it in the selected HTML document and the position of the second hyperlink in the display description that displays it in the selected HTML document is obtained as the inter-document relevance.
[0055]
FIG. 9 is a diagram for explaining another operation of steps S2 and S3 in the above embodiment.
[0056]
FIG. 9(1) is a diagram showing an example in which the same link appears more than once among the input links; here, URL1 appears twice. In this case, the maximum value of the positions of the two texts referred to by the two links is adopted. Alternatively, the average value of the positions of the two texts referred to by the two links may be adopted.
[0057]
FIG. 9(2) is a diagram showing, as the inter-document relevance, the reciprocal of the difference between the description positions of the texts referred to by each pair of links in the case shown in FIG. 9(1).
[0058]
FIG. 10 is a flowchart for obtaining the document relevance in the above embodiment when the same link appears more than once among the input links and the maximum value of the positions of the two texts referred to by the two links is adopted.
[0059]
The flowchart shown in FIG. 10 is basically the same as the flowchart shown in FIG. 8, except that steps S21 to S24 are provided instead of step S14 in the flowchart shown in FIG.
[0060]
Note that LINK[1] = LINK[3], and the unique ID determined from the link name of each link is denoted ID[link name]. The relevance between ID[LINK[i]] and ID[LINK[j]] is denoted REL[ID[LINK[i]], ID[LINK[j]]], and max(a, b) is the larger of a and b (more precisely, the one that is not smaller).
[0061]
That is, after the index j of the partner link whose relevance is to be determined is set to n + 1 (S13), the reciprocal R of the difference between LINK[i] and LINK[j] is obtained (S21). If REL[ID[LINK[i]], ID[LINK[j]]] has already been calculated (S22), max(REL[ID[LINK[i]], ID[LINK[j]]], R) is stored as REL[ID[LINK[i]], ID[LINK[j]]] (S23); if it has not yet been calculated (S22), R is stored as REL[ID[LINK[i]], ID[LINK[j]]] (S24).
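Steps S21 to S24 can be sketched as follows. This is an illustration under assumptions: the URL string itself serves as the unique ID, a pair is stored under a sorted key so that its two orderings coincide, pairs consisting of the same URL twice are skipped, the partner index starts at i + 1, and the name `pairwise_relevance_max` is hypothetical.

```python
def pairwise_relevance_max(link, pos):
    """Like the FIG. 8 loop, but keep the largest relevance when a URL pair repeats."""
    rel = {}
    n = len(link)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if link[i] == link[j]:
                continue  # assumption: a URL is not scored against itself
            r = 1.0 / abs(pos[i] - pos[j])            # S21: reciprocal of the difference
            key = tuple(sorted((link[i], link[j])))   # the URL acts as the unique ID
            if key in rel:                            # S22: already calculated?
                rel[key] = max(rel[key], r)           # S23: keep the larger value
            else:
                rel[key] = r                          # S24: first value for this pair
    return rel
```

For links URL1, URL2, URL1 at hypothetical positions 10, 20, 25, the pair (URL1, URL2) is seen twice, with values 1/10 and 1/5, and the larger value 0.2 is kept.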
[0062]
That is, when the first hyperlink appears twice, the maximum value, or alternatively the average value, of the positions of that hyperlink in the display descriptions that display it is taken as the position of the hyperlink.
[0063]
In the above embodiment, HTML is used as the method for describing documents, but another description language may be used as long as it can describe relationships between documents. One such description language is XML (eXtensible Markup Language).
[0064]
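Note that the above embodiment is an example in which the documents to be totaled are collected in advance and stored in a database or the like, but the relevance between documents may instead be calculated in parallel with the process of collecting the documents to be totaled.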
[0065]
According to the above embodiment, the hyperlinks described in each document are first extracted. A hyperlink consists of a document location (URL) that uniquely identifies the linked document and a display description used when the link is shown on the screen; when a hyperlink is extracted, the document location (URL) and the position of the display description within the document are extracted. Next, the relevance between the extracted document locations (URLs) is calculated based on the positions of the display descriptions within the document. The calculation is made so that the closer the description positions are, the higher the relevance, and the farther apart they are, the lower the relevance, which yields the relationship between documents intended by the creator of the document. Finally, the calculated inter-document relevance is totaled to obtain the final relevance between the documents.
[0066]
[Effect of the invention]
According to the present invention, the relevance between documents is calculated by totaling link collections and the like written by humans, so the relevance can be obtained in a form closer to the human viewpoint. As a result, the operability of information providing systems, such as those for presenting and searching related information, is greatly improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an inter-document relevance calculating apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the inter-document relevance calculating apparatus 100.
FIG. 3 is a diagram showing a description example of an HTML document used in the embodiment.
FIG. 4 is a diagram showing an example in which the HTML document shown in FIG. 3 is displayed on a browser.
FIG. 5 is a diagram showing an example of an increase / decrease rule according to a tag type in the embodiment.
FIG. 6 is a diagram showing a degree of association between URLs calculated in the embodiment.
FIG. 7 is an operation explanatory diagram of steps S2 and S3 in the embodiment.
FIG. 8 is a flowchart illustrating a specific example of calculating the inter-document relevance, which is the reciprocal of the difference between the description positions of texts referred to by two links in the embodiment.
FIG. 9 is another operation explanatory view of steps S2 and S3 in the embodiment.
FIG. 10 is a flowchart for obtaining the document relevance in the above embodiment when the same link appears more than once among the input links and the maximum value of the positions of the two texts referred to by the two links is adopted.
[Explanation of symbols]
10 ... document selection means,
11 ... HTML document set memory,
20 ... URL extraction means,
21 ... increase/decrease rule memory,
30 ... inter-document relevance calculation means,
40 ... inter-document totaling means,
41 ... inter-document relevance memory.

Claims

An inter-document relevance calculation apparatus that calculates the relevance between a plurality of documents referenced by hyperlinks in at least one reference source document, comprising:
hyperlink extracting means for extracting each hyperlink described in the reference source document and the description position of each hyperlink in the reference source document;
inter-document relevance calculating means for calculating the relevance between the documents according to a relevance calculation rule under which the relevance is higher the closer the description position of a given hyperlink described in the reference source document is to the description position of another hyperlink described in the reference source document, and lower the farther apart the two description positions are; and
inter-document relevance counting means for summing, for each of the plurality of documents referenced by hyperlinks in the reference source document, all the inter-document relevances obtained by the inter-document relevance calculating means.
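The extraction, calculation, and counting means of this apparatus claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the input format of (URL, position) pairs, and the example URLs and positions are hypothetical. The reciprocal of the position difference is used as the calculation rule because a dependent claim recites it; this claim itself only requires that relevance fall as the positions move apart.

```python
from collections import defaultdict
from itertools import combinations

def reciprocal_rule(pos_a, pos_b):
    """Assumed rule: relevance is the reciprocal of the position difference."""
    return 1.0 / abs(pos_a - pos_b)

def pairwise_relevance(links, rule=reciprocal_rule):
    """Relevance for every pair of links found in one reference source document.

    links: list of (url, position) tuples, as an extraction step might return.
    """
    scores = {}
    for (url_a, pos_a), (url_b, pos_b) in combinations(links, 2):
        if url_a == url_b or pos_a == pos_b:
            continue  # assumption: skip self-pairs and zero distances
        key = tuple(sorted((url_a, url_b)))
        scores[key] = scores.get(key, 0.0) + rule(pos_a, pos_b)
    return scores

def total_relevance(documents, rule=reciprocal_rule):
    """Sum the pairwise relevances over all reference source documents."""
    totals = defaultdict(float)
    for links in documents:
        for pair, score in pairwise_relevance(links, rule).items():
            totals[pair] += score
    return dict(totals)

# Two hypothetical link collections with illustrative positions.
doc1 = [("http://a.example/", 10), ("http://b.example/", 20), ("http://c.example/", 60)]
doc2 = [("http://a.example/", 5), ("http://b.example/", 10)]
totals = total_relevance([doc1, doc2])
```

In this example the pair a/b is listed close together in both collections, so its summed relevance (1/10 + 1/5) exceeds that of the pairs involving c, which appears in only one collection and farther away.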

The inter-document relevance calculation apparatus according to claim 1, wherein the relevance calculation rule sets, as the relevance, the reciprocal of the difference between the description position of a given hyperlink described in the reference source document and the description position of another hyperlink described in the reference source document.

The inter-document relevance calculation apparatus according to claim 1 or claim 2, wherein the description position of a hyperlink described in the reference source document is the number of bytes from the head of the reference source document to the position where the hyperlink is described.
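A byte-count description position of this kind could be obtained roughly as shown below. The sketch assumes that the "position where the hyperlink is described" is the byte offset at which the anchor tag opens, and it uses a simple regular expression rather than a full HTML parser; the function name and sample page are hypothetical.

```python
import re

# Matches an opening anchor tag and captures the quoted href value.
HREF_RE = re.compile(rb'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links_with_byte_positions(html_bytes):
    """Return (url, byte_offset) pairs, with offsets counted from the document head."""
    return [(m.group(1).decode("ascii", "replace"), m.start())
            for m in HREF_RE.finditer(html_bytes)]

page = (b'<html><body><a href="http://a.example/">A</a><br>'
        b'<a href="http://b.example/">B</a></body></html>')
links = extract_links_with_byte_positions(page)
```

Working on bytes rather than decoded text keeps the offsets equal to byte counts even when the page contains multi-byte characters.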

The inter-document relevance calculation apparatus according to claim 1 or claim 2, wherein the description position of a hyperlink described in the reference source document is the number of bytes, excluding tags, from the head of the reference source document to the position where the hyperlink is described.
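Counting bytes while excluding tags might look like the following sketch. It assumes that everything between `<` and `>` is a tag and that only the bytes outside tags contribute to the position; the function name and sample page are hypothetical.

```python
import re

TAG_RE = re.compile(rb"<[^>]*>")
ANCHOR_RE = re.compile(rb"<a\s", re.IGNORECASE)
HREF_RE = re.compile(rb'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links_text_positions(html_bytes):
    """Return (url, position), where position counts only bytes outside tags."""
    links = []
    text_bytes = 0  # non-tag bytes seen so far
    cursor = 0      # raw offset just past the previous tag
    for tag in TAG_RE.finditer(html_bytes):
        text_bytes += tag.start() - cursor  # text between the previous tag and this one
        cursor = tag.end()
        if ANCHOR_RE.match(tag.group(0)):
            href = HREF_RE.search(tag.group(0))
            if href:
                links.append((href.group(1).decode("ascii", "replace"), text_bytes))
    return links

page = (b'<p>Links:</p><a href="http://a.example/">A site</a> and '
        b'<a href="http://b.example/">B site</a>')
links = extract_links_text_positions(page)
```

With this measure, long URLs or attribute lists inside tags no longer push subsequent links farther apart; only the visible text between them does.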

The inter-document relevance calculation apparatus according to claim 1 or claim 2, wherein
an increase / decrease rule for increasing or decreasing the number of bytes of a tag is defined according to the type of tag appearing in the reference source document, and
the description position of a hyperlink described in the reference source document is
the position obtained by adding, to the number of bytes from the head of the reference source document to the position where the hyperlink is described, the number of bytes obtained by applying the increase / decrease rule to the tags that exist between the head of the reference source document and the position where the hyperlink is described.
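One possible reading of the increase / decrease rule is sketched below: each tag type carries a signed byte adjustment, and the adjustments of all tags preceding a hyperlink are added to its raw byte offset. The rule values in the table are purely hypothetical and are not taken from FIG. 5; the function name and sample page are likewise assumptions.

```python
import re

TAG_RE = re.compile(rb"<\s*(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
HREF_RE = re.compile(rb'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Hypothetical rule table: bytes added (+) or removed (-) per tag type.
INCREASE_DECREASE_RULES = {b"hr": 100, b"p": 20, b"br": 10}

def extract_links_adjusted(html_bytes, rules=INCREASE_DECREASE_RULES):
    """Return (url, position): raw byte offset plus adjustments of preceding tags."""
    links = []
    adjustment = 0
    for tag in TAG_RE.finditer(html_bytes):
        closing, name = tag.group(1), tag.group(2).lower()
        if closing:
            continue  # assumption: only opening tags carry an adjustment
        if name == b"a":
            href = HREF_RE.search(tag.group(0))
            if href:
                url = href.group(1).decode("ascii", "replace")
                links.append((url, tag.start() + adjustment))
        else:
            adjustment += rules.get(name, 0)
    return links

page = (b'<a href="http://a.example/">A</a><br>'
        b'<a href="http://b.example/">B</a><hr>'
        b'<a href="http://c.example/">C</a>')
links = extract_links_adjusted(page)
```

Here the horizontal rule between the second and third links adds a large hypothetical penalty, so those two links end up much farther apart than their raw byte offsets alone would suggest.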

The inter-document relevance calculation apparatus according to claim 1 or claim 2, wherein
the description position of a hyperlink described in the reference source document is
the number of the line on which the hyperlink is displayed when the reference source document is displayed according to a predetermined display rule.
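A displayed-line position depends on what the "predetermined display rule" is, which this claim leaves open. The sketch below assumes a deliberately simple rule in which certain opening tags start a new displayed line and text is never wrapped; the tag set, function name, and sample page are hypothetical.

```python
import re

TAG_RE = re.compile(r"<\s*(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Assumed display rule: each of these opening tags starts a new displayed line.
LINE_BREAKING_TAGS = {"br", "p", "li", "hr", "tr"}

def extract_links_by_line(html_text):
    """Return (url, line_number) pairs under the assumed display rule."""
    links = []
    line = 1
    for tag in TAG_RE.finditer(html_text):
        closing, name = tag.group(1), tag.group(2).lower()
        if closing:
            continue
        if name in LINE_BREAKING_TAGS:
            line += 1
        elif name == "a":
            href = HREF_RE.search(tag.group(0))
            if href:
                links.append((href.group(1), line))
    return links

page = ('<ul><li><a href="http://a.example/">A</a>'
        '<li><a href="http://b.example/">B</a> <a href="http://c.example/">C</a>'
        '<li><a href="http://d.example/">D</a></ul>')
links = extract_links_by_line(page)
```

Two links on the same displayed line receive the same position, so a rule based on the reciprocal of the difference would need some convention for a zero difference; that choice is left to the caller in this sketch.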

A control method for an inter-document relevance calculation apparatus that calculates the relevance between a plurality of documents referenced by hyperlinks in at least one reference source document and that has hyperlink extracting means, inter-document relevance calculating means, and inter-document relevance counting means, the method comprising:
a step in which the hyperlink extracting means extracts each hyperlink described in the reference source document and the description position of each hyperlink in the reference source document;
a step in which the inter-document relevance calculating means determines the relevance between the documents according to a relevance calculation rule under which the relevance is higher the closer the description position of a given hyperlink described in the reference source document is to the description position of another hyperlink described in the reference source document, and lower the farther apart the two description positions are; and
a step in which the inter-document relevance counting means sums, for each of the plurality of documents referenced by hyperlinks in the reference source document, all the inter-document relevances calculated by the inter-document relevance calculating means, and stores the result in a memory.

The control method for an inter-document relevance calculation apparatus according to claim 7, wherein the relevance calculation rule sets, as the relevance, the reciprocal of the difference between the description position of a given hyperlink described in the reference source document and the description position of another hyperlink described in the reference source document.

The control method for an inter-document relevance calculation apparatus according to claim 7 or claim 8, wherein the description position of a hyperlink described in the reference source document is the number of bytes from the head of the reference source document to the position where the hyperlink is described.

The control method for an inter-document relevance calculation apparatus according to claim 7 or claim 8, wherein the description position of a hyperlink described in the reference source document is the number of bytes, excluding tags, from the head of the reference source document to the position where the hyperlink is described.

The control method for an inter-document relevance calculation apparatus according to claim 7 or claim 8, wherein
an increase / decrease rule for increasing or decreasing the number of bytes of a tag is defined according to the type of tag appearing in the reference source document, and
the description position of a hyperlink described in the reference source document is
the position obtained by adding, to the number of bytes from the head of the reference source document to the position where the hyperlink is described, the number of bytes obtained by applying the increase / decrease rule to the tags that exist between the head of the reference source document and the position where the hyperlink is described.

The control method for an inter-document relevance calculation apparatus according to claim 7 or claim 8, wherein
the description position of a hyperlink described in the reference source document is
the number of the line on which the hyperlink is displayed when the reference source document is displayed according to a predetermined display rule.

A computer-readable recording medium storing a program for calculating the relevance between a plurality of documents referenced by hyperlinks in at least one reference source document, the program causing a computer to function as hyperlink extracting means, inter-document relevance calculating means, and inter-document relevance counting means, and causing the computer to execute:
a procedure in which the hyperlink extracting means extracts each hyperlink described in the reference source document and the description position of each hyperlink in the reference source document;
a procedure in which the inter-document relevance calculating means determines the relevance between the documents according to a relevance calculation rule under which the relevance is higher the closer the description position of a given hyperlink described in the reference source document is to the description position of another hyperlink described in the reference source document, and lower the farther apart the two description positions are; and
a procedure in which the inter-document relevance counting means sums, for each of the plurality of documents referenced by hyperlinks in the reference source document, all the inter-document relevances calculated by the inter-document relevance calculating means, and stores the result in a memory.

The computer-readable recording medium according to claim 13, wherein
the procedure for determining the relevance between the documents is
a procedure for obtaining, as the inter-document relevance, the reciprocal of the difference between the position of a first hyperlink in the display description that displays the first hyperlink in the reference source document and the position of a second hyperlink in the display description that displays the second hyperlink in the reference source document.

The computer-readable recording medium according to claim 14, wherein,
when the first hyperlink appears twice, the maximum value or the average value of the positions of that hyperlink in the display descriptions that display the first hyperlink is taken as the position of the hyperlink.
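The handling of a hyperlink that appears more than once can be illustrated as follows. This sketch collapses every repeated URL to a single position, either the maximum or the average of its positions, before applying the reciprocal rule; the function names, the `mode` parameter, and the sample positions are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def merge_duplicate_links(links, mode="max"):
    """Collapse repeated URLs to one position: the maximum or the average."""
    combiners = {"max": max, "average": mean}
    if mode not in combiners:
        raise ValueError("mode must be 'max' or 'average'")
    positions = defaultdict(list)
    for url, pos in links:
        positions[url].append(pos)
    return {url: combiners[mode](ps) for url, ps in positions.items()}

def reciprocal_relevance(pos_a, pos_b):
    """Reciprocal of the difference between two positions."""
    return 1.0 / abs(pos_a - pos_b)

# The link to a.example appears twice, at positions 10 and 40.
links = [("http://a.example/", 10), ("http://b.example/", 30), ("http://a.example/", 40)]

by_max = merge_duplicate_links(links, "max")
by_avg = merge_duplicate_links(links, "average")

rel_max = reciprocal_relevance(by_max["http://a.example/"], by_max["http://b.example/"])
rel_avg = reciprocal_relevance(by_avg["http://a.example/"], by_avg["http://b.example/"])
```

The two options can give different relevances for the same page: here the maximum places the repeated link at 40 and the average places it at 25, so the distance to the other link is 10 in one case and 5 in the other.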