JP3705331B2

JP3705331B2 - Hypertext analysis apparatus and method, and storage medium storing hypertext analysis program

Info

Publication number: JP3705331B2
Application number: JP34575998A
Authority: JP
Inventors: 雄大中山; 裕樹加藤
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-12-04
Filing date: 1998-12-04
Publication date: 2005-10-12
Anticipated expiration: 2018-12-04
Also published as: JP2000172665A

Description

[0001]
[Technical Field of the Invention]
The present invention relates to a hypertext analysis apparatus and a hypertext analysis method that analyze a hyperlink structure in order to discover knowledge for judging the quality of the configuration of a hypertext system built on a network, and to a recording medium storing a hypertext analysis program for realizing the hypertext analysis apparatus or method on a computer.
[0002]
[Prior art]
In a hypertext system configured on a network (for example, the World Wide Web, hereinafter abbreviated as “Web”), the server storing the hypertext can record the access history of users (visitors). This access history generally includes the identifier of the computer used by the accessing user (the IP address, if the Internet is used), the time of access, and the identifier on the server of the accessed node (the URL, on the Web).
[0003]
To analyze the access history and identify the hyperlink path each user followed, it is basically sufficient to arrange the accessed nodes in time order for each computer. However, it has been difficult to identify the complete path because of the computer's cache function and the like.
[0004]
In contrast, for example, C. Shahabi, A. M. Zarkesh, J. Adibi, and V. Shah, “Knowledge Discovery from Users Web-Page Navigation”, in Proc. of IEEE RIDE, 1997, uses a remote agent to recognize accesses to the cache. This makes it possible to obtain a more accurate path. However, it has the drawback of incurring the cost (time and space) of loading the remote agent program on the user's computer.
[0005]
As a technique for discovering important paths using an access history, there is, for example, the technique described in J. Borges and M. Levene, “Mining Association Rules in Hypertext Databases”, in Proc. of KDD, 1998. This technique first collects access histories and represents the hyper structure as a directed graph weighted by the traffic on each hyperlink. In this directed graph, the transition from node A to node B is written A → B and is called an association rule. An association rule is evaluated by a confidence value (= the number of A → B transitions relative to the total number of transitions starting from A) and a support value (= the number of A → B transitions relative to the average number of transitions over all arcs in the directed graph). Furthermore, a composite association rule (A → B) & (B → C) & (C → D) & ... is defined by composing association rules, and transitions across three or more nodes are evaluated.
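The confidence and support values defined above can be illustrated with a small sketch. It is not taken from the cited paper; the function name `rule_metrics` and the sample traffic are hypothetical, and the input is assumed to be a list of (source, target) pairs with one pair per recorded traversal.

```python
from collections import Counter

def rule_metrics(transitions):
    """Return {(a, b): (confidence, support)} for each observed arc a -> b."""
    arc_counts = Counter(transitions)          # traversals per arc
    if not arc_counts:
        return {}
    out_counts = Counter()                     # traversals leaving each node
    for (src, _dst), n in arc_counts.items():
        out_counts[src] += n
    # Average number of traversals over all arcs in the directed graph.
    mean_per_arc = sum(arc_counts.values()) / len(arc_counts)
    return {
        arc: (n / out_counts[arc[0]], n / mean_per_arc)
        for arc, n in arc_counts.items()
    }

# Hypothetical traffic: A->B three times, A->C once, B->C twice.
metrics = rule_metrics([("A", "B")] * 3 + [("A", "C")] + [("B", "C")] * 2)
```

Because the input carries no computer identifier, the two traversals of B → C may come from users who never visited A, which is the limitation discussed in the next paragraph.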
[0006]
Because this method does not use the computer identifier (IP address) information, a composite association rule merely finds a path combining hyperlinks with heavy traffic. For example, even if the composite association rule (A → B) & (B → C) is found to have high confidence and support values, it does not necessarily follow that many users actually followed the path A → B → C.
[0007]
As described above, it is possible to specify the access route of the user or discover the important route by the conventional technique. However, it has not been possible to obtain knowledge for judging the superiority or inferiority of the configuration of a hypertext system (for example, a website).
[0008]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and an object thereof is to provide a hypertext analysis apparatus and a hypertext analysis method that can recognize users' access tendencies from an access history and that assist in obtaining knowledge for judging the quality of the configuration of a hypertext system. Another object is to provide a recording medium storing a hypertext analysis program for realizing the hypertext analysis apparatus or method on a computer.
[0009]
[Means for Solving the Problems]
The present invention is characterized by performing clustering on the nodes constituting a hypertext system based on access history information for the hypertext system, calculating, for each cluster obtained, the hyperlink cohesion degree between the nodes constituting the cluster, and displaying the calculated hyperlink cohesion degree. Because the displayed hyperlink cohesion degree is a value based on the history of user accesses, it indicates users' access tendencies. Therefore, by obtaining the hyperlink cohesion degree and considering it together with, for example, the hyperlink configuration of the hypertext system (for example, a Web site), the quality of the configuration of the hypertext system can be judged.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a configuration diagram showing a first embodiment of the present invention, and FIG. 2 is an explanatory diagram of an example of the Web. In the figures, 1 is a Web server, 2 is a hypertext analysis device, 11 is access history information, 21 is an access tendency analysis unit, 22 is a hyperlink configuration analysis unit, and 23 is a hyperlink cohesion degree display unit. The Web is a typical hypertext system configured on a network, and is used as the example in the description below.
[0011]
The Web server 1 is a means for transmitting information on a network. As shown in FIG. 2, the Web server 1 stores the information to be provided to users in a hyper structure made up of nodes (shown as rectangles) and hyperlinks (shown as arrows). A user can obtain content by accessing the Web server 1. Each time a user accesses it, the Web server 1 generally records, as access history information 11, a computer identifier (IP address) identifying the user's computer, the access time, and the address (URL) of the node the user accessed.
[0012]
The hypertext analysis device 2 includes an access tendency analysis unit 21, a hyperlink configuration analysis unit 22, a hyperlink cohesion degree display unit 23, and the like. The access tendency analysis unit 21 performs a clustering process on the node group on the Web server 1 using the corresponding access history information 11. An existing technique can be used for the clustering process. For example, in the clustering method using Agglomerative Hierarchical Clustering, the following steps 1, 2, and 3 are performed.
1. Each node on the Web server 1 is set as one cluster.
2. The similarity between the clusters is calculated, and the clusters having the maximum similarity are merged into one cluster.
3. Repeat step 2 until there is one cluster.
Each cluster generated in sequence during this process is obtained as a result of the clustering. For example, the single-node clusters generated in step 1, the clusters obtained by merging them, and the single cluster generated last are all clustering results. The clustering method described above is described in, for example, E. M. Voorhees, “Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval”, Information Processing & Management, Vol. 22, No. 6, 1986.
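Steps 1 to 3 can be sketched as follows. This is a minimal illustration rather than the implementation of the cited paper: the function names, the `visitors` table, and the shared-visitor similarity are hypothetical, and node names are assumed to be unique. The function returns every cluster produced along the way, so N nodes yield 2N - 1 clusters.

```python
from itertools import combinations

def agglomerative_clusters(nodes, similarity):
    """Return every cluster produced by steps 1-3, in order of creation.

    similarity: function taking two frozensets of nodes and returning a number.
    """
    active = [frozenset([n]) for n in nodes]   # step 1: one cluster per node
    produced = list(active)
    while len(active) > 1:                     # step 3: repeat until one is left
        # step 2: merge the pair of clusters with the highest similarity
        a, b = max(combinations(active, 2), key=lambda p: similarity(*p))
        merged = a | b
        active = [c for c in active if c not in (a, b)] + [merged]
        produced.append(merged)
    return produced

# Hypothetical visitor sets per node, used only to drive the example.
visitors = {"/a": {"ip1", "ip2"}, "/b": {"ip1", "ip2"}, "/c": {"ip3"}}

def shared_visitors(c1, c2):
    ips1 = set().union(*(visitors[n] for n in c1))
    ips2 = set().union(*(visitors[n] for n in c2))
    return len(ips1 & ips2)

clusters = agglomerative_clusters(["/a", "/b", "/c"], shared_visitors)
```

Recomputing every pairwise similarity on each pass keeps the sketch short but is quadratic per merge; a larger site would probably need cached similarities.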
[0013]
The access tendency analysis unit 21 uses the degree of overlap of the access history information 11 in the similarity calculation of step 2 of the clustering process described above. Specifically, for example, focusing on the IP addresses in the access history information 11, the number of IP addresses shared between clusters is defined as the similarity. Alternatively, the similarity is defined as the number of IP addresses shared between the clusters divided by the number of all IP addresses held by the two clusters. Here, the IP addresses held by a cluster are the IP addresses that appear when the access histories of all nodes constituting the cluster are merged. The number of IP addresses can be the count of distinct IP addresses after duplicates are removed. Instead of the IP address, for example, the access time in the access history information 11 may be used. In this case, it is better to handle times in units of days or months rather than seconds or minutes.
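The two IP-based similarity definitions might be written as below. The function names and sample addresses are hypothetical. For the second definition, "all IP addresses held by the two clusters" is read here as the union of the two distinct-address sets, which is an assumption; the text could also be read as the sum of the two counts.

```python
def cluster_ips(cluster, ips_by_node):
    """Distinct IP addresses in the merged access history of a cluster."""
    merged = set()
    for node in cluster:
        merged |= ips_by_node.get(node, set())
    return merged

def shared_ip_count(c1, c2, ips_by_node):
    """First definition: number of IP addresses the two clusters share."""
    return len(cluster_ips(c1, ips_by_node) & cluster_ips(c2, ips_by_node))

def shared_ip_ratio(c1, c2, ips_by_node):
    """Second definition, read here as shared / union (an assumption)."""
    ips1 = cluster_ips(c1, ips_by_node)
    ips2 = cluster_ips(c2, ips_by_node)
    union = ips1 | ips2
    return len(ips1 & ips2) / len(union) if union else 0.0

# Hypothetical per-node visitor sets.
ips_by_node = {
    "/a": {"10.0.0.1", "10.0.0.2"},
    "/b": {"10.0.0.2", "10.0.0.3"},
    "/c": {"10.0.0.2"},
}
count = shared_ip_count({"/a"}, {"/b", "/c"}, ips_by_node)
ratio = shared_ip_ratio({"/a"}, {"/b", "/c"}, ips_by_node)
```

The ratio form stays between 0 and 1 regardless of cluster size, whereas the raw count tends to grow as clusters absorb more nodes.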
[0014]
Alternatively, for the similarity calculation, a vector may be generated for each cluster in which the access history information 11 corresponding to the cluster forms the terms and the frequency of occurrence of that access history information is the value of each term, and the magnitude of the inner product between vectors may be used as the similarity.
[0015]
The hyperlink configuration analysis unit 22 calculates, for each cluster generated by the access tendency analysis unit 21, the hyperlink cohesion degree (hereinafter simply called the cohesion degree) between the nodes constituting the cluster. For example, if two nodes are defined as connected when one or more hyperlinks exist between them, the cohesion degree can be the total number of node-to-node connections in the cluster divided by the number of ways of choosing two nodes from all the nodes constituting the cluster. That is, when the total number of node-to-node connections is L and the number of nodes is N, the cohesion degree can be calculated as
$$\text{cohesion degree} = \frac{L}{{}_{N}C_{2}}$$
The cohesion degree of each cluster calculated by the hyperlink configuration analysis unit 22 is passed to the hyperlink cohesion degree display unit 23.
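As a worked illustration of the formula, the sketch below counts connected node pairs and divides by the number of pairs, N(N - 1)/2. The function name and sample links are hypothetical. A pair is counted once no matter how many hyperlinks join it or in which direction they point, following the definition above. Returning 0 for a single-node cluster is an assumption, since the text does not define that case.

```python
from itertools import combinations
from math import comb

def cohesion_degree(cluster, links):
    """L / C(N, 2) for one cluster.

    links: set of (source, target) hyperlinks; direction is ignored when
    deciding whether two nodes are connected.
    """
    nodes = list(cluster)
    if len(nodes) < 2:
        return 0.0   # assumption: the single-node case is not defined in the text
    connected = sum(
        1 for a, b in combinations(nodes, 2)
        if (a, b) in links or (b, a) in links
    )
    return connected / comb(len(nodes), 2)

# Hypothetical links: /a and /b link to each other, /b links to /c.
links = {("/a", "/b"), ("/b", "/a"), ("/b", "/c")}
degree = cohesion_degree({"/a", "/b", "/c"}, links)
```

Here two of the three possible pairs are connected, so the cohesion degree is 2/3.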
[0016]
The hyperlink cohesion degree display unit 23 displays the cohesion degree values calculated by the hyperlink configuration analysis unit 22 in an easy-to-use form. FIG. 3 is an explanatory diagram of an example of display by the hyperlink cohesion degree display unit. For example, the range of cohesion degree values can be divided into several segments, and the number of clusters falling into each segment can be displayed as a bar graph. FIG. 3 shows two Web sites (Web site A and Web site B) side by side, with the number of clusters on the vertical axis and the cohesion degree on the horizontal axis. As described above, each cluster is formed on the basis of access history information, so the nodes constituting a cluster tend strongly to be accessed, for example, one after another by the same user. When the cohesion degree of such a node group is high, the user is offered many paths for moving between the nodes and can therefore browse efficiently. Conversely, when the cohesion degree of the node group is low, browsing efficiency suffers. In FIG. 3, it can be seen at a glance that Web site A has many clusters with a low cohesion degree and Web site B has many clusters with a high cohesion degree. From this display, it can easily be judged that Web site B has a better hypertext system configuration than Web site A.
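The counts behind a bar graph like FIG. 3 could be produced as sketched below. The segment count of 5, the function name, and the sample cohesion values for the two sites are all hypothetical; the actual figure's data is not given in the text.

```python
def cohesion_histogram(degrees, segments=5):
    """Count clusters per equal-width cohesion segment over [0, 1]."""
    counts = [0] * segments
    for d in degrees:
        # A cohesion degree of exactly 1.0 is placed in the last segment.
        index = min(int(d * segments), segments - 1)
        counts[index] += 1
    return counts

# Hypothetical cohesion degrees for the clusters of two sites.
site_a = cohesion_histogram([0.1, 0.15, 0.3, 0.9])
site_b = cohesion_histogram([0.5, 0.85, 0.9, 1.0])
```

Plotting the two lists side by side would give the kind of comparison described for FIG. 3, with site A weighted toward the low segments and site B toward the high ones.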
[0017]
The hyperlink cohesion degree display unit 23 can display the cohesion degree in various display forms other than that shown in FIG. 3. FIG. 4 is an explanatory diagram of another example of display by the hyperlink cohesion degree display unit. The example in FIG. 4 displays the relationship between cluster size and cohesion degree. Each point in the figure represents a cluster. As the cluster size, for example, the number of nodes constituting the cluster, the total number of words in the nodes constituting the cluster, or the total file size of the nodes constituting the cluster can be used. The example in FIG. 4 uses the number of nodes constituting the cluster as the cluster size. With such a display, for clusters of the same size, the higher the cohesion degree, the better the configuration can be judged to be. Such a display also gives an overview of the quality of each cluster's configuration with attention to cluster size.
[0018]
FIG. 5 is an explanatory diagram of a display example when a cluster is selected in the other example of display by the hyperlink cohesion degree display unit. While the cohesion degree is displayed as in FIG. 4, selecting a point representing a cluster with a pointing device such as a mouse can display the URLs (identifiers) of the nodes constituting the selected cluster, as shown in FIG. 5. Alternatively, the titles of the nodes may be displayed instead of the URLs. Further, when a node displayed as in FIG. 5 (shown by its URL) is selected with a pointing device such as a mouse, the selected node may be accessed through the network, and its content acquired and displayed. With such a display, the Web site administrator can access a problem location in the Web site, view its content, and edit it, so the configuration of the Web site can be improved easily.
[0019]
FIG. 6 is an explanatory diagram of another display example when a cluster is selected in the other example of display by the hyperlink cohesion degree display unit. As in the example of FIG. 5, while the cohesion degree is displayed as in FIG. 4, selecting a point representing a cluster with a pointing device such as a mouse can also display the nodes constituting the selected cluster and the hyperlinks between those nodes, as shown in FIG. 6. Here, the node labels (N1, N2, ..., N10 in FIG. 6) may be URLs or titles. With such a display, the Web site administrator can easily find problem locations in the Web site. For example, in the display of hyperlinks within the cluster shown in FIG. 6, there is no hyperlink between N1 to N7 and N8 to N10, which shows that users cannot easily move between the two groups.
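The kind of gap visible in FIG. 6 could also be found programmatically by splitting a cluster into groups of mutually reachable nodes. The patent describes only the visual display, so this is an illustrative extension. The function name is hypothetical, hyperlinks are assumed to be traversable in both directions, and the sample links are invented chains rather than the actual links of FIG. 6.

```python
def link_groups(cluster, links):
    """Split a cluster into groups of nodes reachable from one another,
    treating every hyperlink as traversable in both directions."""
    neighbours = {n: set() for n in cluster}
    for a, b in links:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    groups, seen = [], set()
    for start in sorted(cluster):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:                      # depth-first walk from `start`
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical links: a chain N1..N7 and a separate chain N8..N10.
cluster = {f"N{i}" for i in range(1, 11)}
links = {(f"N{i}", f"N{i + 1}") for i in range(1, 7)} | {("N8", "N9"), ("N9", "N10")}
groups = link_groups(cluster, links)
```

More than one group in the result indicates that users who visit all of the cluster's nodes have no hyperlink path between the groups.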
[0020]
FIG. 7 is an explanatory diagram of still another example of display by the hyperlink cohesion degree display unit. This example displays the relationship between the similarity among the nodes in a cluster and the cohesion degree. As the similarity among the nodes in a cluster, for example, the similarity value used by the access tendency analysis unit 21 when the cluster was generated can be used. In such a display, for clusters with the same intra-cluster similarity, the higher the cohesion degree, the better the configuration can be judged to be. Such a display gives an overview of the quality of each cluster's configuration with attention to the similarity among the nodes in the cluster.
[0021]
The hyperlink cohesion degree display unit 23 can display the cohesion degree in an arbitrary form regardless of the display forms shown in the above examples.
[0022]
FIG. 8 is a block diagram showing a second embodiment of the present invention. In the figure, parts that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted. Reference numeral 24 denotes an identifier acquisition unit. The identifier acquisition unit 24 acquires from the access history information 11, for each node on the Web server 1, the identifiers (for example, IP addresses) of the computers that accessed the node during a predetermined fixed period.
[0023]
As in the first embodiment, the access tendency analysis unit 21 performs clustering on the nodes on the Web server 1 using the access history information 11 corresponding to each node. When the similarity is calculated in this clustering process, the degree of overlap of the computer identifiers acquired by the identifier acquisition unit 24 can be used.
[0024]
FIG. 9 is a flowchart illustrating an example of processing in the identifier acquisition unit. First, in S31, only the entries for accesses within a predetermined period are extracted from the access history information 11, and the rest are discarded. In S32, the number N of distinct identifiers among all the computer identifiers (IP addresses) present in the access history information 11 extracted in S31 is obtained. Next, in S33, an N-dimensional vector is generated for each of the nodes present in the access history information 11. Here, the terms of the vector correspond to the distinct computer identifiers (IP addresses) arranged in ascending (or descending) alphabetical order. The initial value of each term is 0. Next, in S34, the computer identifier (IP address) of the computer that accessed each node is taken from the access history information 11, and 1 is added to the corresponding term of that node's vector. This is done in turn for all of the access history information 11.
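Steps S31 to S34 can be sketched as below. The record layout (IP address, timestamp, URL), the integer timestamps, and the function name are assumptions made for illustration; a real access log would need parsing first.

```python
def build_vectors(log, start, end):
    """log: iterable of (ip, timestamp, url) records.

    Returns (ips, vectors): the sorted distinct IP addresses and, for each
    URL, a list of access counts aligned with `ips`.
    """
    # S31: keep only accesses inside the chosen period.
    kept = [(ip, url) for ip, ts, url in log if start <= ts <= end]
    # S32: distinct identifiers in ascending order; N is len(ips).
    ips = sorted({ip for ip, _ in kept})
    index = {ip: i for i, ip in enumerate(ips)}
    # S33: one N-dimensional zero vector per node.
    vectors = {url: [0] * len(ips) for _, url in kept}
    # S34: add 1 to the matching term for every access.
    for ip, url in kept:
        vectors[url][index[ip]] += 1
    return ips, vectors

# Hypothetical log; the last record falls outside the period.
log = [
    ("10.0.0.1", 5, "/a"),
    ("10.0.0.2", 6, "/a"),
    ("10.0.0.1", 7, "/a"),
    ("10.0.0.2", 8, "/b"),
    ("10.0.0.9", 99, "/b"),
]
ips, vectors = build_vectors(log, start=0, end=10)
```

In this sample, "/a" was accessed twice from 10.0.0.1 and once from 10.0.0.2, so its vector is [2, 1].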
[0025]
Using the N-dimensional vectors obtained in this way, the access tendency analysis unit 21 can perform clustering with the magnitude of the inner product between vectors as the similarity. When two clusters are merged, the sum of their vectors becomes the vector of the merged cluster.
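A minimal sketch of the inner-product similarity and the merge-by-sum rule follows. The function names and the sample vectors are hypothetical.

```python
def inner_product(u, v):
    """Similarity between two clusters' access-count vectors."""
    return sum(x * y for x, y in zip(u, v))

def merge_vectors(u, v):
    """Vector of the cluster formed by merging two clusters."""
    return [x + y for x, y in zip(u, v)]

# Hypothetical vectors over three distinct IP addresses.
a, b, c = [2, 1, 0], [1, 1, 0], [0, 0, 3]
sim_ab = inner_product(a, b)      # a and b share visitors
sim_ac = inner_product(a, c)      # a and c share none
ab = merge_vectors(a, b)
sim_ab_c = inner_product(ab, c)
```

Since the inner product is not normalized, heavily visited nodes score higher against everything; whether to normalize is a design choice the text leaves open.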
[0026]
The subsequent processing is the same as that in the first embodiment. The hyperlink configuration analysis unit 22 calculates the cohesion degree, and the hyperlink cohesion degree display unit 23 displays the cohesion degree. At this time, the cohesion degree can be displayed in various display forms as described above or in various other display forms.
[0027]
FIG. 10 is a block diagram showing a third embodiment of the present invention. In the figure, parts that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted. Reference numeral 25 denotes a computer identifier shaping unit. Among the accesses to the Web server 1, the computer identifier shaping unit 25 excludes those that come not from a computer operated by a user (a human) but from a computer such as an information collecting robot, which accesses the Web exhaustively and collects information automatically. For example, whether an access comes from an information collecting robot can be judged by whether there has been an access to a special node that information collecting robots access by convention (on the Web, for example, a file named robots.txt placed directly under the root). Alternatively, an information collecting robot can be identified by behavior characteristic of such robots, namely exhaustively accessing a large number of nodes in a short period, or by whether the computer identifier is that of a known information collecting robot. For a computer judged to be an information collecting robot, the access history information 11 associated with its computer identifier can be kept from being acquired by the identifier acquisition unit 24.
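The three robot checks described above might be combined as sketched below. The function names, the record layout (IP address, timestamp in seconds, URL), and the thresholds `max_nodes` and `window` are hypothetical; the text gives no concrete values.

```python
from collections import defaultdict

def robot_ips(log, known_robots=frozenset(), max_nodes=100, window=60):
    """Return the IP addresses judged to be information collecting robots."""
    robots = set(known_robots)                 # check 3: known robot identifiers
    hits = defaultdict(list)
    for ip, ts, url in log:
        if url == "/robots.txt":               # check 1: fetched robots.txt
            robots.add(ip)
        hits[ip].append((ts, url))
    for ip, entries in hits.items():           # check 2: many nodes in a short time
        entries.sort()
        for i, (t0, _) in enumerate(entries):
            urls = {u for t, u in entries[i:] if t - t0 <= window}
            if len(urls) > max_nodes:
                robots.add(ip)
                break
    return robots

def drop_robots(log, robots):
    """Remove every record whose IP address was judged to be a robot."""
    return [rec for rec in log if rec[0] not in robots]

# Hypothetical log with deliberately small thresholds.
log = [
    ("10.0.0.1", 0, "/robots.txt"), ("10.0.0.1", 1, "/a"),
    ("10.0.0.2", 2, "/a"),
    ("10.0.0.3", 3, "/p0"), ("10.0.0.3", 4, "/p1"), ("10.0.0.3", 5, "/p2"),
]
robots = robot_ips(log, max_nodes=2, window=10)
clean = drop_robots(log, robots)
```

The burst check rescans the remaining entries for each starting record, which is simple but quadratic per address; it is meant only to show the idea.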
[0028]
This excludes accesses by computers, such as information collecting robots, that access the Web exhaustively and collect information automatically, and removes their influence on the analysis results, so that analysis results correctly reflecting users' access trends can be obtained.
[0029]
Note that other configurations and operations in the third embodiment are the same as those in the second embodiment described above.
[0030]
FIG. 11 is a block diagram showing a fourth embodiment of the present invention. In the figure, parts that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted. Reference numeral 26 denotes a hyperlink cohesion degree evaluation unit. The hyperlink cohesion degree evaluation unit 26 compares the cohesion degree of each cluster obtained by the hyperlink configuration analysis unit 22 with a predetermined threshold and passes the clusters whose cohesion degree is smaller than the threshold to the hyperlink cohesion degree display unit 23. As a result, the hyperlink cohesion degree display unit 23 can easily obtain the clusters with a low cohesion degree, that is, with an inferior configuration, together with the nodes constituting them.
[0031]
FIG. 12 is an explanatory diagram of still another example of display by the hyperlink cohesion degree display unit. In the display example of FIG. 12, the clusters and the identifiers (URLs) of the nodes constituting them are listed in ascending order of cohesion degree. The cluster is shown in the “ID” column. Here, “ID” is a number uniquely assigned to refer to a cluster. For example, numbers may be assigned in order of generation when the access tendency analysis unit 21 generates the clusters. Because such a display starts from the parts with the poorest configuration, the parts that are bottlenecks for users can be identified easily. Of course, in the fourth embodiment as well, the cohesion degree can be displayed in the various display forms shown in the first embodiment or in other display forms. The identifier acquisition unit 24, the computer identifier shaping unit 25, and so on described in the second and third embodiments may also be provided.
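The threshold comparison of the fourth embodiment and the ascending listing of FIG. 12 can be sketched together. The function name, the threshold of 0.5, and the sample clusters are hypothetical.

```python
def weak_clusters(clusters, degrees, threshold):
    """Return (id, cohesion degree, sorted URLs) rows for the clusters whose
    cohesion degree is below `threshold`, lowest cohesion first."""
    rows = [
        (cluster_id, degrees[cluster_id], sorted(nodes))
        for cluster_id, nodes in clusters.items()
        if degrees[cluster_id] < threshold
    ]
    return sorted(rows, key=lambda row: row[1])

# Hypothetical clusters keyed by ID, with precomputed cohesion degrees.
clusters = {1: {"/a", "/b"}, 2: {"/c", "/d", "/e"}, 3: {"/f", "/g"}}
degrees = {1: 1.0, 2: 0.33, 3: 0.0}
rows = weak_clusters(clusters, degrees, threshold=0.5)
```

Cluster 1 is dropped because its cohesion degree is not below the threshold, and the remaining rows start with the least cohesive cluster.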
[0032]
The above-described embodiments can also be realized by a computer program. In that case, the program and the data it uses can be recorded on a computer-readable storage medium. A storage medium is something that can cause a change in the state of energy such as magnetism, light, or electricity in a reading device provided in the hardware resources of a computer, according to the description content of the program, and thereby transmit the description content of the program to the reading device in the form of a corresponding signal. Examples include a magnetic disk, an optical disk, a CD-ROM, and memory built into a computer.
[0033]
[Effects of the Invention]
As is clear from the above description, according to the present invention, clustering is performed on the nodes constituting a hypertext system based on users' access history information, and the hyperlink cohesion degree between the nodes constituting each obtained cluster is calculated. This hyperlink cohesion degree makes it easy to judge the quality of the configuration of the hypertext system. This has the effect that, for example, a Web administrator can modify the parts whose configuration is problematic and build a hypertext system with a better configuration.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of an example of the Web.
FIG. 3 is an explanatory diagram of an example of display by the hyperlink cohesion degree display unit.
FIG. 4 is an explanatory diagram of another example of display by the hyperlink cohesion degree display unit.
FIG. 5 is an explanatory diagram of a display example when a cluster is selected in another example of display by the hyperlink cohesion degree display unit.
FIG. 6 is an explanatory diagram of another display example when a cluster is selected in another example of display by the hyperlink cohesion degree display unit.
FIG. 7 is an explanatory diagram of still another example of display by the hyperlink cohesion degree display unit.
FIG. 8 is a configuration diagram showing a second embodiment of the present invention.
FIG. 9 is a flowchart illustrating an example of processing in the identifier acquisition unit.
FIG. 10 is a configuration diagram showing a third embodiment of the present invention.
FIG. 11 is a configuration diagram showing a fourth embodiment of the present invention.
FIG. 12 is an explanatory diagram of still another example of display by the hyperlink cohesion degree display unit.
[Explanation of symbols]
1 ... Web server, 2 ... Hypertext analysis apparatus, 11 ... Access history information, 21 ... Access tendency analysis unit, 22 ... Hyperlink structure analysis unit, 23 ... Hyperlink cohesion degree display unit, 24 ... Identifier acquisition unit, 25 ... Computer identifier shaping unit, 26 ... Hyperlink cohesion degree evaluation unit.

Claims

1. A hypertext analysis apparatus comprising: access tendency analysis means for clustering the nodes constituting a hypertext system based on access history information for the hypertext system; hyperlink structure analysis means for calculating, for each cluster obtained by the access tendency analysis means, a hyperlink cohesion degree between the nodes constituting the cluster; and hyperlink cohesion degree display means for displaying the hyperlink cohesion degree obtained by the hyperlink structure analysis means.

2. The hypertext analysis apparatus according to claim 1, further comprising identifier acquisition means for holding, for each node constituting the hypertext system and with duplicates permitted, the identifiers of the computers that accessed the node during a predetermined period, wherein the access tendency analysis means performs clustering by determining the degree of overlap among the identifiers held by the identifier acquisition means.

3. The hypertext analysis apparatus according to claim 2, further comprising identifier shaping means for identifying the identifier of a computer that automatically collects information by exhaustively accessing the hypertext system, and for deleting that identifier from the list held by the identifier acquisition means.

4. The hypertext analysis apparatus according to claim 1, further comprising hyperlink cohesion degree evaluation means for determining whether or not the hyperlink cohesion degree satisfies a predetermined condition, wherein the hyperlink cohesion degree display means displays the hyperlink cohesion degree in accordance with the determination result of the hyperlink cohesion degree evaluation means.

5. The hypertext analysis apparatus according to claim 1, wherein the hyperlink cohesion degree display means further displays the nodes constituting each cluster obtained by the access tendency analysis means and the hyperlinks between those nodes.

6. A hypertext analysis method, wherein: access tendency analysis means clusters the nodes constituting a hypertext system based on access history information for the hypertext system; hyperlink structure analysis means calculates, for each cluster obtained, a hyperlink cohesion degree between the nodes constituting the cluster; and hyperlink cohesion degree display means displays the calculated hyperlink cohesion degree.

7. A storage medium storing a hypertext analysis program for causing a computer to execute: access tendency analysis processing for clustering the nodes constituting a hypertext system based on access history information for the hypertext system; hyperlink cohesion degree calculation processing for calculating, for each cluster obtained by the access tendency analysis processing, a hyperlink cohesion degree between the nodes constituting the cluster; and hyperlink cohesion degree display processing for displaying the hyperlink cohesion degree obtained by the hyperlink structure analysis processing so as to indicate the superiority or inferiority of the configuration of the hypertext system.