JP3705330B2

JP3705330B2 - Hypertext structure change support device and method, and storage medium storing hypertext structure change support program

Info

Publication number: JP3705330B2
Application number: JP34507198A
Authority: JP
Inventors: 裕樹加藤; 雄大中山
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-12-04
Filing date: 1998-12-04
Publication date: 2005-10-12
Anticipated expiration: 2018-12-04
Also published as: JP2000172699A

Description

【０００１】
【発明の属する技術分野】
本発明は、インターネットでの情報提供を行うＷｅｂサイトにおいて、Ｗｅｂサイトで公開されているコンテンツ（ハイパーテキスト情報）に対するユーザのアクセス傾向とコンテンツ分布傾向との類似度からハイパーテキスト構造の変更支援を行う装置および方法と、その支援装置をコンピュータで実現するための支援プログラムを記録した記憶媒体に関するものである。
【０００２】
【従来の技術】
近年、ＷｏｒｌｄＷｉｄｅＷｅｂ（以下Ｗｅｂと略す）において情報提供を行う需要が増大している。企業においてもＷｅｂで情報提供を行うためのＷｅｂサイトを持ち、Ｗｅｂサイトで自社の製品の広告活動、あるいは電子商取引などの手段による物品、サービスの売買が盛んに行われつつある。そのような中で、Ｗｅｂにアクセスしてきたユーザの動向を知り、ユーザ動向に合わせて提供すべき情報の構造を速やかに変更していくための技術に関する需要が高まっている。
【０００３】
Ｗｅｂサイトで提供される情報はコンテンツと呼ばれ、ハイパーテキスト（たとえば、ＨＴＭＬ）によって表現される。個々のハイパーテキスト情報（以降ノードと呼ぶ）はリンクによって結ばれる。Ｗｅｂサイトにアクセスしたいユーザは、リンクをたどることによってＷｅｂサイト中のノードにアクセスし、様々な情報を入手することができる。またＷｅｂサーバは、ユーザがＷｅｂサーバにアクセスした履歴を記録することができる。記録された内容を以降ではログと呼ぶ。例えば、ログには情報にアクセスしてきたコンピュータ（ユーザがＷｅｂサーバにアクセスするために使用しているコンピュータ、又はユーザの利用情報を中継しているコンピュータ）のＩＰアドレス、アクセスしてきた時刻、情報にアクセスするためにＷｅｂサーバに送った命令が記録されている。命令には、ユーザがアクセスしたノードのサーバ上での識別子（例えばＵＲＬ）が含まれる。
【０００４】
アクセスログを利用してユーザの動向を知る従来技術としては、例えば、Ｍｉｎｇ−Ｓｙａｎほか，“ＤａｔａＭｉｎｉｎｇｆｏｒＰａｔｈＴｒａｖｅｒｓａｌＰａｔｔｅｒｎｓｏｎｔｈｅＷｅｂ”，ＩＥＥＥＴｒａｎｓ．ｏｎＫｎｏｗｌｅｄｇｅａｎｄＤａｔａＥｎｇｉｎｅｅｒｉｎｇ，１９９８等に記載されている技術がある。この文献に記載されている技術は、ＭＦ（ＭａｘｉｍａｌＦｏｒｗａｒｄｒｅｆｅｒｅｎｃｅｓ）という系列を取り出す技術である。図１５は、Ｗｅｂサイト構造とユーザのアクセス経路の一例の説明図である。図１５に示した例では、ＡないしＨの各ノードが図示したようにリンクされているとする。このとき、あるユーザが図中の太線で示したように、Ａ→Ｂ→Ｃ→Ｄ→Ｅ→Ｆ→Ｇの順にノードを閲覧したとする。この場合には、［ＡＢＣＤ］、［ＡＢＥ］、［ＡＦＧ］というＭＦが得られる。アクセスのあったユーザすべてに対してＭＦを求め、ＭＦの部分列の出現回数をカウントすることでユーザのアクセスパターンを得ることができる。
【０００５】
この技術では、ユーザがリンクをたどったノードの直線的な系列を得ることができる。しかし、ユーザがあるノードから戻ったノードを起点にたどり直した系列については、異なる系列として得られる。そのため、ユーザのアクセスしたノード群を一塊として得ることは困難である。
【０００６】
別の従来技術として、例えば、Ｊ．Ｂｏｒｇｅｓほか，“ＭｉｎｉｎｇＡｓｓｏｃｉａｔｉｏｎＲｕｌｅｓｉｎＨｙｐｅｒｔｅｘｔＤａｔａｂａｓｅｓ”，ＩｎＰｒｏｃ．ｏｆＫＤＤ−９８に記載されている技術がある。この技術は、ログからリンクで結ばれたノード間のユーザの遷移数を取り出し、ユーザのアクセスしたノード系列の相関ルールを得る技術である。この技術では、ノード間のユーザの遷移数のみを見ているため、同一のユーザが必ずその系列をたどっているとは限らない。図１６は、Ｗｅｂサイト構造とノード間のユーザの遷移数の一例の説明図である。図中、リンクをたどった方向を矢線によって示し、そのたどった回数を数値により示している。この従来技術によれば、図１６に示した例においては、例えば、Ｂ→ＡのリンクをたどったユーザはＡ→Ｆのリンクをたどる傾向があるという相関関係が得られる。しかし、同じユーザがノードＢ，Ａ，Ｆすべてを訪れていることを示すものではなく、ユーザのアクセス動向を正確に反映しているとはいえない。
【０００７】
加えて、上記２つの従来技術ではログからユーザのアクセスパターンを得ることで頻度の高いアクセス経路を知ることはできるが、コンテンツをあわせて統合的に処理を行っていないため、アクセスパターンをもとにどのコンテンツを変更すべきかといったＷｅｂサイトの維持、管理に関する支援を行うことは困難であった。
【０００８】
【発明が解決しようとする課題】
本発明は、上述した事情に鑑みてなされたもので、ユーザのアクセス傾向と、コンテンツの分布傾向間の類似度とを比較することによって、ユーザの動向を把握するとともに、ユーザの動向に合わせたＷｅｂサイトのコンテンツの変更の支援を可能としたハイパーテキスト構造変更支援装置および方法と、その支援装置をコンピュータで実現するための支援プログラムを記録した記憶媒体を提供することを目的とするものである。
【０００９】
【課題を解決するための手段】
本発明は、情報システムによって管理されたハイパーテキスト情報群に対するユーザによるアクセス頻度情報に基づいてハイパーテキスト集合から階層的なログクラスタを生成し、また、ハイパーテキスト情報群の個々のハイパーテキストの類似度を計算し該類似度に基づいてハイパーテキスト集合からコンテンツクラスタを生成し、ログクラスタとコンテンツクラスタとの間で対応づけできない部分構造を特定し、その対応付けできなかったコンテンツクラスタ内の要素であるハイパーテキストを提示するものである。ログクラスタとコンテンツクラスタとの間で対応づけできなかった場合には、例えば同じような内容のハイパーテキストに対してユーザからの参照頻度が明らかに異なり、Ｗｅｂサイト構造が不適切である、あるいは、リンクが適切に設けられていないなどの可能性がある。このようなハイパーテキストを提示することによって、ハイパーテキストの構造の変更を促し、構造変更を支援することができる。
【００１０】
【発明の実施の形態】
図１は、本発明のハイパーテキスト構造変更支援装置の実施の一形態を示すブロック図である。図中、１はＷｅｂサーバ、２はログクラスタリング部、３はコンテンツクラスタリング部、４は意図ずれ検出部、５は意図ずれ提示部、１１はログ記録部、１２はコンテンツ提供部である。
【００１１】
Ｗｅｂサーバ１は、ユーザに提供したい情報（コンテンツ）を貯えており、ネットワーク上で情報を発信する。Ｗｅｂサーバ１は、ログ記録部１１およびコンテンツ提供部１２を有している。コンテンツ提供部１２は、ユーザのアクセスに従ってコンテンツを提供する。また、ログ記録部１１は、ユーザからのアクセスがあるごとに、ユーザを識別するためのユーザ識別子（ＩＰアドレス）、アクセスした時刻、ユーザのアクセスしたハイパーテキスト（以降、ノードと呼ぶ）のあるアドレス（ＵＲＬ）を記録する。Ｗｅｂサーバ１で記録されるＩＰアドレスは、実際にユーザが利用しているクライアントコンピュータのほか、クライアントコンピュータからのアクセスを代行するプロキシサーバのアドレスであることもある。後者の場合は、複数のユーザが同一のＩＰアドレスを用いることになる。以下の説明では、Ｗｅｂサーバ１にアクセスしたコンピュータのＩＰアドレスと日付によりユーザの識別を行うものとする。すなわち同じＩＰアドレスからのアクセスであってもアクセスした日付が異なる場合には、異なるユーザとして識別する。
【００１２】
本発明において用いるログは、上述のものに限らず、例えば特開平１０−２２４３４９号公報に記載されているように、プロキシサーバのログをあわせて用いることで、ユーザを識別したものでもよい。また、例えば特開平１０−２０７８３８号公報に記載されているＪａｖａアプレット等のクライアント側にアクセス通知の機構をもつ仕組みによって記録されたログでも、ユーザが識別でき、アクセスした情報の場所、アクセスした順序がわかるものであればよい。
【００１３】
ログクラスタリング部２は、Ｗｅｂサーバ１からノードとリンクで形成されるリンク構造と、ログを取得し、ログに基づいてノードをクラスタリングしてログクラスタを形成する。このログクラスタリング部２の詳細は後述する。
【００１４】
コンテンツクラスタリング部３は、Ｗｅｂサーバ１からコンテンツを取得し、例えば内容の類似度に基づいてクラスタリングして、コンテンツクラスタを形成する。このコンテンツクラスタの形成時には、リンク構造を参照せずに行う。
【００１５】
意図ずれ検出部４は、ログクラスタリング部２によって得られたログクラスタに対して、コンテンツクラスタリング部３によって得られたコンテンツクラスタを対応づける。このとき、対応づけを行えなかったコンテンツクラスタを、意図がずれている可能性があるとして検出する。
【００１６】
意図ずれ提示部５は、意図ずれ検出部４において対応づけを行えずに意図がずれている可能性があるとして検出されたコンテンツクラスタ内の要素であるハイパーテキストを、Ｗｅｂサーバ１の管理者などに提示する。
【００１７】
図２は、本発明のハイパーテキスト構造変更支援装置の実施の一形態におけるログクラスタリング部の一例を示すブロック図である。図中、２１はアクセス情報記憶部、２２はアクセス数カウント部、２３はクラスタ構成部、２４はクラスタ生成高速化部、２５はクラスタ生成制限部である。
【００１８】
アクセス情報記憶部２１は、ログの情報から、識別子と、その識別子をもつユーザがアクセスしたノードを時系列順に記録した情報とを組にして記憶する。ここでの識別子は、上述のように同じＩＰアドレスからのアクセスであってもアクセスした日付が異なる場合には、異なるユーザとして識別して付与した識別子である。
【００１９】
アクセス数カウント部２２は、与えられたハイパーテキスト集合すべてに与えられた順序制約を満たしてアクセスした履歴をもつ識別子の数を、アクセス情報記憶部２１に記憶されている情報を用いて計算する。
【００２０】
クラスタ構成部２３は、予め定められた起点ノードのみからなるクラスタを頂点とし、起点ノードからリンクを１つたどることで到達可能でありかつログが取得されているノード集合から一つずつ取り出したノードと起点ノードの組からなるクラスタ集合を一つ下位の階層の新たなクラスタ候補として生成する。生成したクラスタ候補が、アクセス数カウント部２２により予め与えられた閾値以上のアクセス数を持つことが計算された場合には、新たなクラスタとして起点ノードの子クラスタとして生成する。生成した子クラスタについても、それぞれについて、子クラスタに含まれるノード集合からリンクを一つたどることで到達可能なノード集合に含まれかつログが取得されておりかつその子クラスタには含まれていないノードを一つずつ取り出し、子クラスタに含まれるノード集合に加えることでさらに一つ下位の階層のクラスタ候補を生成する。さらに、クラスタ候補が予め与えられた閾値以上のアクセス数を持つことが計算された場合には、子クラスタを親クラスタとする新たな子クラスタを生成する過程を新たな子クラスタが生成されなくなるまで繰り返す。このようにして、階層的クラスタの生成を行う。
【００２１】
クラスタ生成高速化部２４は、クラスタ構成部２３において階層的クラスタを生成する際に、新たに生成されるクラスタ候補に含まれるノード集合の部分集合をもつ上位階層のクラスタにおいて、その上位階層のクラスタに対するユーザからのアクセス数が予め定められた閾値以下である場合に、そのクラスタ候補を生成しないようにする。これによって、生成しないクラスタ候補以後のクラスタ生成を打ち切って処理時間を短縮することができる。
【００２２】
クラスタ生成制限部２５は、予め定められたノードについては、新たなクラスタ候補を生成する際に、そのノードのみからリンクを一つたどることで到達可能なノード集合については候補を生成する際に用いないようにする。これによって、生成する子クラスタ候補の数を制限することができる。
【００２３】
次に、本発明のハイパーテキスト構造変更支援装置の実施の一形態における動作の一例について説明する。図３は、ログクラスタリング部２における動作の一例を示すフローチャートである。また、図４は、Ｗｅｂサーバ１に貯えられているコンテンツの構造の具体例の説明図である。以下の説明では、図４に示すようなコンテンツが図示のようなリンクによって関連づけられているものとし、この図４を用いながら説明を行う。なお、図３に示す主な処理は、クラスタ構成部２３において行われる。
【００２４】
まずＳ３１において、クラスタに含まれるすべてのノードを訪れたユーザ数に関するアクセス数の閾値を読み込む。またＳ３２において、クラスタを生成する際にリンクをたどらないノードを記した非継承ノード集合を読み込む。この例では、アクセス数の閾値を２０、非継承ノード集合を［サイトの入り口］として予め与える。なお、集合は‘［’と‘］’で囲んで示すものとする。
【００２５】
またＳ３３において、Ｗｅｂサーバ１から、Ｗｅｂサイト上のノードのリンク関係を事前に抽出しておいたリンクテーブルを読み込む。このリンクテーブルを生成する際に、そのＷｅｂサイトでログを取得することのできない他のサイトのノードへのリンクは取り除かれる。
【００２６】
併せてＳ３４において、予め与えられた起点ノードについて読み込む。この例では、図４に示す構造を持つＷｅｂサイトにおいて起点ノードであるトップページ「サイトの入り口」を用いる。トップページからであれば、Ｗｅｂサイトが公開しているすべてのノードにリンクをたどることでアクセスできる。
【００２７】
次にＳ３５において、起点ノードのみからなるクラスタを初期クラスタとして生成し、クラスタ集合に加える。図４に示す例では、クラスタ集合は［サイトの入り口］となる。
【００２８】
Ｓ３６において、クラスタ集合が空であるか否かを判定する。クラスタ集合が空であれば、処理が終了したことを示すので、Ｓ３９に進んで、それまでに生成されたクラスタ集合をログクラスタとして出力して、ログクラスタリング部２の処理を終える。また、クラスタ集合が空でなければＳ３７へ進んでクラスタ候補の生成を行う。最初の状態では、初期クラスタが生成されているため、Ｓ３７のクラスタ候補の生成処理に移る。
【００２９】
Ｓ３７において、クラスタ集合からクラスタを取り出し、そのクラスタと、そのクラスタ中に含まれるノードからのリンク先ノードとの組み合わせにより、クラスタ候補を生成する。図４に示した例では、ノード「サイトの入り口」はノード「プリンタ」、「コピー」にリンクされているので、新たなクラスタ候補としてノード集合が［サイトの入り口，プリンタ］からなる子クラスタと、［サイトの入り口，コピー］からなる子クラスタがクラスタ候補として生成される。
【００３０】
次にＳ３８において、生成されたクラスタ候補がクラスタとなりうる基準を満たすか否かを調べ、クラスタ基準を満たすものを選出する。図５は、クラスタ候補がクラスタ基準を満たすか否かを判定する処理の一例を示すフローチャートである。Ｓ４１において、同じノード集合を持つクラスタが存在するか否かを判定し、同じノード集合を持つクラスタが存在する場合には候補から外す。
【００３１】
次にＳ４２において、クラスタ候補の部分集合をもつクラスタが過去にクラスタ基準を満たさないとして除かれているか否かを判定し、クラスタ候補の部分集合をもつクラスタが過去に除かれている場合には、そのクラスタ候補を候補から外す。この処理は、クラスタ生成高速化部２４において行われる。
【００３２】
次にＳ４３において、クラスタに含まれるノード集合すべてにアクセスし、かつ起点ノードに一番目にアクセスしているユーザの数（ユーザ識別子の数）をアクセス数カウント部２２を用いてカウントする。その数がＳ３１で読み込んだアクセス数の閾値未満の場合には、クラスタ候補から外す。ユーザ数のカウント方法としては、例えば、各ノードにアクセスしてきた順序までが一致するもののみをカウントする手法によっても行うことができる。この場合には、クラスタに含まれるノード集合は順序情報も併せ持ち、順序情報も一致するときのみ同じクラスタとみなすこととする。
【００３３】
Ｓ３８では、このように図５に示した判定基準を合格したクラスタ候補をクラスタ集合に追加する。そしてＳ３６に戻る。
【００３４】
Ｓ３６に戻り、クラスタ集合が空でなければＳ３７およびＳ３８の処理を繰り返すことになる。このとき、新たに生成されたクラスタ集合に複数のクラスタが存在する場合には、クラスタ数の一番小さいものを取り出し、同様にクラスタ候補の生成を行う。以降、クラスタ候補の生成を行う場合には、クラスタに含まれるノードからリンクを一つたどることでアクセス可能なノードを用いて行う。これは、ユーザがアクセスした際に、ブラウザのバック機能を使ってノードを戻り、他のページにアクセスしていることも考えられるためである。
【００３５】
ただし、Ｗｅｂサーバのコンテンツ中でインデックスとなっているノードについては、リンク先を子クラスタ候補の生成に用いないように、クラスタ生成制限部２５により制限し、組み合わせの数の増大を防ぐことができる。図４に示す例では、ノード「サイトの入り口」がインデックスとなっているとすれば、以降のクラスタを生成する際には、クラスタ［サイトの入り口］からクラスタ［コピー，プリンタ］への展開は行われない。このインデックスノードについては、ユーザからの明示的な入力によってクラスタを生成しないようにすることができる。また、ここではコンテンツの類似度によるクラスタリングとの対応づけを行うため、コンテンツ間の類似度が低ければ意図ずれ検出部４では対応するコンテンツクラスタが割り当てられないため、それらのノードを含むクラスタが生成されなくてもよい。そこで、事前にあるノードからリンクで繋がれているノード間の類似度を計算し、類似度の平均が閾値以下であるノードを検出し、検出されたノードをインデックスノードとしてクラスタの生成を制限することが可能である。
【００３６】
また、Ｓ３７においてクラスタ候補を生成する際には、既に同一のノード集合を持つ子クラスタが他の親クラスタから生成されている場合には、既に生成された子クラスタを、その子クラスタを生成しようとした親クラスタの子クラスタとする関係を生成することができる。
【００３７】
図６は、ログクラスタの一例の説明図である。上述のようなログクラスタリング部２における処理によって、図４に示したコンテンツの構造から図６に示すようなログクラスタが出力される。すなわち、まず初期クラスタとしてクラスタ１［サイトの入り口］が生成された後、ノード［サイトの入り口］からリンクを一つたどれるノード「プリンタ」、「コピー」から、クラスタ２［サイトの入り口，プリンタ］と、クラスタ３［サイトの入り口，コピー］が生成される。同様にして、クラスタ２からクラスタ４，クラスタ５が生成され、クラスタ３からクラスタ６が生成される。クラスタ３をもとにクラスタ候補となったクラスタ候補１［サイトの入り口，コピー，機種Ｄ］は、アクセス数が２０未満であるため、クラスタとはならない。
【００３８】
クラスタ４からは、クラスタ８およびクラスタ９が生成される。このとき、クラスタ５からも子クラスタ候補としてクラスタ８と同じクラスタが生成される。この場合には、上述のように、すでに生成されているクラスタ８に対して、クラスタ５の子クラスタとしての関係も生成する。同様に、クラスタ８からクラスタ１０が生成されるが、クラスタ９の子クラスとしての関係も生成される。
【００３９】
クラスタ６からは、ノード集合［サイトの入り口，コピー，機種Ｃ，機種Ｄ］を持つクラスタ候補２が生成できるが、このノード集合の部分集合を持つクラスタ候補１が以前に基準を満たさないため、図５のＳ４２の条件によって取り除かれ、クラスタ候補から外される。
【００４０】
このようにして生成されたクラスタは、図６に示すように有向非循環グラフ（ＤＡＧ）構造を持ち、それぞれがどのクラスタを子クラスタとして持つのか、どのクラスタを親クラスタとして持つかという関係が生成されている。なお、生成したログクラスタも親子の関係を有する木構造となるため、図６に示した例では図４と類似した構造となっているが、コンテンツの構造とは別の構造としてログクラスタが形成されることに留意されたい。
【００４１】
次に、コンテンツクラスタリング部３によりコンテンツクラスタを生成する処理について説明する。コンテンツクラスタリング部３は、各ノードに対して例えば特願平９−１５３３８７号で示された手法等を用いて各ノードのプロファイルを生成し、すべての２つのノードの組み合わせについて類似度を計算し、記憶する。そして、計算された類似度をもとに、既知の手法でクラスタリングを行う。クラスタリングの手法としては、例えばＥｌｌｅｎＭ．Ｖｏｏｒｈｅｅｓ，“ＩＭＰＬＥＭＥＮＴＩＮＧＡＧＧＬＯＭＥＲＡＴＩＶＥＨＩＥＲＡＲＣＨＩＣＣＬＵＳＴＥＲＩＮＧＡＬＧＯＲＩＴＨＭＳＦＯＲＵＳＥＩＮＤＯＣＵＭＥＮＴＲＥＴＲＩＥＶＡＬ”，ＩｎｆｏｒｍａｔｉｏｎＰｒｏｃｅｓｓｉｎｇ＆Ｍａｎａｇｅｍｅｎｔ，Ｖｏｌ．２２，Ｎｏ．６，ｐｐ．４６５−４７６，１９８６などに記載されているＨｉｅｒａｒｃｈｉｃａｌＣｌｕｓｔｅｒｉｎｇＡｌｇｏｒｉｔｈｍのＣｏｍｐｌｅｔｅｌｉｎｋを用いてクラスタリングを行うことができる。
【００４２】
図７は、コンテンツクラスタの一例の説明図である。図４に示したようなコンテンツがＷｅｂサーバ１に貯えられているとき、これらのコンテンツを階層的にクラスタリングすることによって、図７に示すようなコンテンツクラスタの構造が得られる。すなわち、ノード「機種Ａ」とノード「機種Ａの機能」の内容が類似しているため、クラスタ［機種Ａ，機種Ａの機能Ａ］が得られる。また、これらのノードとノード「機種Ｂ」とが類似していることからクラスタ［機種Ａ，機種Ａの機能Ａ，機種Ｂ］が得られる。一方、ノード「機種Ｃ」とノード「機種Ｄ」の内容が類似していることからクラスタ［機種Ｃ，機種Ｄ］が得られる。最後にこれらのノードが類似していることから、クラスタ［機種Ａ，機種Ａの機能Ａ，機種Ｂ，機種Ｃ，機種Ｄ］が得られる。このようにして、［機種Ａ，機種Ａの機能Ａ］，［機種Ａ，機種Ａの機能Ａ，機種Ｂ］，［機種Ｃ，機種Ｄ］，［機種Ａ，機種Ａの機能Ａ，機種Ｂ，機種Ｃ，機種Ｄ］の４つのクラスタが得られる。
【００４３】
なお、このコンテンツクラスタリング部３の処理は、各ノード間のリンクは考慮せず、例えば内容についての類似度によりクラスタリングする。そのため、図４に示した例では内容的に類似したコンテンツがリンクによって接続されているが、コンテンツクラスタリングによってリンク関係では離れたノードが１つのクラスタにまとめられることもある。あるいは、何らかの理由でリンクがないノードについても、コンテンツクラスタに含めることができる。
【００４４】
図８は、意図ずれ検出部４および意図ずれ提示部５における動作の一例を示すフローチャートである。まずＳ５１およびＳ５２において、上述のようにして生成されたログクラスタおよびコンテンツクラスタを読み込む。ここでは、図６に示したログクラスタおよび図７に示したコンテンツクラスタを読み込んだものとする。
【００４５】
次にＳ５３において、ログクラスタの葉クラスタ（そのクラスタのもつノード集合に新たにノードを追加したクラスタの生成が行われなかったクラスタ）に含まれるコンテンツクラスタを検出する。図６に示したログクラスタの葉クラスタはクラスタ６およびクラスタ１０である。これらのログクラスタに含まれるコンテンツクラスタとして、クラスタ１０からはコンテンツクラスタ［機種Ａ，機種Ａの機能Ａ］，［機種Ａ，機種Ａの機能Ａ，機種Ｂ］が検出される。以下、コンテンツクラスタ［機種Ａ，機種Ａの機能Ａ，機種Ｂ］をコンテンツクラスタ１、コンテンツクラスタ［機種Ａ，機種Ａの機能Ａ］をコンテンツクラスタ２と呼ぶことにする。なお、クラスタ６に含まれるコンテンツクラスタは検出されない。
【００４６】
Ｓ５４において、Ｓ５３で検出したコンテンツクラスタを含むログクラスタの葉クラスタを集合１に加える。コンテンツクラスタ１およびコンテンツクラスタ２はログクラスタのうちの葉クラスタ１０に含まれる。そのため、集合１にはログクラスタのうちクラスタ１０が加えられる。
【００４７】
Ｓ５５において、集合１が空であるか否かを判定し、空になった場合には処理を終了する。集合１が空でなければ以下の処理を行う。
【００４８】
Ｓ５６において、集合１からログクラスタの１つを取り出す。そしてＳ５７において、取り出したログクラスタから親ログクラスタへコンテンツクラスタを対応づけてみる。図９は、ログクラスタとコンテンツクラスタとの対応関係の説明図である。上述のように、集合１にはログクラスタのうちのクラスタ１０が含まれているので、このクラスタ１０を集合１から取り出す。ここで、クラスタ１０が含むコンテンツクラスタをコンテンツクラスタ１であるものとして図９に示している。そして、図９に矢線で示すように、クラスタ１０から親ログクラスタであるクラスタ８およびクラスタ９へ、コンテンツクラスタ１を対応づけてみる。
【００４９】
このとき、Ｓ５８において、親ログクラスタでは含まれないノードがコンテンツクラスタに存在するか否かを判定する。親ログクラスタでは含まれないノードがコンテンツクラスタに存在する場合には、Ｓ５９における判定に進む。図１０は、コンテンツクラスタとログクラスタの重なり方の一例の説明図である。この例では、クラスタ１０から親ログクラスタであるクラスタ９にコンテンツクラスタ１を対応づけた場合、図１０に示すようにコンテンツクラスタ１に存在するノード「機種Ｂ」がクラスタ９に含まれていない。そのため、Ｓ５９に進む。このＳ５８における判定において、ログクラスタとコンテンツクラスタとのずれを検出することができる。すなわち、ユーザによるノードのアクセスと、コンテンツの内容の類似性が一致していない部分の存在を検出することができる。
【００５０】
Ｓ５９において、コンテンツクラスタと親ログクラスタの関係が予め定められた形式であるか否かを判定する。図１１、図１２は、コンテンツクラスタとログクラスタのずれの一例の説明図である。予め定められた形式としては、例えば図１１や図１２に示した形式を定めておくことができる。図１１に示した形式は、ノード２の内容の一部についてノード３において詳しく説明しているような例であり、サイトを作った側としてはノード２から更にノード３へのアクセスを期待しているリンク形式である。しかし、実際にはノード３へのアクセスが行われていないことを示す。図１２に示した形式は、ノード２、ノード３ではノード１の内容について詳しく説明しているものであり、サイトを作った側としてはノード１、ノード２に同程度のアクセス数を期待しているか、あるいはどちらにユーザがアクセスするかによってユーザの選択点を知るのに重要となるリンク形式である。図１２に示す形式では、ノード２はアクセスされているものの、ノード３へのアクセスはほとんどないことを示している。ここで、図１０に示した例は、図１２に示すリンク形式であるので、親ログクラスタとコンテンツクラスタとは予め定められた形式であるものと判断される。
【００５１】
Ｓ５９において予め定められた形式であると判断された場合には、意図ずれが生じているものとして、Ｓ６１において、評価値を計算し、管理者などに提示する。評価値は、例えば、親クラスに分類されるアクセス件数÷子クラスタに分類されるアクセス件数で表すことができる。評価値が大きいほど、コンテンツを変更することでユーザのアクセス数を増やすことのできる可能性が高いこと、あるいは、ユーザの選択がはっきりしていることを示している。図６あるいは図９に示したように、クラスタ９にアクセス傾向が分類されるアクセス数は５０件、クラス１０にアクセス傾向が分類されるアクセス数は３０件であるので、評価値は１．７となる。
【００５２】
図１３は、提示される評価結果の一例の説明図である。上述のようにして計算した評価値を用い、例えば図１３に示すような形式で管理者などに提示することができる。サイトを変更すべきポイントとして、ログクラスタから外れたコンテンツクラスタ内のノード、該ノードへのリンク元のノードを強調して表示し、ログクラスタに含まれるノードの構造と該ログクラスタに含まれなくなったノードの関係を示す。このとき、各ノードはノードの内容をＷｅｂブラウザと同じ形式で提示するか、あるいは表示サイズが限られる場合には、ページのタイトルのみで提示する。なお、図１３においては、図示の都合上、強調表示にはハッチングを付して示している。また、表示方法は任意であり、図１３に示す形態にとらわれることなく、わかりやすいように表示させることができる。
【００５３】
上述の説明では、Ｓ５７においてクラスタ１０の親ログクラスタとしてクラスタ９へコンテンツクラスタを対応づけた場合について説明した。クラスタ１０の親ログクラスタとして、もう一つの親ログクラスタであるクラスタ８にコンテンツクラスタを対応づけた場合について説明する。この場合、コンテンツクラスタ１に含まれるノード「機種Ａの機能Ａ」がクラスタ８には含まれていないため、Ｓ５８からＳ５９へ進む。クラスタ８とコンテンツクラスタ１との関係は図１１に示すリンク形式に相当するので、Ｓ６１において評価値が計算され、評価結果が提示される。図１４は、提示される評価結果の別の例の説明図である。この場合、ノード「機種Ａ」とノード「機種Ａの機能Ａ」とが強調表示され、その間のリンクに評価値を表示している。これによって、サイトの変更ポイントを管理者などに示すことができる。この場合も、表示形態は任意である。
【００５４】
Ｓ５８においてコンテンツクラスタのノードが親クラスタにすべて含まれる場合や、Ｓ５９において、親クラスタとコンテンツクラスタとの関係が所定の形式ではない場合には、評価値の算出や評価結果の表示を行わずに、Ｓ６０へ進む。また、Ｓ６１において評価値の算出および評価結果の表示を行った後も、Ｓ６０へ進む。
【００５５】
Ｓ６０において、親クラスタがまだコンテンツクラスタを含むか否かを判定し、含む場合にはＳ６２で親クラスタを集合１に加える。クラスタ１０の親クラスタであるクラスタ９は、コンテンツクラスタ２［機種Ａの機能Ａ，機種Ａ］を含むので、クラスタ９が集合１に加えられる。なお、もう一つのクラスタ１０の親クラスタであるクラスタ８は、コンテンツクラスタを含まないので、集合１には加えられない。
【００５６】
そして、Ｓ５５へ戻る。この例では、集合１にはクラスタ９がまだ存在する。そのため、Ｓ５６以降の処理を繰り返すことになる。クラスタ９の親ログクラスタであるクラスタ４にコンテンツクラスタ２を対応づけるが、クラスタ４にはノード「機種Ａの機能Ａ」が含まれない。この場合の親ログクラスタとコンテンツクラスタとの関係は図１１に示した形式であるので、Ｓ６１において評価値を算出し、評価結果の表示を行う。
【００５７】
このクラスタ４はコンテンツクラスタを含まないため、集合１へ加えられず、Ｓ５５に戻って集合１が空となり、処理を終了する。
【００５８】
このようにして、コンテンツの内容からサイトを作った側として同様のアクセスを意図したノードと、実際にアクセスされたノードとのずれを検出して表示することができる。例えば図１３に示したような表示を参照することで、管理者は機種Ａの人気が高いことを知ることができる。あるいは、機種Ｂについてのアクセスを増加させるような対策を講じることができる。また、例えば図１４に示すような表示を参照することによって、機種Ａの機能Ａについてもっとアクセスしてもらえるような対策を講じることができる。
【００５９】
上述の実施の形態は、コンピュータプログラムによっても実現することが可能である。その場合、そのプログラムおよびそのプログラムが用いるデータなどは、コンピュータが読み取り可能な記憶媒体に記憶することも可能である。記憶媒体とは、コンピュータのハードウェア資源に備えられている読取装置に対して、プログラムの記述内容に応じて、磁気、光、電気等のエネルギーの変化状態を引き起こして、それに対応する信号の形式で、読取装置にプログラムの記述内容を伝達できるものである。例えば、磁気ディスク、光ディスク、ＣＤ−ＲＯＭ、コンピュータに内蔵されるメモリ等である。
【００６０】
【発明の効果】
以上の説明から明らかなように、本発明によれば、ユーザのアクセス傾向とＷｅｂサイトのコンテンツ作成者の意図とのずれを検出し、Ｗｅｂサイトのコンテンツの変更を行う際の支援を行うことができ、ユーザの動向にあわせて速やかにＷｅｂサイトの維持、管理を行うことができるという効果がある。
【図面の簡単な説明】
【図１】本発明のハイパーテキスト構造変更支援装置の実施の一形態を示すブロック図である。
【図２】本発明のハイパーテキスト構造変更支援装置の実施の一形態におけるログクラスタリング部の一例を示すブロック図である。
【図３】ログクラスタリング部２における動作の一例を示すフローチャートである。
【図４】Ｗｅｂサーバ１に貯えられているコンテンツの構造の具体例の説明図である。
【図５】クラスタ候補がクラスタ基準を満たすか否かを判定する処理の一例を示すフローチャートである。
【図６】ログクラスタの一例の説明図である。
【図７】コンテンツクラスタの一例の説明図である。
【図８】意図ずれ検出部４および意図ずれ提示部５における動作の一例を示すフローチャートである。
【図９】ログクラスタとコンテンツクラスタとの対応関係の説明図である。
【図１０】コンテンツクラスタとログクラスタの重なり方の一例の説明図である。
【図１１】コンテンツクラスタとログクラスタのずれの一例の説明図である。
【図１２】コンテンツクラスタとログクラスタのずれの別の例の説明図である。
【図１３】提示される評価結果の一例の説明図である。
【図１４】提示される評価結果の別の例の説明図である。
【図１５】Ｗｅｂサイト構造とユーザのアクセス経路の一例の説明図である。
【図１６】Ｗｅｂサイト構造とノード間のユーザの遷移数の一例の説明図である。
【符号の説明】
１…Ｗｅｂサーバ、２…ログクラスタリング部、３…コンテンツクラスタリング部、４…意図ずれ検出部、５…意図ずれ提示部、１１…ログ記録部、１２…コンテンツ提供部、２１…アクセス情報記憶部、２２…アクセス数カウント部、２３…クラスタ構成部、２４…クラスタ生成高速化部、２５…クラスタ生成制限部。
[0001]
[Technical field to which the invention belongs]
The present invention relates to an apparatus and method that support changes to the hypertext structure of a website that provides information on the Internet, based on the similarity between users' access tendencies and the distribution tendency of the content (hypertext information) published on the website, and to a storage medium storing a support program for implementing the support apparatus on a computer.
[0002]
[Prior art]
In recent years, demand for providing information on the World Wide Web (hereinafter abbreviated as the Web) has been increasing. Companies also maintain websites for providing information on the Web, and increasingly use them to advertise their products and to buy and sell goods and services through electronic commerce and similar means. Under these circumstances, demand is growing for technology that tracks the behavior of users who access the Web and quickly changes the structure of the information provided to match that behavior.
[0003]
Information provided on a website is called content and is expressed in hypertext (for example, HTML). Individual items of hypertext information (hereinafter referred to as nodes) are connected by links. A user who wants to access a website can reach the nodes in the website by following links and obtain various information. The Web server can also record the history of users' accesses to it. The recorded content is hereinafter referred to as a log. For example, the log records the IP address of the computer that accessed the information (the computer the user is using to access the Web server, or a computer relaying the user's usage information), the time of access, and the command sent to the Web server to access the information. The command includes the identifier (for example, the URL) that the accessed node has on the server.
[0004]
As a conventional technique for learning user behavior from an access log, there is, for example, the technique described in Ming-Syan et al., “Data Mining for Path Traversal Patterns on the Web”, IEEE Trans. on Knowledge and Data Engineering, 1998. The technique described in this document extracts sequences called MFs (Maximal Forward references). FIG. 15 is an explanatory diagram of an example of a website structure and a user's access path. In the example shown in FIG. 15, nodes A to H are assumed to be linked as illustrated. Suppose that a user browses the nodes in the order A → B → C → D → E → F → G, as indicated by the bold line in the figure. In this case, the MFs [ABCD], [ABE], and [AFG] are obtained. Users' access patterns can be obtained by finding the MFs for every user who accessed the site and counting the number of occurrences of the subsequences of the MFs.
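The MF extraction described above can be illustrated with a minimal sketch. It assumes that the click sequence also lists the revisits a user makes with the browser's back function; FIG. 15 is not reproduced here, so the backward steps in the sample input are inferred from the three MFs listed in the text. The function name and input format are hypothetical and are not taken from the cited paper.

```python
def maximal_forward_references(clicks):
    """Split one user's click sequence into maximal forward references.

    Assumption of this sketch: `clicks` lists every page view in order,
    including revisits made with the browser's back function.
    """
    path = []               # current forward path from the first node
    results = []
    moving_forward = False
    for node in clicks:
        if node in path:
            # Backward reference: the path so far is maximal only if the
            # previous move was a forward one.
            if moving_forward:
                results.append(tuple(path))
            del path[path.index(node) + 1:]   # cut the path back to `node`
            moving_forward = False
        else:
            path.append(node)
            moving_forward = True
    if moving_forward:
        results.append(tuple(path))
    return results


# Hypothetical full click sequence consistent with the MFs in the text.
mfs = maximal_forward_references(list("ABCDCBEBAFG"))
print(mfs)
```

Because each return to an earlier node starts a new sequence, the nodes one user visited end up spread over several MFs, which is the limitation the next paragraph points out.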
[0005]
With this technique, it is possible to obtain a linear sequence of the nodes through which a user followed links. However, a sequence that the user retraces starting from a node to which the user has returned is obtained as a separate sequence. It is therefore difficult to obtain the nodes a user accessed as a single group.
[0006]
As another conventional technique, there is, for example, the technique described in J. Borges et al., “Mining Association Rules in Hypertext Databases”, In Proc. of KDD-98. This technique extracts from the log the number of user transitions between nodes connected by links and obtains association rules for the node sequences that users accessed. Because it looks only at the number of user transitions between nodes, there is no guarantee that the same user actually followed the whole sequence. FIG. 16 is an explanatory diagram of an example of a website structure and the number of user transitions between nodes. In the figure, the direction in which a link was followed is indicated by an arrow, and the number of times it was followed is indicated by a numerical value. According to this conventional technique, in the example shown in FIG. 16, a correlation is obtained such that, for example, users who followed the link B → A tend to follow the link A → F. However, this does not show that the same user visited all of the nodes B, A, and F, so it cannot be said to reflect users' access behavior accurately.
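The limitation described above can be seen in a small sketch. The session data below is invented for illustration and does not reproduce the counts in FIG. 16; the function name is hypothetical. Transition counts alone suggest a B → A → F pattern even though no single user visited all three nodes.

```python
from collections import Counter


def transition_counts(sessions):
    """Count how often each link (src, dst) was followed, ignoring who followed it."""
    counts = Counter()
    for nodes in sessions.values():
        for src, dst in zip(nodes, nodes[1:]):
            counts[(src, dst)] += 1
    return counts


# Invented sessions: two users go B -> A, two other users go A -> F.
sessions = {
    "u1": ["B", "A"],
    "u2": ["A", "F"],
    "u3": ["B", "A"],
    "u4": ["A", "F"],
}
counts = transition_counts(sessions)
visited_all = [u for u, nodes in sessions.items() if {"B", "A", "F"} <= set(nodes)]
print(counts[("B", "A")], counts[("A", "F")], visited_all)
```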
[0007]
In addition, although the two conventional techniques above can identify frequently used access paths by deriving user access patterns from the log, they do not process the content together with the log in an integrated way. It has therefore been difficult to support the maintenance and management of a website, for example by deciding which content should be changed on the basis of the access patterns.
[0008]
[Problems to be solved by the invention]
The present invention has been made in view of the circumstances described above. Its object is to provide a hypertext structure change support device and method that, by comparing users' access tendencies with the similarity between content distribution tendencies, make it possible to grasp user behavior and to support changing website content to match that behavior, together with a storage medium storing a support program for implementing the support device on a computer.
[0009]
[Means for Solving the Problems]
The present invention generates hierarchical log clusters from a set of hypertexts based on information on how frequently users access a group of hypertext information managed by an information system. It also calculates the similarity between the individual hypertexts in the group and generates content clusters from the set of hypertexts based on that similarity. It then identifies the substructures that cannot be matched between the log clusters and the content clusters, and presents the hypertexts that are elements of the content clusters that could not be matched. When a log cluster and a content cluster cannot be matched, it may be, for example, that hypertexts with similar content are referenced by users at clearly different frequencies, which suggests that the website structure is inappropriate or that links are not provided appropriately. Presenting such hypertexts prompts a change to the hypertext structure and thereby supports the structure change.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a block diagram showing an embodiment of the hypertext structure change support device of the present invention. In the figure, 1 is a Web server, 2 is a log clustering unit, 3 is a content clustering unit, 4 is an intention deviation detection unit, 5 is an intention deviation presentation unit, 11 is a log recording unit, and 12 is a content providing unit.
[0011]
The Web server 1 stores the information (content) to be provided to users and publishes it on the network. The Web server 1 includes a log recording unit 11 and a content providing unit 12. The content providing unit 12 provides content in response to user access. Each time a user accesses the server, the log recording unit 11 records a user identifier (IP address) for identifying the user, the time of access, and the address (URL) of the hypertext (hereinafter referred to as a node) that the user accessed. The IP address recorded by the Web server 1 may be that of the client computer the user is actually using, or that of a proxy server that accesses the server on behalf of client computers. In the latter case, a plurality of users share the same IP address. In the following description, users are identified by the IP address of the computer that accessed the Web server 1 and the date. That is, accesses from the same IP address on different dates are identified as coming from different users.
[0012]
The log used in the present invention is not limited to the one described above. For example, as described in Japanese Patent Laid-Open No. 10-224349, it may be a log in which users are identified by additionally using proxy server logs. A log recorded by a mechanism that has an access notification facility on the client side, such as the Java applet described in Japanese Patent Laid-Open No. 10-207838, may also be used, provided that the user can be identified and that the location of the accessed information and the order of access can be determined.
[0013]
The log clustering unit 2 acquires, from the Web server 1, the log and the link structure formed by the nodes and links, and clusters the nodes based on the log to form log clusters. Details of the log clustering unit 2 will be described later.
[0014]
The content clustering unit 3 acquires content from the Web server 1 and performs clustering based on, for example, content similarity to form a content cluster. The content cluster is formed without referring to the link structure.
[0015]
The intention deviation detection unit 4 associates the content clusters obtained by the content clustering unit 3 with the log clusters obtained by the log clustering unit 2. Content clusters that cannot be associated are detected as cases in which the intent may have deviated.
[0016]
The intention deviation presentation unit 5 presents, to the administrator of the Web server 1 or a similar person, the hypertexts that are elements of the content clusters that the intention deviation detection unit 4 could not associate and therefore detected as cases in which the intent may have deviated.
[0017]
FIG. 2 is a block diagram showing an example of a log clustering unit in the embodiment of the hypertext structure change support device of the present invention. In the figure, 21 is an access information storage unit, 22 is an access number counting unit, 23 is a cluster configuration unit, 24 is a cluster generation acceleration unit, and 25 is a cluster generation restriction unit.
[0018]
From the log information, the access information storage unit 21 stores pairs consisting of an identifier and a chronological record of the nodes accessed by the user with that identifier. As described above, identifiers are assigned so that accesses from the same IP address on different dates are treated as different users.
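As a minimal sketch of the stored structure, the following groups log records by an identifier made of the IP address and the access date, and keeps the accessed URLs in chronological order. The record format (tuples with ISO 8601 timestamps) and the function name are assumptions made for illustration; the patent does not specify a log format.

```python
from collections import defaultdict
from datetime import datetime


def group_by_identifier(records):
    """Map (ip, date) identifiers to the URLs accessed, in chronological order.

    `records` is an iterable of (ip, iso_timestamp, url) tuples
    (a hypothetical format chosen for this sketch).
    """
    histories = defaultdict(list)
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r[1]))
    for ip, timestamp, url in ordered:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        histories[(ip, day)].append(url)
    return dict(histories)


records = [
    ("10.0.0.1", "1998-12-04T10:05:00", "/printer"),
    ("10.0.0.1", "1998-12-04T10:00:00", "/"),
    ("10.0.0.1", "1998-12-05T09:00:00", "/"),
]
histories = group_by_identifier(records)
print(histories)
```

The same IP address appears under two identifiers here because the accesses fall on different dates.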
[0019]
The access number counting unit 22 uses the information stored in the access information storage unit 21 to calculate the number of identifiers whose history shows access to every hypertext in a given set in a way that satisfies a given order constraint.
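The counting step might look like the sketch below. It assumes one particular order constraint, namely that the starting node is the first node accessed, which is the constraint used later in step S43; other constraints are possible. The function name and the sample histories are hypothetical.

```python
def count_accesses(histories, nodes, start):
    """Count identifiers that visited every node in `nodes` and accessed `start` first.

    `histories` maps an identifier to its chronological list of nodes.
    """
    required = set(nodes)
    return sum(
        1
        for visited in histories.values()
        if visited and visited[0] == start and required <= set(visited)
    )


histories = {
    "u1": ["top", "printer", "model A"],
    "u2": ["printer", "top"],          # did not start at the top page
    "u3": ["top", "copy"],
}
print(count_accesses(histories, ["top", "printer"], "top"))
```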
[0020]
The cluster configuration unit 23 takes a cluster consisting only of a predetermined starting node as the top of the hierarchy. From the set of nodes that can be reached by following one link from the starting node and for which logs are acquired, it takes the nodes one at a time and pairs each with the starting node; the resulting set of clusters forms the new cluster candidates one level down. When the access number counting unit 22 calculates that a generated cluster candidate has an access count equal to or greater than a predetermined threshold, the candidate is generated as a new cluster that is a child cluster of the starting node. For each generated child cluster, the unit then takes, one at a time, the nodes that can be reached by following one link from the nodes in the child cluster, for which logs are acquired, and that are not already in the child cluster, and adds each of them to the node set of the child cluster to generate cluster candidates one more level down. When a cluster candidate is calculated to have an access count equal to or greater than the threshold, a new child cluster is generated with the child cluster as its parent, and this process is repeated until no new child cluster is generated. In this way, hierarchical clusters are generated.
[0021]
The cluster generation acceleration unit 24 operates while the cluster configuration unit 23 generates the hierarchical clusters. If a higher-level cluster whose node set is a subset of the node set of a newly generated cluster candidate has a user access count equal to or less than the predetermined threshold, the unit prevents that cluster candidate from being generated. Cluster generation beyond the suppressed candidate is thereby cut off, which shortens the processing time.
[0022]
For predetermined nodes, the cluster generation restriction unit 25 ensures that, when new cluster candidates are generated, the set of nodes that can be reached by following one link only from such a node is not used to generate candidates. This limits the number of child cluster candidates that are generated.
[0023]
Next, an example of the operation of an embodiment of the hypertext structure change support device of the present invention will be described. FIG. 3 is a flowchart showing an example of the operation of the log clustering unit 2. FIG. 4 is an explanatory diagram of a specific example of the structure of the content stored in the Web server 1. The following description assumes that the content shown in FIG. 4 is related by the links shown in the figure, and refers to FIG. 4 throughout. The main processing shown in FIG. 3 is performed in the cluster configuration unit 23.
[0024]
First, in S31, the access count threshold, which concerns the number of users who visited all the nodes in a cluster, is read. In S32, the non-inherited node set, which lists the nodes whose links are not followed when clusters are generated, is read. In this example, the access count threshold is given in advance as 20 and the non-inherited node set as [site entrance]. Sets are written enclosed in '[' and ']'.
[0025]
In S33, a link table in which the link relationships among the nodes on the website have been extracted in advance is read from the Web server 1. When this link table is generated, links to nodes on other sites, for which the website cannot acquire logs, are removed.
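A minimal sketch of building such a link table follows. It assumes that the links on each page have already been extracted as href strings and that a node belongs to the site when its host name matches; both are assumptions of this sketch, as are the function name and the example URLs.

```python
from urllib.parse import urljoin, urlparse


def build_link_table(pages, site_host):
    """Return {page_url: [linked same-site URLs]}, dropping links to other sites.

    `pages` maps a page URL to the href values found on that page.
    """
    table = {}
    for page_url, hrefs in pages.items():
        targets = []
        for href in hrefs:
            absolute = urljoin(page_url, href)      # resolve relative links
            same_site = urlparse(absolute).netloc == site_host
            if same_site and absolute not in targets:
                targets.append(absolute)
        table[page_url] = targets
    return table


pages = {
    "http://example.com/": ["printer.html", "http://other.example.org/x", "/copy.html"],
}
print(build_link_table(pages, "example.com"))
```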
[0026]
At the same time, in S34, the starting node given in advance is read. In this example, the top page “site entrance” is used as the starting node of the website having the structure shown in FIG. 4. From the top page, every node published by the website can be reached by following links.
[0027]
Next, in S35, a cluster consisting only of the starting node is generated as the initial cluster and added to the cluster set. In the example shown in FIG. 4, the cluster set becomes [site entrance].
[0028]
In S36, it is determined whether or not the cluster set is empty. If the cluster set is empty, it indicates that the processing has been completed, and thus the process proceeds to S39, where the cluster set generated so far is output as a log cluster, and the processing of the log clustering unit 2 is completed. If the cluster set is not empty, the process proceeds to S37 to generate a cluster candidate. In the initial state, since an initial cluster has been generated, the process proceeds to cluster candidate generation processing in S37.
[0029]
In S37, a cluster is taken out of the cluster set, and cluster candidates are generated by combining the cluster with the nodes linked from the nodes it contains. In the example shown in FIG. 4, the node “site entrance” is linked to the nodes “printer” and “copy”, so a child cluster whose node set is [site entrance, printer] and a child cluster whose node set is [site entrance, copy] are generated as new cluster candidates.
[0030]
Next, in S38, the generated cluster candidates are checked against the criteria for becoming a cluster, and those that satisfy the criteria are selected. FIG. 5 is a flowchart showing an example of the processing for determining whether a cluster candidate satisfies the cluster criteria. In S41, it is determined whether a cluster with the same node set already exists. If such a cluster exists, the candidate is excluded.
[0031]
Next, in S42, it is determined whether a cluster whose node set is a subset of the cluster candidate's node set has previously been excluded for not satisfying the cluster criteria. If such a cluster has been excluded in the past, the cluster candidate is also excluded. This processing is performed in the cluster generation acceleration unit 24.
[0032]
Next, in S43, the access number counting unit 22 is used to count the number of users (the number of user identifiers) who accessed every node in the cluster's node set and accessed the starting node first. If that number is less than the access count threshold read in S31, the candidate is excluded. The users may also be counted, for example, by counting only those whose order of access to the nodes also matches. In that case, the node set of a cluster also carries order information, and two clusters are regarded as the same only when their order information also matches.
[0033]
In S38, cluster candidates that pass the determination criteria shown in FIG. 5 are added to the cluster set. Then, the process returns to S36.
[0034]
Returning to S36, if the cluster set is not empty, the processes of S37 and S38 are repeated. At this time, if the newly generated cluster set contains a plurality of clusters, the cluster with the smallest cluster number is taken out and cluster candidates are generated from it in the same manner. From then on, cluster candidates are generated using the nodes that can be reached by following one link from any of the nodes included in the cluster. This is because a user may return to an earlier node with the back function of the browser and access another page from there.
[0035]
However, for nodes that serve as indexes in the contents of the Web server, the cluster generation restriction unit 25 can restrict generation so that their link destinations are not used for generating child cluster candidates, thereby preventing an increase in the number of combinations. In the example shown in FIG. 4, assuming that the node “site entrance” is an index, the expansion from the cluster [site entrance] to the cluster [copy, printer] is not performed when subsequent clusters are generated. Index nodes for which clusters are not to be generated can also be specified by explicit input from the user. Furthermore, because the log clusters are matched here against clusters based on the similarity of the contents, no corresponding content cluster is assigned in the intention deviation detection unit 4 when the similarity between contents is low, so clusters containing such nodes need not be generated. It is therefore possible to calculate in advance the similarity between the nodes that are linked from a given node, detect the nodes for which the average similarity is below a threshold, and restrict cluster generation by treating the detected nodes as index nodes.
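The automatic detection of index nodes described above might look like the following sketch. All names and similarity values are hypothetical, and the links from “printer” are assumed for illustration only. The comparison uses “equal to or less than”, following the wording of the claims; the description itself says “below”.

```python
from itertools import combinations


def detect_index_nodes(links, similarity, threshold):
    """Return link-source nodes whose link destinations are, on average, dissimilar."""
    index_nodes = set()
    for source, destinations in links.items():
        pairs = list(combinations(sorted(set(destinations)), 2))
        if not pairs:
            continue                      # fewer than two destinations: no average
        average = sum(similarity(a, b) for a, b in pairs) / len(pairs)
        if average <= threshold:
            index_nodes.add(source)
    return index_nodes


# Hypothetical pairwise similarities between link destinations.
table = {
    frozenset({"printer", "copy"}): 0.1,
    frozenset({"model A", "model B"}): 0.8,
}


def similarity(a, b):
    return table[frozenset({a, b})]


links = {
    "site entrance": ["printer", "copy"],
    "printer": ["model A", "model B"],     # assumed links, for illustration
}
index_nodes = detect_index_nodes(links, similarity, threshold=0.3)
print(index_nodes)
```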
[0036]
Further, when a cluster candidate is generated in S37, a child cluster having the same node set may already have been generated from another parent cluster. In that case, instead of generating the child cluster again, a relationship as a child cluster of the parent cluster that attempted to generate it can be added to the already generated child cluster.
[0037]
FIG. 6 is an explanatory diagram of an example of a log cluster. By the processing in the log clustering unit 2 as described above, a log cluster as shown in FIG. 6 is output from the content structure shown in FIG. 4. That is, after cluster 1 [site entrance] is generated as the initial cluster, cluster 2 [site entrance, printer] and cluster 3 [site entrance, copy] are generated from the nodes “printer” and “copy”, which are reached by following one link from the node “site entrance”. Similarly, cluster 4 and cluster 5 are generated from cluster 2, and cluster 6 is generated from cluster 3. Cluster candidate 1 [site entrance, copy, model D], which is a cluster candidate based on cluster 3, does not become a cluster because its number of accesses is less than 20.
[0038]
From cluster 4, cluster 8 and cluster 9 are generated. At this time, the same cluster as cluster 8 is generated from cluster 5 as a child cluster candidate. In this case, as described above, a relationship as a child cluster of cluster 5 is also generated for the already generated cluster 8. Similarly, cluster 10 is generated from cluster 8, and a relationship as a child cluster of cluster 9 is also generated for it.
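The sharing of one child cluster by several parent clusters can be illustrated with a small data structure. This is a sketch only: the class `ClusterDag` is hypothetical, and the node sets of clusters 4, 5 and 8 are inferred from the description rather than stated in it.

```python
from collections import defaultdict


class ClusterDag:
    """Parent-child relationships between clusters, keyed by node set."""

    def __init__(self):
        self.known = set()                   # node sets that are already clusters
        self.parents = defaultdict(set)      # child node set -> parent node sets
        self.children = defaultdict(set)     # parent node set -> child node sets

    def link_child(self, parent, child):
        """Register child under parent; return True only if the child is new."""
        is_new = child not in self.known
        self.known.add(child)
        self.parents[child].add(parent)
        self.children[parent].add(child)
        return is_new


base = {"site entrance", "printer"}
cluster4 = frozenset(base | {"model A"})             # inferred node set
cluster5 = frozenset(base | {"model B"})             # inferred node set
cluster8 = frozenset(base | {"model A", "model B"})  # inferred node set

dag = ClusterDag()
dag.known.update({cluster4, cluster5})
first = dag.link_child(cluster4, cluster8)    # cluster 8 is created here
second = dag.link_child(cluster5, cluster8)   # only the relationship is added
print(first, second, len(dag.parents[cluster8]))
```

Because a cluster may have more than one parent, the result is a directed acyclic graph rather than a tree.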
[0039]
From cluster 6, cluster candidate 2 having the node set [site entrance, copy, model C, model D] can be generated. However, cluster candidate 1, whose node set is a subset of this node set, previously failed to meet the criteria, so cluster candidate 2 is removed by the condition of S42 in FIG. 5.
[0040]
The clusters generated in this way have a directed acyclic graph (DAG) structure as shown in FIG. 6, and relationships indicating which clusters are the child clusters and which are the parent clusters of each cluster have been generated. Because the generated log clusters form a hierarchy with parent-child relationships, the structure shown in FIG. 6 resembles that shown in FIG. 4. Note, however, that the log clusters are formed as a structure separate from the content structure.
[0041]
Next, processing for generating a content cluster by the content clustering unit 3 will be described. The content clustering unit 3 generates a profile for each node using, for example, the method disclosed in Japanese Patent Application No. 9-153387, calculates the similarity for every pair of nodes, and stores it. Based on the calculated similarity, clustering is performed using a known method. As a clustering technique, for example, the complete-link method among the hierarchical clustering algorithms described in Ellen M. Voorhees, “Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval”, Information Processing & Management, Vol. 22, No. 6, pp. 465-476, 1986, can be used.
[0042]
FIG. 7 is an explanatory diagram of an example of a content cluster. When contents as shown in FIG. 4 are stored in the Web server 1, the content cluster structure shown in FIG. 7 is obtained by hierarchically clustering these contents. That is, since the contents of the node “model A” and the node “function A of model A” are similar, a cluster [model A, function A of model A] is obtained. Further, since these nodes are similar to the node “model B”, a cluster [model A, function A of model A, model B] is obtained. On the other hand, since the contents of the node “model C” and the node “model D” are similar, a cluster [model C, model D] is obtained. Finally, since all of these nodes are similar, a cluster [model A, function A of model A, model B, model C, model D] is obtained. In this way, four content clusters are obtained: [model A, function A of model A], [model A, function A of model A, model B], [model C, model D], and [model A, function A of model A, model B, model C, model D].
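For illustration, complete-link agglomerative clustering can be sketched in a few lines. This is a naive version written for readability, not the algorithm of the cited paper and not the disclosed implementation; the similarity values are hypothetical and were chosen only so that the merge order matches the clusters listed for FIG. 7.

```python
from itertools import combinations


def complete_link(nodes, sim):
    """Merge clusters greedily; cluster similarity is the minimum pairwise similarity."""
    clusters = [frozenset([n]) for n in nodes]
    merges = []

    def link(pair):
        a, b = pair
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=link)   # most similar pair
        merged = a | b
        clusters = [c for c in clusters if c != a and c != b] + [merged]
        merges.append(merged)
    return merges


A, FA, B, C, D = "model A", "function A of model A", "model B", "model C", "model D"
# Hypothetical similarities; unlisted pairs default to 0.2.
table = {
    frozenset({A, FA}): 0.9,
    frozenset({A, B}): 0.7,
    frozenset({FA, B}): 0.6,
    frozenset({C, D}): 0.5,
}


def sim(x, y):
    return table.get(frozenset({x, y}), 0.2)


merges = complete_link([A, FA, B, C, D], sim)
for cluster in merges:
    print(sorted(cluster))
```

Each merge step corresponds to one content cluster, so the list `merges` plays the role of the content cluster set used in the later matching.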
[0043]
Note that the content clustering unit 3 does not consider the links between the nodes and performs clustering based on, for example, the similarity of the contents. In the example shown in FIG. 4, nodes with similar contents happen to be connected by links, but nodes that are far apart in the link structure may also be combined into one cluster by content clustering. A node that for some reason has no link can likewise be included in a content cluster.
[0044]
FIG. 8 is a flowchart showing an example of operations in the intention deviation detection unit 4 and the intention deviation presentation unit 5. First, in S51 and S52, the log cluster and the content cluster generated as described above are read. Here, it is assumed that the log cluster shown in FIG. 6 and the content cluster shown in FIG. 7 are read.
[0045]
Next, in S53, the content clusters included in the leaf clusters of the log cluster are detected. A leaf cluster is a cluster from which no cluster with a node newly added to its node set has been generated. The leaf clusters of the log cluster shown in FIG. 6 include cluster 6 and cluster 10. From cluster 10, the content clusters [model A, function A of model A] and [model A, function A of model A, model B] are detected as content clusters included in these log clusters. Hereinafter, the content cluster [model A, function A of model A, model B] is referred to as content cluster 1, and the content cluster [model A, function A of model A] is referred to as content cluster 2. Note that no content cluster included in cluster 6 is detected.
[0046]
In S54, the leaf cluster of the log cluster including the content cluster detected in S53 is added to the set 1. Content cluster 1 and content cluster 2 are included in leaf cluster 10 of the log clusters. Therefore, cluster 10 is added to set 1 among the log clusters.
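Steps S53 and S54 amount to a subset test between node sets, as the following sketch shows. The function name is hypothetical, and the node sets given for clusters 6 and 10 are inferred from the description, not quoted from it.

```python
def leaves_with_content(leaf_clusters, content_clusters):
    """Map each leaf log cluster to the content clusters it fully contains (S53)."""
    found = {}
    for name, nodes in leaf_clusters.items():
        contained = [c for c in content_clusters if c <= nodes]
        if contained:
            found[name] = contained
    return found


A, FA, B, C, D = "model A", "function A of model A", "model B", "model C", "model D"
content_clusters = [
    frozenset({A, FA}),                  # content cluster 2
    frozenset({A, FA, B}),               # content cluster 1
    frozenset({C, D}),
    frozenset({A, FA, B, C, D}),
]
leaf_clusters = {                        # node sets inferred from the description
    "cluster 6": frozenset({"site entrance", "copy", C}),
    "cluster 10": frozenset({"site entrance", "printer", A, FA, B}),
}

found = leaves_with_content(leaf_clusters, content_clusters)
set1 = list(found)                       # S54: leaf clusters that contain a content cluster
print(set1)
```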
[0047]
In S55, it is determined whether or not the set 1 is empty. If the set 1 is empty, the process ends. If set 1 is not empty, the following processing is performed.
[0048]
In S56, one log cluster is taken out of the set 1. In S57, the content cluster is associated with the parent log clusters of the extracted log cluster. FIG. 9 is an explanatory diagram of the correspondence between log clusters and content clusters. As described above, the set 1 contains cluster 10 of the log clusters, so cluster 10 is taken out of the set 1. Here, the content cluster included in cluster 10 is shown as content cluster 1 in FIG. 9. Then, as indicated by the arrows in FIG. 9, content cluster 1 is associated, from cluster 10, with clusters 8 and 9, which are its parent log clusters.
[0049]
At this time, in S58, it is determined whether or not the content cluster contains a node that is not included in the parent log cluster. If such a node exists, the process proceeds to the determination in S59. FIG. 10 is an explanatory diagram of an example of how content clusters and log clusters overlap. In this example, when content cluster 1 is associated, from cluster 10, with cluster 9, which is a parent log cluster, the node “model B” of content cluster 1 is not included in cluster 9, as shown in FIG. 10. The process therefore proceeds to S59. The determination in S58 detects a deviation between the log cluster and the content cluster; that is, it detects the presence of a portion where the users' access to the nodes does not match the similarity of the contents.
[0050]
In S59, it is determined whether or not the relationship between the content cluster and the parent log cluster is in a predetermined format. FIG. 11 and FIG. 12 are explanatory diagrams of examples of deviation between a content cluster and a log cluster. As the predetermined format, for example, the formats shown in FIGS. 11 and 12 can be used. The format shown in FIG. 11 is an example in which part of the contents of node 2 is described in detail in node 3. It is a link format in which the side that created the site expects users to proceed from node 2 to node 3, but which indicates that node 3 is not actually accessed. In the format shown in FIG. 12, the contents of node 1 are explained in detail in node 2 and node 3, and the side that created the site expects the same number of accesses to node 1 and node 2. This is a link format in which whether or not the user accesses a node is important for learning the user's selection point. The format shown in FIG. 12 indicates that node 2 is being accessed but node 3 is hardly accessed. Here, the example shown in FIG. 10 corresponds to the link format shown in FIG. 12, so the relationship between the parent log cluster and the content cluster is determined to be in a predetermined format.
[0051]
If it is determined in S59 that the relationship is in a predetermined format, an intention deviation is regarded as having occurred, and in S61 an evaluation value is calculated and presented to an administrator or the like. The evaluation value can be expressed, for example, as (the number of accesses classified into the parent cluster) / (the number of accesses classified into the child cluster). The larger the evaluation value, the higher the possibility that the number of user accesses can be increased by changing the content, or the clearer the user's selection. As shown in FIG. 6 or FIG. 9, the number of accesses whose access tendency is classified into cluster 9 is 50 and the number classified into cluster 10 is 30, so the evaluation value is 50/30, or approximately 1.7.
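The evaluation value for the example above can be reproduced with a one-line ratio. The function name is hypothetical; the access counts 50 and 30 are those given for cluster 9 and cluster 10.

```python
def evaluation_value(parent_accesses, child_accesses):
    """Ratio of accesses classified into the parent cluster to those in the child cluster."""
    if child_accesses <= 0:
        raise ValueError("the child cluster must have at least one access")
    return parent_accesses / child_accesses


value = evaluation_value(50, 30)   # cluster 9 (parent) and cluster 10 (child)
print(f"{value:.1f}")              # rounded to one decimal place for display
```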
[0052]
FIG. 13 is an explanatory diagram of an example of the presented evaluation result. The evaluation value calculated as described above can be presented to the administrator or the like in, for example, the format shown in FIG. 13. As the points of the site to be changed, the nodes of the content cluster that fall outside the log cluster and the link-source nodes of those nodes are highlighted, and the structure of the nodes included in the log cluster and their relationship to the nodes not included in it are shown. At this time, each node is presented with its contents in the same format as in a Web browser, or with only the page title when the display size is limited. In FIG. 13, for convenience of illustration, the highlighted portions are hatched. The display method is arbitrary and is not limited to the form shown in FIG. 13; any easily understood display can be used.
[0053]
The above description covers the case where, in S57, the content cluster is associated with cluster 9 as a parent log cluster of cluster 10. Next, the case where the content cluster is associated with cluster 8, the other parent log cluster of cluster 10, will be described. In this case, since the node “function A of model A” included in content cluster 1 is not included in cluster 8, the process proceeds from S58 to S59. Since the relationship between cluster 8 and content cluster 1 corresponds to the link format shown in FIG. 11, an evaluation value is calculated in S61 and an evaluation result is presented. FIG. 14 is an explanatory diagram of another example of the presented evaluation result. In this case, the node “model A” and the node “function A of model A” are highlighted, and the evaluation value is displayed on the link between them. The point of the site to be changed can thereby be shown to the administrator or the like. In this case as well, the display form is arbitrary.
[0054]
If all the nodes of the content cluster are included in the parent cluster in S58, or if the relationship between the parent cluster and the content cluster is not in a predetermined format in S59, the process proceeds to S60 without calculating an evaluation value or displaying an evaluation result. The process also proceeds to S60 after the evaluation value is calculated and the evaluation result is displayed in S61.
[0055]
In S60, it is determined whether or not the parent cluster still includes a content cluster. If so, the parent cluster is added to the set 1 in S62. Since cluster 9, which is a parent cluster of cluster 10, includes content cluster 2 [function A of model A, model A], cluster 9 is added to the set 1. Cluster 8, the other parent cluster of cluster 10, does not include a content cluster and is therefore not added to the set 1.
[0056]
Then, the process returns to S55. In this example, one cluster (cluster 9) still remains in the set 1, so the processing from S56 onward is repeated. Content cluster 2 is associated with cluster 4, which is the parent log cluster of cluster 9, but the node “function A of model A” is not included in cluster 4. Since the relationship between the parent log cluster and the content cluster in this case is in the format shown in FIG. 11, the evaluation value is calculated in S61 and the evaluation result is displayed.
[0057]
Since cluster 4 does not include a content cluster, it is not added to the set 1. The process returns to S55, where the set 1 is now empty, and the processing ends.
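The loop of S55 to S62 traced above can be condensed into the following sketch. It is an illustration under several assumptions, not the disclosed implementation: the function and parameter names are hypothetical, the node sets are inferred from the description, the largest content cluster contained in a log cluster is taken as the one to associate, and the format check of S59 (FIGS. 11 and 12) is replaced by a placeholder predicate that always succeeds.

```python
def detect_deviations(set1, nodes_of, parents_of, content_clusters, in_format):
    """Walk from leaf log clusters toward the root and report missing content nodes."""

    def largest_content(log_nodes):
        contained = [c for c in content_clusters if c <= log_nodes]
        return max(contained, key=len) if contained else None

    worklist, reports = list(set1), []
    while worklist:                                        # S55
        child = worklist.pop(0)                            # S56
        content = largest_content(nodes_of[child])
        if content is None:
            continue
        for parent in parents_of.get(child, []):           # S57
            missing = content - nodes_of[parent]           # S58
            if missing and in_format(parent, content):     # S59
                reports.append((child, parent, missing))   # S61 would score and present this
            still_has_content = largest_content(nodes_of[parent]) is not None   # S60
            if still_has_content and parent not in worklist:
                worklist.append(parent)                    # S62
    return reports


A, FA, B = "model A", "function A of model A", "model B"
base = {"site entrance", "printer"}
nodes_of = {                                   # node sets inferred from the description
    4: frozenset(base | {A}),
    8: frozenset(base | {A, B}),
    9: frozenset(base | {A, FA}),
    10: frozenset(base | {A, FA, B}),
}
parents_of = {10: [9, 8], 9: [4]}              # only the relationships used in the walk-through
content_clusters = [frozenset({A, FA, B}), frozenset({A, FA})]   # content clusters 1 and 2

reports = detect_deviations([10], nodes_of, parents_of, content_clusters,
                            in_format=lambda parent, content: True)
for child, parent, missing in reports:
    print(child, parent, sorted(missing))
```

With these inputs the sketch reports “model B” for cluster 9, “function A of model A” for cluster 8, and “function A of model A” for cluster 4, in line with the walk-through.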
[0058]
In this way, it is possible to detect and display the deviation between the nodes that the site creator, judging from the contents, intended to be accessed together and the nodes that are actually accessed. For example, by referring to a display such as that shown in FIG. 13, the administrator can learn that model A is popular, or can take measures to increase access to model B. Further, by referring to a display such as that shown in FIG. 14, the administrator can take measures so that function A of model A is accessed more.
[0059]
The above-described embodiment can also be realized by a computer program. In that case, the program, the data used by the program, and the like can be stored in a computer-readable storage medium. A storage medium is a medium that causes a change in the state of energy such as magnetism, light, or electricity in a reading device provided in the hardware resources of a computer in accordance with the description content of the program, and that can transmit the description content of the program to the reading device in the form of a corresponding signal. Examples include a magnetic disk, an optical disk, a CD-ROM, and a memory built into a computer.
[0060]
[Effect of the invention]
As is clear from the above description, according to the present invention, it is possible to detect a deviation between the user's access tendency and the intention of the content creator of the website and provide support when changing the content of the website. It is possible to maintain and manage the website promptly according to the user's trend.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an embodiment of a hypertext structure change support device of the present invention.
FIG. 2 is a block diagram showing an example of a log clustering unit in the embodiment of the hypertext structure change support device of the present invention.
FIG. 3 is a flowchart illustrating an example of an operation in the log clustering unit 2.
FIG. 4 is an explanatory diagram of a specific example of the structure of content stored in the Web server 1.
FIG. 5 is a flowchart illustrating an example of processing for determining whether or not a cluster candidate satisfies a cluster criterion.
FIG. 6 is an explanatory diagram of an example of a log cluster.
FIG. 7 is an explanatory diagram of an example of a content cluster.
FIG. 8 is a flowchart showing an example of operations in the intention deviation detection unit 4 and the intention deviation presentation unit 5.
FIG. 9 is an explanatory diagram of a correspondence relationship between a log cluster and a content cluster.
FIG. 10 is an explanatory diagram of an example of how content clusters and log clusters overlap.
FIG. 11 is an explanatory diagram of an example of a deviation between a content cluster and a log cluster.
FIG. 12 is an explanatory diagram of another example of a deviation between a content cluster and a log cluster.
FIG. 13 is an explanatory diagram of an example of the presented evaluation result.
FIG. 14 is an explanatory diagram of another example of presented evaluation results.
FIG. 15 is an explanatory diagram of an example of a Web site structure and a user access route.
FIG. 16 is an explanatory diagram of an example of a Web site structure and the number of user transitions between nodes.
[Explanation of symbols]
1 ... Web server, 2 ... Log clustering unit, 3 ... Content clustering unit, 4 ... Intention deviation detection unit, 5 ... Intention deviation presentation unit, 11 ... Log recording unit, 12 ... Content provision unit, 21 ... Access information storage unit, 22 ... Access number counting unit, 23 ... Cluster configuration unit, 24 ... Cluster generation acceleration unit, 25 ... Cluster generation restriction unit.

Claims

A hypertext structure change support device comprising: log clustering means for generating hierarchical log clusters from a hypertext set based on information on the frequency of user access to a hypertext information group managed by an information system; content clustering means for calculating the similarity of the individual hypertexts in the hypertext information group and generating content clusters from the hypertext set based on the similarity; intention deviation detection means for identifying a partial structure that cannot be matched between the log clustering result obtained by the log clustering means and the content clustering result obtained by the content clustering means; and intention deviation presentation means for presenting hypertext that is an element of a content cluster that could not be matched by the intention deviation detection means.

The hypertext structure change support device according to claim 1, wherein the log clustering means includes: access information storage means for assigning, from an access history in which the identifier of a user who accessed the hypertext information group, the time, and the location of the accessed information are recorded, one identifier to users having the same user identifier who accessed within a predetermined period, and storing, as a set, the identifier and the information accessed by the user having the identifier, recorded in time-series order without duplication; access number counting means for calculating the number of identifiers having a history of accessing all of a given hypertext set in a manner satisfying a given order constraint; and cluster configuration means for generating hierarchical clusters from the hypertext set based on the number of identifiers calculated by the access number counting means, and wherein the cluster configuration means takes a cluster consisting only of a predetermined starting hypertext as the apex; generates, as new cluster candidates one level below, cluster sets each consisting of the starting hypertext paired with one hypertext taken from the set of hypertexts that can be reached by following one link from the starting hypertext and for which an access history has been acquired; makes each cluster candidate a new cluster that is a child cluster of the starting hypertext when the access number counting means calculates that the cluster candidate has a number of accesses equal to or greater than a predetermined threshold; likewise, for each child cluster, takes out one at a time the hypertexts that are included in the set of hypertexts reachable by following one link from the hypertext set included in the child cluster, for which an access history has been acquired, and which are not included in the child cluster, and adds each to the hypertext set included in the child cluster to generate a cluster candidate one level further below; generates a new child cluster having the child cluster as its parent cluster when it is calculated that the cluster candidate has a number of accesses equal to or greater than the predetermined threshold; and generates the hierarchical clusters obtained by repeating this process until no new child cluster is generated.

The hypertext structure change support device, wherein the log clustering means further includes: cluster generation acceleration means that, in forming the hierarchical clusters, does not generate a newly generated cluster candidate when the number of accesses of a higher-level cluster having a subset of the hypertext set included in the cluster candidate is equal to or less than a predetermined threshold; and cluster generation restriction means that, for a predetermined hypertext, restricts the number of child cluster candidates generated by not using, when generating new cluster candidates, the set of hypertexts that can be reached by following one link from that hypertext alone.

The hypertext structure change support device according to claim 2, wherein, when a child cluster candidate generated from a parent cluster has already been generated as a child cluster of another parent cluster, the cluster configuration means generates a parent-child relationship between the parent cluster attempting to newly generate the child cluster and the already generated child cluster.

The hypertext structure change support device according to claim 1, wherein the intention deviation detection means, for the set of log clusters having no child cluster among the hierarchically structured clusters obtained by the log clustering means, selects, from the set of content clusters obtained by the content clustering means based on the similarity between hypertexts, the content clusters included in each cluster of the log cluster set; associates each selected content cluster with the parent log cluster of the log cluster with which it was associated; calculates whether or not the hypertexts in the content cluster are included in the parent log cluster; and detects a hypertext as an intention deviation when the hypertext is not included, the hypertext is in a predetermined link relationship with the hypertext set included in the upper log cluster, and the number of accesses to the upper log cluster and the change in the number of accesses to the corresponding log cluster are in a predetermined relationship; wherein the intention deviation presentation means presents to the user the hypertexts that are in a predetermined relationship with the detected hypertext; and wherein, when a content cluster included in the parent log cluster exists, the intention deviation detection means and the intention deviation presentation means repeat the above processing, so that detection and presentation of intention deviations are repeated until no content cluster included in the parent log cluster remains or a predetermined stop condition is satisfied.

The hypertext structure change support device according to claim 5, wherein the intention deviation detection means uses clusters having a hierarchical structure based on the similarity between hypertexts, and excludes the clusters of the upper level when even one element of a lower-level cluster is not included.

The hypertext structure change support device according to claim 3, wherein the cluster generation restriction means calculates the similarity between hypertexts linked from the same link-source hypertext, detects link-source hypertexts for which the average of the similarities is equal to or less than a predetermined threshold, and uses the detected hypertexts as the predetermined hypertext.

A hypertext structure change support method, wherein: log clustering means generates hierarchical log clusters from a hypertext set based on information on the frequency of user access to a hypertext information group managed by an information system; content clustering means calculates the similarity of the individual hypertexts in the hypertext information group and generates content clusters from the hypertext set based on the calculated similarity; intention deviation detection means identifies a partial structure that cannot be matched between the log clusters and the content clusters; and intention deviation presentation means presents hypertext that is an element of a content cluster that could not be matched.

The hypertext structure change support method according to claim 8, wherein the process of generating the log clusters in the log clustering means: assigns, from an access history in which the identifier of a user who accessed the hypertext information group, the time, and the location of the accessed information are recorded, one identifier to users having the same user identifier who accessed within a predetermined period, and stores, as access information, the identifier paired with the information accessed by the user having the identifier, recorded in time-series order without duplication; calculates as the number of accesses, using the access information, the number of identifiers having a history of accessing all of a given hypertext set in a manner satisfying a given order constraint; takes a cluster consisting only of a predetermined starting hypertext as the apex and generates, as new cluster candidates one level below, cluster sets each consisting of the starting hypertext paired with one hypertext taken from the set of hypertexts that can be reached by following one link from the starting hypertext and for which an access history has been acquired; makes each cluster candidate a new cluster that is a child cluster of the starting hypertext when it is calculated that the cluster candidate has a number of accesses equal to or greater than a predetermined threshold; likewise, for each child cluster, takes out one at a time the hypertexts that are included in the set of hypertexts reachable by following one link from the hypertext set included in the child cluster, for which an access history has been acquired, and which are not included in the child cluster, and adds each to the hypertext set included in the child cluster to generate a cluster candidate one level further below; generates a new child cluster having the child cluster as its parent cluster when it is calculated that the cluster candidate has a number of accesses equal to or greater than the predetermined threshold; and generates the hierarchical clusters obtained by repeating this process until no new child cluster is generated.

The hypertext structure change support method according to claim 8, wherein the process of identifying, in the intention deviation detection means, a partial structure that cannot be matched between the log clusters and the content clusters and presenting, in the intention deviation presentation means, hypertext that is an element of a content cluster that could not be matched: for the set of log clusters having no child cluster among the hierarchically structured clusters obtained by the log clustering process, selects, from the set of content clusters obtained by the content clustering process based on the similarity between hypertexts, the content clusters included in each cluster of the log cluster set; associates each selected content cluster with the parent log cluster of the log cluster with which it was associated; calculates whether or not the hypertexts in the content cluster are included in the parent log cluster; detects a hypertext as an intention deviation when the hypertext is not included, the hypertext is in a predetermined link relationship with the hypertext set included in the upper log cluster, and the number of accesses to the upper log cluster and the change in the number of accesses to the corresponding log cluster are in a predetermined relationship; presents to the user the hypertexts that are in a predetermined relationship with the detected hypertext; and, when a content cluster included in the parent log cluster exists, repeats the above processing, so that detection and presentation of intention deviations are repeated until no content cluster included in the parent log cluster remains or a predetermined stop condition is satisfied.

A storage medium storing a hypertext structure change support program for causing a computer to execute: log clustering processing for generating hierarchical log clusters from a hypertext set based on information on the frequency of user access to a hypertext information group managed by an information system; content clustering processing for calculating the similarity of the individual hypertexts in the hypertext information group and generating content clusters from the hypertext set based on the similarity; intention deviation detection processing for identifying a partial structure that cannot be matched between the log clustering result obtained by the log clustering processing and the content clustering result obtained by the content clustering processing; and intention deviation presentation processing for presenting hypertext that is an element of a content cluster that could not be matched.

The storage medium storing a hypertext structure change support program according to claim 11, wherein the log clustering processing: assigns, from an access history in which the user identifier, the time, and the location of the accessed information are recorded, one identifier to users having the same user identifier who accessed within a predetermined period, and stores, as access information, the identifier paired with the information accessed by the user having the identifier, recorded in time-series order without duplication; calculates as the number of accesses, using the access information, the number of identifiers having a history of accessing all of a given hypertext set in a manner satisfying a given order constraint; takes a cluster consisting only of a predetermined starting hypertext as the apex and generates, as new cluster candidates one level below, cluster sets each consisting of the starting hypertext paired with one hypertext taken from the set of hypertexts that can be reached by following one link from the starting hypertext and for which an access history has been acquired; makes each cluster candidate a new cluster that is a child cluster of the starting hypertext when it is calculated that the cluster candidate has a number of accesses equal to or greater than a predetermined threshold; likewise, for each child cluster, takes out one at a time the hypertexts that are included in the set of hypertexts reachable by following one link from the hypertext set included in the child cluster, for which an access history has been acquired, and which are not included in the child cluster, and adds each to the hypertext set included in the child cluster to generate a cluster candidate one level further below; generates a new child cluster having the child cluster as its parent cluster when it is calculated that the cluster candidate has a number of accesses equal to or greater than the predetermined threshold; and generates the hierarchical clusters obtained by repeating this process until no new child cluster is generated.

The storage medium storing a hypertext structure change support program according to claim 11, wherein the intention deviation detection processing, for the set of log clusters having no child cluster among the hierarchically structured clusters obtained by the log clustering processing, selects, from the set of content clusters obtained by the content clustering processing based on the similarity between hypertexts, the content clusters included in each cluster of the log cluster set; associates each selected content cluster with the parent log cluster of the log cluster with which it was associated; calculates whether or not the hypertexts in the content cluster are included in the parent log cluster; and detects a hypertext as an intention deviation when the hypertext is not included, the hypertext is in a predetermined link relationship with the hypertext set included in the upper log cluster, and the number of accesses to the upper log cluster and the change in the number of accesses to the corresponding log cluster are in a predetermined relationship; wherein the intention deviation presentation processing presents to the user the hypertexts that are in a predetermined relationship with the detected hypertext; and wherein, when a content cluster included in the parent log cluster exists, the intention deviation detection processing and the intention deviation presentation processing repeat the above processing, so that detection and presentation of intention deviations are repeated until no content cluster included in the parent log cluster remains or a predetermined stop condition is satisfied.