JP3774145B2 - Web site internal structure estimation device, internal structure estimation method, program for the method, and recording medium recording the program

Info

Publication number: JP3774145B2
Application number: JP2001389448A
Authority: JP
Inventors: 憲一森
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-12-21
Filing date: 2001-12-21
Publication date: 2006-05-10
Anticipated expiration: 2021-12-21
Also published as: JP2003186883A

Description

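[0001]
[Technical field to which the invention belongs]
The present invention relates to a Web page search scheme for searching Web pages provided on the WWW (World Wide Web) and presenting them to users in units of Web sites, and in particular to an apparatus and a method for estimating, from a set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site.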
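[0002]
[Prior art]
The WWW stores sets of Web pages as hypermedia information with link relationships. A user reaches a Web server specified by a URL (Universal Resource Locator) and, by searching Web pages and following their links, can obtain the information in the target set of pages.
[0003]
A search engine is a means of Web information retrieval for cases where the URL is unknown or the search range is to be extended. The search engine collects a large number of Web pages by repeatedly analyzing the HTML (Hyper Text Markup Language) of a Web page and obtaining the URLs of the link targets it contains, and it extracts the desired page information when keywords or topics are specified against the collected Web pages.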
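[0004]
[Problems to be solved by the invention]
Web information on the WWW is organized not page by page but in units of Web sites, that is, sets of pages, and to obtain only information that fits the search purpose it is desirable to be able to search in units of Web sites. Here, a Web site means the top page (representative page) together with those pages, among the pages reachable from it by following links, that belong to the same server as the top page and are closely related to it.
[0005]
However, conventional URL-based Web page search retrieves a server and the Web pages linked to it, so obtaining the target page information together with its related page information, that is, retrieving information in units of Web sites, takes considerable effort.
[0006]
Search using a search engine, on the other hand, often collects a very large number of pages that have not been organized, selected, or processed, so it is extremely difficult for the user to extract Web-site-level information that fits the search purpose.
[0007]
As an information extraction method suited to the search purpose, the document "Citation Engine: A WWW Ranking System Using Link Analysis" (Hajime Takano and Shinya Kubo, pp. 9-16, IPSJ SIG Technical Report 2000-DBS-120 (Database Systems), Information Processing Society of Japan, January 2000) determines the page called the representative page (top page) among the collected pages from the number of link references from external servers. With this technique, however, a page that is not linked from many external servers cannot be extracted as the top page of a Web site. In addition, enough links (that is, pages) must be collected beforehand to determine the representative page.
[0008]
The method disclosed in Japanese Patent Application Laid-Open No. 9-231238 merely clusters page search results by subject analysis, so it cannot provide Web-site-level information that fits the purpose.
[0009]
If the internal structure of a Web site is known, the hierarchical relationships among retrieved pages become easier to recognize and navigation to the target page becomes easier, so the above drawbacks can be eliminated.
[0010]
As a method for estimating the internal structure of a Web site, the above document proposes representing the relationships between Web pages as a directed graph and, based on a breadth-first search of this graph, estimating a tree structure made up of the shortest paths from the top page to each page. As the same document notes, however, this method does not work well when the original graph has a structure close to a complete graph.
[0011]
Furthermore, Japanese Patent Application Laid-Open No. 2001-60165 proposes a method that determines the parent-child relationships of pages by taking link relationships between directories into account in a determination using an information tree, but because the determination rests on link relationships alone, the internal structure cannot be determined more accurately.
[0012]
An object of the present invention is to provide a Web site internal structure estimation apparatus, method, program, and recording medium that make it possible to estimate the internal structure of a Web site accurately and, on the basis of this internal structure, make it easy to retrieve information in units of Web sites that fits the search purpose.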
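[0013]
[Means for solving the problems]
To solve the above problems, the present invention first obtains, for each page and link of the retrieved Web page set, the classification likelihood of each link for each link type by using meta-information and a link classification tree, and further acquires a page classification tree in which page classes have been extracted from meta-information and page types, and extracts the page type classification likelihoods of each page. Then, for every server to which the Web pages belong, it obtains a set of top page candidates for Web sites based on these classification likelihoods and estimates their parent pages. From the set of pages whose parent page is still undecided after this estimation, it takes as a top page candidate the page that lies in the shallowest directory level and has the largest sum of top page, index page, and menu page likelihoods, and estimates parent pages from it. The estimated parent pages and the sets of pages linked to them are estimated as the internal structure of each Web site. The invention is characterized by the following apparatus, method, program, and recording medium.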
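[0014]
(Invention of the apparatus)
(1) An apparatus that collects a set of Web pages on the WWW and estimates, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site, comprising:
link type classification likelihood extraction means for extracting meta-information based on surface knowledge for each page and link of the page set, and for extracting the classification likelihood of each link for each link type by using a link classification tree;
page type classification likelihood extraction means for acquiring a page classification tree for the pages of the page set and, based on this page classification tree, extracting the classification likelihood of each page of the page set for each page type; and,
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, the following first to third means:
first means for acquiring a set of top page candidates for Web sites;
second means for determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate; and
third means for taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and for determining parent pages from this top page candidate based on the link type classification likelihoods;
wherein the parent pages are estimated by repeating these means.
[0015]
(2) The link type classification likelihood extraction means prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information.
[0016]
(3) The page classification likelihood extraction means acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type.
[0017]
(4) The first means repeats, for each server to which the pages of the page set belong, the following means:
- means A for estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- means B for determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- means C for determining, for each directory level, as a top page candidate the page that belongs to the directory level determined by means B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- means D for taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold.
[0018]
(5) The second means obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page.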
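[0019]
(Invention of the method)
(6) A method of collecting a set of Web pages on the WWW into a database and estimating, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site by information processing performed by computer software, wherein
the computer software performs:
a link type classification likelihood extraction step of extracting meta-information based on surface knowledge for each page and link of the page set, and extracting the classification likelihood of each link for each link type by using a link classification tree;
a page type classification likelihood extraction step of acquiring a page classification tree for the pages of the page set and, based on this page classification tree, extracting the classification likelihood of each page of the page set for each page type; and,
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, the following first to third steps:
a first step of acquiring a set of top page candidates for Web sites;
a second step of determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate; and
a third step of taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages from this top page candidate based on the link type classification likelihoods;
and the parent pages are estimated by repeating these steps.
[0020]
(7) The link type classification likelihood extraction step prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information.
[0021]
(8) The page classification likelihood extraction step acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type.
[0022]
(9) The first step repeats, for each server to which the pages of the page set belong, the following steps:
- step A of estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- step B of determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- step C of determining, for each directory level, as a top page candidate the page that belongs to the directory level determined in step B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- step D of taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold.
[0023]
(10) The second step obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page.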
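[0024]
(Invention of the program)
(11) A program that enables a computer to execute a method of collecting a set of Web pages on the WWW and estimating, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site, comprising:
a link type classification likelihood extraction procedure that prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information;
a page classification likelihood extraction procedure that acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type; and
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, a first procedure of acquiring a set of top page candidates for Web sites, a second procedure of determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate, and a third procedure of taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages from this top page candidate based on the link type classification likelihoods, the parent pages being estimated by repeating these procedures;
wherein the first procedure repeats, for each server to which the pages of the page set belong, the following procedures:
- procedure A of estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- procedure B of determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- procedure C of determining, for each directory level, as a top page candidate the page that belongs to the directory level determined in procedure B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- procedure D of taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold;
the second procedure obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page;
the third procedure takes out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, determines parent pages from this top page candidate based on the link type classification likelihoods, and estimates the parent pages by repeating this;
and the program causes a computer to execute each of these procedures.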
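[0025]
(Invention of the recording medium)
(12) A recording medium on which is recorded a program that enables a computer to execute a method of collecting a set of Web pages on the WWW and estimating, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site, the program comprising:
a link type classification likelihood extraction procedure that prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information;
a page classification likelihood extraction procedure that acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type; and
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, a first procedure of acquiring a set of top page candidates for Web sites, a second procedure of determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate, and a third procedure of taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages from this top page candidate based on the link type classification likelihoods, the parent pages being estimated by repeating these procedures;
wherein the first procedure repeats, for each server to which the pages of the page set belong, the following procedures:
- procedure A of estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- procedure B of determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- procedure C of determining, for each directory level, as a top page candidate the page that belongs to the directory level determined in procedure B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- procedure D of taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold;
the second procedure obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page;
the third procedure takes out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, determines parent pages from this top page candidate based on the link type classification likelihoods, and estimates the parent pages by repeating this;
and the recorded program causes a computer to execute each of these procedures.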
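[0026]
[Embodiments of the invention]
FIG. 1 shows the flow for estimating the internal structure of Web pages for Web-site-level search according to an embodiment of the present invention. The internal structure estimation system is built from a database that can store and manage the retrieved page data, link data, and estimated internal structure data, and a computer equipped with the software needed to execute the estimation flow of FIG. 1.
[0027]
In FIG. 1, first, for the entire set of Web pages collected by a conventional search engine, meta-information based on surface knowledge is extracted for each page and its links, and the classification likelihood of each link for each link type is obtained using a link classification tree and stored in the database (S1). The link classification tree, link types, and classification likelihoods used here are described below with reference to FIG. 2.
[0028]
The link classification tree gives link type classification likelihoods for each page of the Web pages. To acquire it, as shown in FIG. 2(a), a set of Web pages to which link types have been assigned by hand (training pages) is first prepared (S1A). The link types here are, for example, trunk links, i.e. links that form the branches of the site's tree structure, and other links (links that do not fall under the above classification).
[0029]
Next, the meta-information of the links from each page of the training data is extracted (S1B). The link meta-information consists of, for example, the following items: the directory distance between the link source and the link target (the distance between directory A and directory B in the directory tree rooted at "/", that is, the number of edges traversed when going from node A to node B by the shortest route in the tree); whether the directory containing the link source page includes a specific word; whether the anchor text includes a specific word; the file name of the link source page; the file name of the link target page; whether the file name of the link source page matches its directory; whether the file name of the link target page matches its directory; and whether the link tag (A tag) has a target attribute.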
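The directory distance defined above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `directory_parts` and `directory_distance` are hypothetical, and the sketch assumes both URLs are on the same server, so only their paths are compared.

```python
from urllib.parse import urlsplit


def directory_parts(url: str) -> list[str]:
    """Return the directory components of a URL path, with the file name dropped."""
    path = urlsplit(url).path or "/"
    # Everything before the last "/" is the directory; the rest is the file name.
    directory = path.rsplit("/", 1)[0]
    return [part for part in directory.split("/") if part]


def directory_distance(src_url: str, dst_url: str) -> int:
    """Number of edges on the shortest route between the two directories
    in the directory tree rooted at "/"."""
    src, dst = directory_parts(src_url), directory_parts(dst_url)
    common = 0
    for a, b in zip(src, dst):
        if a != b:
            break
        common += 1
    # Climb from the source up to the deepest shared directory, then descend.
    return (len(src) - common) + (len(dst) - common)
```

A link between two files in the same directory has distance 0, and a link between sibling directories has distance 2 (one edge up, one edge down).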
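[0030]
Finally, the pairs of link meta-information and link type are given to a machine learning algorithm as training data, and a link classification tree that has learned the link classes is acquired (S1C).
[0031]
The page types are, for example, top page, index page (a page that gathers the indexes of directories), menu page (a page other than the index type that shows a list of contents), link collection page (a page of links to other sites), and other page (a page that does not fall under the above classification).
[0032]
As the machine learning algorithm, for example, C4.5 is used (C4.5: Programs for Machine Learning, J. Ross Quinlan, Morgan Kaufmann Publishers, ISBN 1-55860-238-0).
[0033]
Next, the link type classification likelihoods are obtained as shown by loops SL1 and SL2 in FIG. 2(b). In loop SL1, for every page, the page meta-information is extracted and stored in the page database together with the page's URL and the directory level at which the page resides (S1D).
[0034]
The page meta-information includes, for example, the following items: whether the title string contains a specific word (such as "homepage", ホームページ); whether a counter CGI (Common Gateway Interface) is present; whether the page is a frame page; whether the content contains a specific phrase (such as "this page is", このページは); whether the URL string contains a specific string (such as "homepage", ホームページ); whether, among the links whose anchor text contains a string such as "back" (戻る), there is a link to the parent directory; the number of links whose anchor text contains a string such as "back" (戻る); the number of links to pages on external servers; the number of links to pages in the same directory; the number of links to pages in lower directories; and the number of links to pages in sibling directories.
[0035]
In loop SL2, for the page being processed by loop SL1, all link information from that page is processed: the link information is extracted from the page, the link meta-information for each link is obtained (S1E), the classification likelihood for each link type is obtained from one set of meta-information based on the link classification tree (S1F), and the resulting classification likelihoods are stored in the link database (S1G).
[0036]
As a result of the above processing, link classification likelihoods are obtained in the link database for each page, as shown in FIG. 9. The link classification likelihood obtained here is the probability that the page in question is classified into one link type, and the sum of the classification likelihoods of the link types for a given page is 1. For example, for a given page the classification likelihood for trunk link is 0.75 and that for other link is 0.25.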
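[0037]
Returning to FIG. 1, the classification likelihood of each page for each page type is obtained using a page classification tree acquired in advance, and is stored in the page database (S2). The page classification tree, page types, and classification likelihoods used here are described below with reference to FIG. 3.
[0038]
As the structural example in FIG. 9 shows, the page classification tree holds a classification question at each non-leaf node (for example, whether the number of links is N or more) and the decided page type at each leaf node, and its edges represent, for example, the yes/no branches for a node's question. To acquire it, as shown in FIG. 3(a), a set of Web pages to which page types have been assigned by hand (training pages) is first prepared (S3A). The page types here are, for example, top page, index page (a page that gathers the indexes of directories), menu page (a page other than the index type that shows a list of contents), link collection page (a page of links to other sites), and other page (a page that does not fall under the above classification).
[0039]
Next, meta-information based on the surface features of the pages is extracted (S3B), and the pairs of page meta-information and page type are given to a machine learning algorithm as training data to acquire a page classification tree that has learned the page classes (S3C). As the machine learning algorithm, for example, the C4.5 mentioned above is used.
[0040]
Next, the page type classification likelihoods are obtained as shown by loop SL3 in FIG. 3(b): for every page, meta-information based on the surface features of the page is extracted (S3D), the page meta-information is input to the page classification tree to obtain the classification likelihood for each page type (S3E), and the resulting classification likelihoods are stored in the page database (S3F).
[0041]
As a result of the above processing, page type classification likelihoods are stored in the page database for each page, as shown in FIG. 9. The classification likelihood obtained here is the probability that the page is classified into one page type, and the sum of the classification likelihoods of the page types for a given page is 1. For example, for a given page the classification likelihood for top page is 0.75, for menu page 0.2, and for other page 0.05.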
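One way to obtain likelihoods that sum to 1 from a classification tree is sketched below. The text does not say how the likelihoods are derived from the tree, so this sketch assumes that each leaf keeps the counts of training pages per page type and that the likelihood is the normalized count. It does not implement C4.5; the tree is built by hand, and the names `Leaf`, `Node`, `classify`, and the meta-information key `title_has_homepage` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    # Assumed: number of training pages of each page type that reached this leaf.
    counts: Dict[str, int]


@dataclass
class Node:
    question: Callable[[dict], bool]  # yes/no question about the meta-information
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], meta: dict) -> Dict[str, float]:
    """Walk the tree with one set of meta-information and return the
    classification likelihood for each page type (values sum to 1)."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(meta) else node.no
    total = sum(node.counts.values())
    return {page_type: n / total for page_type, n in node.counts.items()}


# Hand-built example tree with a single question (illustrative values only).
example_tree = Node(
    question=lambda meta: meta["title_has_homepage"],
    yes=Leaf({"top": 15, "menu": 4, "other": 1}),
    no=Leaf({"top": 1, "menu": 3, "other": 16}),
)
```

With the illustrative counts above, a page whose title contains the specific word lands in the first leaf and receives the likelihoods 0.75, 0.2, and 0.05 used as the example in the text. The same structure would apply to the link classification tree with link types in place of page types.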
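[0042]
Returning again to FIG. 1, loops L1 to L3 carry out the internal structure estimation of Web sites, with the preceding processes S1 and S2 serving as the preparatory stage.
[0043]
First, loop L1 processes all servers and takes out the set of pages belonging to a server for which parent page estimation has not been performed (S3). As preprocessing for this step, the Web server name to which each page belongs is looked up, page by page, for the entire set of Web pages collected by the conventional search engine, and these names are stored in the server name database as shown in FIG. 9.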
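The preprocessing and loop L1 can be sketched as below. This is an assumption-laden illustration: an in-memory dictionary stands in for the server name database, the URL host is taken as the server name, and the names `build_server_index` and `unprocessed_servers` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_server_index(urls):
    """Map each server name (URL host) to the collected pages that belong to it."""
    index = defaultdict(list)
    for url in urls:
        index[urlsplit(url).netloc].append(url)
    return dict(index)


def unprocessed_servers(index, done):
    """Yield (server, pages) for servers whose parent pages are not yet estimated."""
    for server, pages in index.items():
        if server not in done:
            yield server, pages
```

A caller would add each server name to `done` once processes S4 to S8 have finished for it, so that loop L1 visits every server exactly once.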
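[0044]
Next, loop L2 processes all pages belonging to the server taken out in process S3 and, until parent page estimation for that server is finished, extracts the set of top page candidates of the site using a Web site top page discovery technique (S4).
[0045]
FIG. 4 shows this processing flow. In FIG. 4, first, if the page database contains a page that belongs to the server being processed, is at directory level 0, and has the file name "/" or "index.html", that page is regarded as located at the top level and is taken as a top page candidate (S4A). For example, among the pages whose server name is XX, if the directory level under server name XX is level "0" in FIG. 9 and the page's file name is "/" or "index.html", the page is estimated to be a top page candidate for that server name.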
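Process S4A can be sketched as a filter over the collected URLs. The sketch assumes that a URL ending in "/" corresponds to the file name "/" in the text (represented here by an empty file name) and that the server name is the URL host; the function names are hypothetical.

```python
from urllib.parse import urlsplit

# "" stands for a URL that ends in "/", i.e. the file name "/" in the text.
INDEX_NAMES = {"", "index.html"}


def directory_depth(url: str) -> int:
    """Directory level of the page: 0 for files directly under "/"."""
    path = urlsplit(url).path or "/"
    return len([part for part in path.rsplit("/", 1)[0].split("/") if part])


def file_name(url: str) -> str:
    """File name part of the URL path ("" when the URL ends in "/")."""
    path = urlsplit(url).path or "/"
    return path.rsplit("/", 1)[1]


def root_top_page_candidates(server: str, urls):
    """S4A: pages of `server` at directory level 0 named "/" or "index.html"."""
    return [
        url
        for url in urls
        if urlsplit(url).netloc == server
        and directory_depth(url) == 0
        and file_name(url) in INDEX_NAMES
    ]
```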
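[0046]
Next, the directory level at which top page candidates exist for server name XX is determined (S4B). FIG. 5 shows the flow of this directory level determination.
[0047]
In FIG. 5, among the pages stored in the page database that have server name XX, the pages belonging to the first directory level (level 1) are extracted. Of these, the group of pages whose file name is "index.html" or "/", or whose file name is the same as the directory name, is referenced, and the average top page class classification likelihood for level 1 is computed from the classification likelihoods of this group (S4B1). It is then judged whether the average obtained in S4B1 is at least a fixed value (for example 0.6) (S4B2); if it is, the directory level at which top pages exist is decided to be "level 1" (S4B3). If the average falls short of the fixed value, level 1 is taken to contain no top page, and among the pages with server name XX those belonging to the second directory level (level 2) are extracted. Of these, the group of pages whose file name is "index.html" or "/", or whose file name is the same as the directory name, is referenced, and the average top page class classification likelihood for level 2 is computed from the classification likelihoods of this group (S4B4). It is then judged whether the average obtained in S4B4 is at least a fixed value (for example 0.55) (S4B5); if it is, the directory level at which top pages exist is decided to be "level 2" (S4B6). If the average falls short of the fixed value, the top page directory level for server name XX is set to "0" (S4B7).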
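The level decision of FIG. 5 can be sketched as below. The page records are hypothetical dictionaries with the keys `depth`, `dir` (name of the page's own directory), `file`, and `top` (top page likelihood). Two points are assumptions of this sketch rather than statements of the text: a file counts as having the same name as its directory when the file name without extension equals the directory name, and a level with no matching pages is treated as falling short of the threshold.

```python
def is_index_like(page: dict) -> bool:
    """File name is "/" (stored as ""), "index.html", or matches the directory name."""
    stem = page["file"].rsplit(".", 1)[0]
    return page["file"] in ("", "index.html") or stem == page["dir"]


def decide_top_page_level(pages, thresholds=((1, 0.6), (2, 0.55))) -> int:
    """S4B: return the directory level at which top pages are taken to exist."""
    for level, threshold in thresholds:
        scores = [
            page["top"]
            for page in pages
            if page["depth"] == level and is_index_like(page)
        ]
        # S4B2 / S4B5: compare the average top page likelihood with the fixed value.
        if scores and sum(scores) / len(scores) >= threshold:
            return level
    return 0  # S4B7: fall back to level 0
```

The thresholds 0.6 and 0.55 are the example values given in the text; passing them as a parameter keeps the sketch open to other settings.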
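[0048]
Returning to FIG. 4, after the top page directory level has been determined by process S4B, the top page is determined for each directory among the pages that belong to this top page directory level, with the file names "index.html" and "/" and the sum of the top page, index page, and menu page classification likelihoods as the highest-priority criteria (S4C).
[0049]
FIG. 6 shows process S4C in detail. First, among all files belonging to the top page directory level, the URLs of the pages whose file name is "/" or "index.html", or whose sum of top page, index page, and menu page classification likelihoods is at least a fixed value, are sorted and retrieved from the page database (S4C1). Next, loop SL4 repeats processes S4C2 to S4C7 for all directories to determine the top page of each directory. These processes are described below.
[0050]
The first URL of the URL sequence sorted in S4C1 is taken (S4C2), and the directory name corresponding to the top page directory level is extracted from that URL (S4C3). For example, if the top page directory level is "1" and the URL is "http://a.com/dir1/index.html", the string "http://a.com/dir1/" is the directory name extracted. Next, the URLs whose directory name at the top page directory level prefix-matches that of the URL taken in S4C3 are retrieved from the URL sequence sorted in S4C1 (S4C4). For example, those URLs in the sorted sequence that prefix-match the string "http://a.com/dir1/" are retrieved. Next, it is judged whether the set of pages in the directory at the top page directory level contains a page whose file name is "/" or "index.htm(1)" (S4C5); if it does, that page is regarded as the highest-ranked among the pages in the same directory and is taken as the top page (S4C6). If it does not, the page with the largest sum of top page, index page, and menu page classification likelihoods among all pages in the directory is taken as the top page (S4C7). If several pages share the largest sum, they are sorted in URL string order and the first is taken as the top page.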
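Processes S4C3 and S4C5 to S4C7 can be sketched as two small functions. The page records are hypothetical dictionaries with the keys `url`, `file`, `top`, `index`, and `menu`. The sketch recognizes only the file names "/" (stored as "") and "index.html", and when several index-named pages exist it takes the first in URL order; both choices are assumptions of this sketch.

```python
from urllib.parse import urlsplit


def directory_prefix(url: str, level: int) -> str:
    """S4C3: the URL prefix naming the directory at the given directory level."""
    parts = urlsplit(url)
    dirs = [d for d in parts.path.rsplit("/", 1)[0].split("/") if d][:level]
    return f"{parts.scheme}://{parts.netloc}/" + "".join(d + "/" for d in dirs)


def likelihood_sum(page: dict) -> float:
    """Sum of the top page, index page, and menu page likelihoods."""
    return page["top"] + page["index"] + page["menu"]


def pick_top_page(pages) -> str:
    """S4C5 to S4C7 for the pages of one directory: return the top page URL."""
    named = sorted(p["url"] for p in pages if p["file"] in ("", "index.html"))
    if named:
        return named[0]  # S4C6: an index-named page takes precedence
    best = max(likelihood_sum(p) for p in pages)
    # S4C7: largest likelihood sum; ties are broken by URL string order.
    return min(p["url"] for p in pages if likelihood_sum(p) == best)
```

Grouping the sorted URL sequence by `directory_prefix` and calling `pick_top_page` on each group corresponds to one pass of loop SL4.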
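[0051]
Returning again to FIG. 4, following process S4C, pages that belong to a directory one level below that directory level and whose top page classification likelihood is at or above a threshold are taken as top pages (S4D). As FIG. 7 shows in detail, process S4D retrieves from the database the pages that belong to a directory one level below the top directory level and whose top page class classification likelihood is at or above the threshold (S4D1), and takes the retrieved pages as top pages (S4D2).
[0052]
Returning to FIG. 1, for all top page candidates obtained in process S4, loop L3 estimates parent pages: the set of pages whose parent page has not been estimated is taken out (S5), and the parent pages of this page set are determined (S6).
[0053]
FIG. 8 shows this parent page determination flow. In the figure, the retrieved top page is placed in a queue (S6A) and, until the queue is empty (S6B), one page is taken from the queue and the list of those of its link target pages that are included in the collected data is obtained (S6C). It is judged whether the parent page of the server to which this link target page belongs has been decided (S6D); if no parent page has been decided, the page currently being processed is made the parent page (S6E). If the parent page of the server to which the link target page belongs has been decided, then of the link from the current parent page and the link from the link source page, the source of the link with the larger trunk link likelihood is made the parent page (S6F). After the parent page has been determined in S6E or S6F, the link target page is placed in the queue (S6G), and the same is repeated for the next page in the queue, thereby estimating the parent pages for each top page candidate.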
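The queue-based parent assignment of FIG. 8 can be sketched as follows. The inputs are hypothetical: `links` maps each page to its link targets within the collected data, and `trunk` maps a (source, target) pair to the trunk link likelihood of that link. Three points are assumptions added so that the sketch terminates and yields a tree, and they are not stated in the text: each page is queued only once, the top page itself never receives a parent, and a parent is not replaced when the replacement would create a cycle.

```python
from collections import deque


def _is_ancestor(parent: dict, ancestor: str, page: str) -> bool:
    """True if `ancestor` lies on the parent chain above `page`."""
    while page in parent:
        page = parent[page]
        if page == ancestor:
            return True
    return False


def assign_parents(top_page: str, links: dict, trunk: dict) -> dict:
    """S6A to S6G: return a map from each reached page to its parent page."""
    parent = {}
    queued = {top_page}
    queue = deque([top_page])                      # S6A
    while queue:                                   # S6B
        page = queue.popleft()
        for target in links.get(page, []):         # S6C
            if target in (top_page, page):
                continue                           # assumed: skip root and self-links
            current = parent.get(target)           # S6D
            if current is None:
                parent[target] = page              # S6E
            elif (
                trunk.get((page, target), 0.0) > trunk.get((current, target), 0.0)
                and not _is_ancestor(parent, target, page)  # assumed cycle guard
            ):
                parent[target] = page              # S6F
            if target not in queued:               # assumed: queue each page once
                queued.add(target)
                queue.append(target)               # S6G
    return parent
```

In a small example where the top page links to pages `a` and `b`, and `a` also links to `b` with a higher trunk link likelihood than the link from the top page, `b` ends up as a child of `a` rather than of the top page.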
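[0054]
Returning once more to FIG. 1, from the set of pages that were not detected as top page candidates and whose parent page is still undecided, the page that lies in the shallowest directory level and has the largest sum of top page, index page, and menu page classification likelihoods is taken out as a top page candidate (S7), and parent pages are determined for all such top page candidates by the same processing as S6 (S8).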
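The selection in process S7 can be sketched as below. The page records are hypothetical dictionaries with the keys `url`, `depth`, `top`, `index`, and `menu`; `parent` is a map of already decided parent pages and `roots` is the set of top pages found so far. When several pages tie, this sketch returns the first one in input order, which is an assumption.

```python
def next_top_page_candidate(pages, parent: dict, roots: set):
    """S7: among pages with no decided parent, pick the one in the shallowest
    directory level with the largest top + index + menu likelihood sum."""
    undecided = [
        page
        for page in pages
        if page["url"] not in parent and page["url"] not in roots
    ]
    if not undecided:
        return None
    shallowest = min(page["depth"] for page in undecided)
    level = [page for page in undecided if page["depth"] == shallowest]
    best = max(level, key=lambda page: page["top"] + page["index"] + page["menu"])
    return best["url"]
```

Calling this function repeatedly, and running the parent assignment of S6 from each returned page, continues until it returns `None`, that is, until every collected page of the server is either a top page or has a parent.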
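[0055]
A program that enables a computer to execute part or all of the method shown in FIG. 1 and the other figures can be distributed. The program can also be recorded on a computer-readable recording medium, for example an FD (floppy disk: registered trademark), MO, ROM, memory card, CD, DVD, or removable disk, and provided and distributed in that form.
[0056]
[Effects of the invention]
As described above, according to the present invention, for each page and link of the retrieved Web page set, the classification likelihood of each link for each link type is obtained by using meta-information and a link classification tree; a page classification tree in which page classes have been extracted from meta-information and page types is acquired and the page type classification likelihoods of each page are extracted; for every server to which the Web pages belong, a set of top page candidates for Web sites is obtained based on these classification likelihoods and their parent pages are estimated; from the set of pages whose parent page is still undecided after this estimation, the page that lies in the shallowest directory level and has the largest sum of top page, index page, and menu page likelihoods is taken as a top page candidate and parent pages are estimated from it; and the estimated parent pages and the sets of pages linked to them are estimated as the internal structure of each Web site. As a result, the internal structure of a Web site can be estimated accurately, and information retrieval in units of Web sites that fits the search purpose can be performed on the basis of this internal structure.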
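[Brief description of the drawings]
[FIG. 1] Internal structure estimation flow showing an embodiment of the present invention.
[FIG. 2] Flow for acquiring the link classification tree and the classification likelihoods in the embodiment.
[FIG. 3] Flow for acquiring the page classification tree and the classification likelihoods in the embodiment.
[FIG. 4] Flow for acquiring top page candidates in the embodiment.
[FIG. 5] Flow for determining the directory level of the top page in the embodiment.
[FIG. 6] Flow for determining the top page of each directory in the embodiment.
[FIG. 7] Flow for determining the top page of each directory in the embodiment.
[FIG. 8] Parent page determination flow in the embodiment.
[FIG. 9] Example of the database data structure, page classification tree, and directory hierarchy in the embodiment.
[0001]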
BACKGROUND OF THE INVENTION
The present invention relates to a Web page search scheme for searching Web pages provided on the WWW (World Wide Web) and presenting them to users in units of Web sites, and in particular to an apparatus and a method for estimating, from a set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site.
[0002]
[Prior art]
The WWW stores sets of Web pages as hypermedia information with link relationships. A user reaches a Web server specified by a URL (Universal Resource Locator) and, by searching Web pages and following their links, can obtain the information in the target set of pages.
[0003]
A search engine is a means of Web information retrieval for cases where the URL is unknown or the search range is to be extended. The search engine collects a large number of Web pages by repeatedly analyzing the HTML (Hyper Text Markup Language) of a Web page and obtaining the URLs of the link targets it contains, and it extracts the desired page information when keywords or topics are specified against the collected Web pages.
[0004]
[Problems to be solved by the invention]
Web information on the WWW is organized not page by page but in units of Web sites, that is, sets of pages, and to obtain only information that fits the search purpose it is desirable to be able to search in units of Web sites. Here, a Web site means the top page (representative page) together with those pages, among the pages reachable from it by following links, that belong to the same server as the top page and are closely related to it.
[0005]
However, conventional URL-based Web page search retrieves a server and the Web pages linked to it, so obtaining the target page information together with its related page information, that is, retrieving information in units of Web sites, takes considerable effort.
[0006]
Search using a search engine, on the other hand, often collects a very large number of pages that have not been organized, selected, or processed, so it is extremely difficult for the user to extract Web-site-level information that fits the search purpose.
[0007]
As an information extraction method suited to the search purpose, the document "Citation Engine: A WWW Ranking System Using Link Analysis" (Hajime Takano and Shinya Kubo, pp. 9-16, IPSJ SIG Technical Report 2000-DBS-120 (Database Systems), Information Processing Society of Japan, January 2000) determines the page called the representative page (top page) among the collected pages from the number of link references from external servers. With this technique, however, a page that is not linked from many external servers cannot be extracted as the top page of a Web site. In addition, enough links (that is, pages) must be collected beforehand to determine the representative page.
[0008]
The method disclosed in Japanese Patent Application Laid-Open No. 9-231238 merely clusters page search results by subject analysis, so it cannot provide Web-site-level information that fits the purpose.
[0009]
If the internal structure of a Web site is known, the hierarchical relationships among retrieved pages become easier to recognize and navigation to the target page becomes easier, so the above drawbacks can be eliminated.
[0010]
As a method for estimating the internal structure of a Web site, the above document proposes representing the relationships between Web pages as a directed graph and, based on a breadth-first search of this graph, estimating a tree structure made up of the shortest paths from the top page to each page. As the same document notes, however, this method does not work well when the original graph has a structure close to a complete graph.
[0011]
Furthermore, Japanese Patent Application Laid-Open No. 2001-60165 proposes a method that determines the parent-child relationships of pages by taking link relationships between directories into account in a determination using an information tree, but because the determination rests on link relationships alone, the internal structure cannot be determined more accurately.
[0012]
An object of the present invention is to provide a Web site internal structure estimation apparatus, method, program, and recording medium that make it possible to estimate the internal structure of a Web site accurately and, on the basis of this internal structure, make it easy to retrieve information in units of Web sites that fits the search purpose.
[0013]
[Means for Solving the Problems]
To solve the above problems, the present invention first obtains, for each page and link of the retrieved Web page set, the classification likelihood of each link for each link type by using meta-information and a link classification tree, and further acquires a page classification tree in which page classes have been extracted from meta-information and page types, and extracts the page type classification likelihoods of each page. Then, for every server to which the Web pages belong, it obtains a set of top page candidates for Web sites based on these classification likelihoods and estimates their parent pages. From the set of pages whose parent page is still undecided after this estimation, it takes as a top page candidate the page that lies in the shallowest directory level and has the largest sum of top page, index page, and menu page likelihoods, and estimates parent pages from it. The estimated parent pages and the sets of pages linked to them are estimated as the internal structure of each Web site. The invention is characterized by the following apparatus, method, program, and recording medium.
[0014]
(Invention of the apparatus)
(1) An apparatus that collects a set of Web pages on the WWW and estimates, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site, comprising:
link type classification likelihood extraction means for extracting meta-information based on surface knowledge for each page and link of the page set, and for extracting the classification likelihood of each link for each link type by using a link classification tree;
page type classification likelihood extraction means for acquiring a page classification tree for the pages of the page set and, based on this page classification tree, extracting the classification likelihood of each page of the page set for each page type; and,
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, the following first to third means:
first means for acquiring a set of top page candidates for Web sites;
second means for determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate; and
third means for taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and for determining parent pages from this top page candidate based on the link type classification likelihoods;
wherein the parent pages are estimated by repeating these means.
[0015]
(2) The link type classification likelihood extraction means prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information.
[0016]
(3) The page classification likelihood extraction means acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type.
[0017]
(4) The first means repeats, for each server to which the pages of the page set belong, the following means:
- means A for estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- means B for determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- means C for determining, for each directory level, as a top page candidate the page that belongs to the directory level determined by means B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- means D for taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold.
[0018]
(5) The second means obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page.
[0019]
(Invention of the method)
(6) A method of collecting a set of Web pages on the WWW into a database and estimating, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site by information processing performed by computer software, wherein
the computer software performs:
a link type classification likelihood extraction step of extracting meta-information based on surface knowledge for each page and link of the page set, and extracting the classification likelihood of each link for each link type by using a link classification tree;
a page type classification likelihood extraction step of acquiring a page classification tree for the pages of the page set and, based on this page classification tree, extracting the classification likelihood of each page of the page set for each page type; and,
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, the following first to third steps:
a first step of acquiring a set of top page candidates for Web sites;
a second step of determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate; and
a third step of taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages from this top page candidate based on the link type classification likelihoods;
and the parent pages are estimated by repeating these steps.
[0020]
(7) The link type classification likelihood extraction step prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information.
[0021]
(8) The page classification likelihood extraction step acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type.
[0022]
(9) The first step repeats, for each server to which the pages of the page set belong, the following steps:
- step A of estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- step B of determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- step C of determining, for each directory level, as a top page candidate the page that belongs to the directory level determined in step B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- step D of taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold.
[0023]
(10) The second step obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page.
[0024]
(Invention of the program)
(11) A program that enables a computer to execute a method of collecting a set of Web pages on the WWW and estimating, from this set of Web pages, parent pages and the pages linked to them as the internal structure of each Web site, comprising:
a link type classification likelihood extraction procedure that prepares a set of Web pages to which link types have been assigned, extracts link meta-information from each page of this page set, acquires a link classification tree that has learned link classes using pairs of link meta-information and link type as training data, extracts all the meta-information of each link of the page set together with the URL and the directory level at which the page resides, obtains link meta-information for the link information extracted from each link of the page set, and obtains, based on the link classification tree, the classification likelihood for each link type from one set of meta-information;
a page classification likelihood extraction procedure that acquires, for the pages of the page set, a page classification tree that has learned page classes using a set of pairs of page type and its meta-information as training data, inputs the meta-information of each page into this page classification tree, and extracts the classification likelihood for each page type; and
for all servers to which the page set belongs, with the set of pages belonging to a server for which parent page estimation has not yet been performed taken out, a first procedure of acquiring a set of top page candidates for Web sites, a second procedure of determining the parent page of each page based on the link type classification likelihoods, starting from each top page candidate, and a third procedure of taking out, as a top page candidate, the page that belongs to the shallowest directory level among the set of pages whose parent page is undecided and that has the largest sum of top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages from this top page candidate based on the link type classification likelihoods, the parent pages being estimated by repeating these procedures;
wherein the first procedure repeats, for each server to which the pages of the page set belong, the following procedures:
- procedure A of estimating, as a top page candidate for the server name, a page that belongs to the server, is at directory level 0, and has a designated file name located at that level;
- procedure B of determining the directory level at which top page candidates exist by successively lowering the directory level examined, based on the top page type classification likelihood;
- procedure C of determining, for each directory level, as a top page candidate the page that belongs to the directory level determined in procedure B, has a file name for which files exist in lower levels, and has the largest sum of classification likelihoods for the page types; and
- procedure D of taking as top page candidates the pages that belong to the directory level one level below that directory level and whose top page classification likelihood among the page types is at or above a threshold;
the second procedure obtains a list of those link target pages of a top page candidate that are included in the collected data, judges whether the parent page of each link target page has been decided, sets the page currently being processed as the parent page when no parent page has been decided, and, when the parent page of the server to which the link target page belongs has been decided, repeats the process of taking as the parent page whichever link source has the larger trunk link likelihood, comparing the link from the current parent page with the link from the link source page,
  In the third procedure, the parent page belongs to the shallowest directory of the set of undecided pages, and the sum of the top page likelihood, the index page likelihood, and the menu page likelihood of the page type classification likelihood is maximum. The page is extracted as a top page candidate, and the parent page is estimated by repeating determining the parent page based on the link type classification likelihood from the top page candidate.
The program is characterized by causing the computer to execute each of the above procedures.
[0025]
  (Invention of recording medium)
  (12) A recording medium on which is recorded a program that enables a computer to execute a method of collecting a set of Web pages on the WWW and estimating a parent page and its linked pages from the set of Web pages as the internal structure of each Web site, wherein the program comprises:
  A link type classification likelihood extraction procedure of preparing a Web page set to which link types have been assigned, extracting link meta information from each page of this page set, creating a link classification tree in which link classes have been learned using the pairs of link meta information and link type as learning data, acquiring the meta information of every link in the page set together with the URL and the directory hierarchy in which the page exists, obtaining link meta information for the link information extracted from each link in the page set, and obtaining the classification likelihood for each link type from that set of meta information based on the link classification tree;
  For each page of the page set, the page classification tree that learned the page class is acquired using the set of page type and its meta information as learning data, and the meta information of each page is input based on this page classification tree. Page classification likelihood extraction procedure to extract classification likelihood to page type,
  With respect to all servers to which the page set belongs, a page set belonging to a server that has not been subjected to parent page estimation processing is taken out, and a first procedure for acquiring a top page candidate set of a Web site, and starting from each top page candidate, A second procedure for determining a parent page of each page based on the link type classification likelihood, and a top of the page type classification likelihood when the parent page belongs to the shallowest hierarchy of a directory from an undecided page set A third procedure for taking out a page having the maximum sum of page likelihood, index page likelihood and menu page likelihood as a top page candidate, and determining a parent page from the top page candidate based on the link type classification likelihood. Repeat to estimate the parent page,
  The first procedure is as follows for each server to which each page of the page set belongs:
  A procedure A for estimating a page belonging to the server and having a specified file name located in the hierarchy and having a directory hierarchy of 0 as a top page candidate in the server name;
  A procedure B for determining a directory hierarchy in which a top page candidate exists by sequentially lowering a directory hierarchy in which a top page candidate exists based on the top page type classification likelihood;
  A page belonging to the directory hierarchy determined in the procedure B and having a file name having a file in a lower hierarchy and having the maximum sum of classification likelihoods for the page type is set as a top page candidate for each directory hierarchy. Procedure C to determine,
  Repeating the procedure D in which a page belonging to the directory hierarchy one step below the directory hierarchy and having a top page classification likelihood of the page type is a threshold or more is set as a top page candidate,
  The second procedure obtains a linked page list included in the collected data among the linked pages of the top page candidates, determines whether or not the parent page of the linked page is determined, and the parent page is not determined. When the page currently being processed is the parent page, and the parent page of the server to which the linked page belongs is determined, the trunk link likelihood of the link from the current parent page and the link from the link source page is large Repeat the process of using the other link source as the parent page,
  In the third procedure, the parent page belongs to the shallowest directory of the set of undecided pages, and the sum of the top page likelihood, the index page likelihood, and the menu page likelihood of the page type classification likelihood is maximum. The page is extracted as a top page candidate, and the parent page is estimated by repeating determining the parent page based on the link type classification likelihood from the top page candidate.
The recording medium is characterized in that a program causing the computer to execute each of the above procedures is recorded on it.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is an internal structure estimation flow of a Web page for Web site unit search showing an embodiment of the present invention. The internal structure estimation system is constructed by a database capable of storing and managing the retrieved page data and link data and the estimated internal structure data, and a computer equipped with software necessary for executing the estimation flow of FIG.
[0027]
In FIG. 1, first, for the entire set of Web pages collected by a conventional search engine, meta information based on surface knowledge is extracted for each page and each of its links, and the classification likelihood of each link for each link type is obtained using a link classification tree and stored in the database (S1). The link classification tree, link types, and classification likelihood are described below with reference to FIG.
[0028]
The link classification tree assigns link type classification likelihoods to the links of each Web page. As shown in FIG. 2A, to obtain the link classification tree, a set of Web pages (learning pages) to which link types have been manually assigned is first prepared (S1A). The link types used here are, for example, a trunk link (a link that forms a branch of the site's tree structure) and other links (links that do not fall under that category).
[0029]
Next, link meta information is extracted from each page of the learning data (S1B). The link meta information includes, for example, the following items:
the directory distance between the link source and the link destination (the distance between directory A and directory B in the directory tree rooted at “/”, that is, the number of edges traversed when going from node A to node B by the shortest path);
whether the directory containing the link source page includes a specific word;
whether the anchor string includes a specific word;
the file name of the link source page;
the file name of the link destination page;
whether the file name of the link source page matches its directory name;
whether the file name of the link destination page matches its directory name;
whether the link tag (A tag) has a target attribute.
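The directory distance listed among the link meta information can be illustrated with a short sketch. The function name `directory_distance` and the convention of passing directories as "/"-separated path strings are assumptions made for illustration; the specification only defines the distance as the number of edges on the shortest path in the directory tree.

```python
def directory_distance(dir_a: str, dir_b: str) -> int:
    """Number of edges on the shortest path between two directories
    in a directory tree rooted at "/" (hypothetical helper)."""
    # Split into path components, ignoring empty pieces from leading/trailing "/".
    parts_a = [p for p in dir_a.split("/") if p]
    parts_b = [p for p in dir_b.split("/") if p]

    # Count how many leading components the two directories share.
    common = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        common += 1

    # Climb from A up to the deepest shared ancestor, then descend to B.
    return (len(parts_a) - common) + (len(parts_b) - common)
```

Under this reading, sibling directories such as "/a/b/" and "/a/c/" are at distance 2, and a directory is at distance 0 from itself.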
[0030]
Finally, the pairs of link meta information and link type are given to a machine learning algorithm as learning data, and the learned link classes are obtained in the form of a link classification tree (S1C).
[0031]
Page types include, for example, a top page, an index page (a page that collects directory indexes), a menu page (a page that lists contents, other than an index page), a link collection page (a page of links to other sites), and other pages (pages that do not fall into the above categories).
[0032]
As the machine learning algorithm, for example, C4.5 is used (J. Ross Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, ISBN 1-55860-238-0).
[0033]
Next, the link type classification likelihoods are acquired as shown in FIG. In the loop SL1, page meta information is extracted for every page and stored in the page database together with the directory hierarchy in which the page exists (S1D).
[0034]
The page meta information includes, for example, the following items:
whether the title string includes a specific word (such as "homepage");
whether a counter CGI (Common Gateway Interface) exists;
whether the page is a frame page;
whether the content includes a specific word (such as "this page");
whether the URL string includes a specific character string (such as "home page");
whether any link whose anchor string includes a character string such as “return” points to the parent directory;
the number of links whose anchor string includes a character string such as "Back";
the number of links to pages on external servers;
the number of links to pages in the same directory;
the number of links to pages in lower directories;
the number of links to pages in sibling directories.
[0035]
In the loop SL2, every link on the page currently being processed by the loop SL1 is handled in turn: the link information is extracted from the page and the link meta information for that link is obtained (S1E); the classification likelihood for each link type is obtained from that set of meta information based on the link classification tree (S1F); and the obtained classification likelihoods are stored in the link database (S1G).
[0036]
Through the above processing, as shown in FIG. 9, the link type classification likelihoods are stored in the link database for the links of each page. A link classification likelihood obtained here is the probability that the link is classified into a given link type, and the classification likelihoods over all link types sum to 1. For example, a certain link has a classification likelihood of 0.75 for the trunk link type and 0.25 for the other link type.
[0037]
Returning to FIG. 1, the classification likelihood of each page for each page type is then obtained using a page classification tree acquired in advance and stored in the page database (S2). The page classification tree, page types, and classification likelihood are described below with reference to FIG.
[0038]
As shown in the structural example of FIG. 9, in the page classification tree each node other than a leaf holds a classification question (for example, whether the number of links is N or more), and each leaf node holds the determined page type. An edge of the tree represents, for example, the yes/no branch taken in answer to the question at a node. As shown in FIG. 3A, to obtain the page classification tree, a set of Web pages (learning pages) to which page types have been manually assigned is first prepared (S3A). The page types used here include, for example, a top page, an index page (a page that collects directory indexes), a menu page (a page that lists contents, other than an index page), a link collection page (a page of links to other sites), and other pages (pages that do not fall into the above categories).
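The tree structure described here (questions at internal nodes, page types at leaves) can be sketched as follows. This is only an illustration: the class names `Node` and `Leaf`, the meta information keys `num_links` and `title_has_homepage`, and the questions in `EXAMPLE_TREE` are hypothetical, and the sketch assumes that each leaf stores a likelihood for every page type rather than a single label. Only the 0.75 / 0.2 / 0.05 values echo the example given in the specification; the other leaf values are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    # Classification likelihood per page type; assumed to sum to 1.
    likelihoods: Dict[str, float]


@dataclass
class Node:
    # Yes/no question asked about a page's meta information.
    question: Callable[[dict], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], meta: dict) -> Dict[str, float]:
    """Follow yes/no branches from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(meta) else node.no
    return node.likelihoods


# Hypothetical two-question tree for illustration only.
EXAMPLE_TREE = Node(
    question=lambda m: m["num_links"] >= 10,
    yes=Node(
        question=lambda m: m["title_has_homepage"],
        yes=Leaf({"top": 0.75, "menu": 0.2, "other": 0.05}),
        no=Leaf({"top": 0.1, "menu": 0.6, "other": 0.3}),
    ),
    no=Leaf({"top": 0.05, "menu": 0.15, "other": 0.8}),
)
```

Calling `classify(EXAMPLE_TREE, {"num_links": 12, "title_has_homepage": True})` walks the two "yes" branches and returns the likelihoods stored at that leaf.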
[0039]
Next, meta information based on the surface features of each page is extracted (S3B), the pairs of page meta information and page type are given to a machine learning algorithm as learning data, and the learned page classes are obtained in the form of a page classification tree (S3C). As the machine learning algorithm, for example, C4.5 described above is used.
[0040]
Next, to acquire the page type classification likelihoods, as shown by the loop SL3 in FIG. 3B, meta information based on the surface features of the page is extracted for every page (S3D), the meta information of the page is input to the page classification tree to obtain the classification likelihood for each page type (S3E), and the obtained classification likelihoods are stored in the page database (S3F).
[0041]
Through the above processing, as shown in FIG. 9, the page type classification likelihood is stored for each page in the page database. The classification likelihood obtained at this time is a probability that the page is classified into one page type, and the sum of the classification likelihoods of the respective page types in a certain page is 1. For example, the classification likelihood for a top page of a certain page is 0.75, the classification likelihood for a menu page is 0.2, and the classification likelihood for other pages is 0.05.
[0042]
Returning to FIG. 1 again, with the processes S1 and S2 described above serving as a preparation stage, the internal structure of the Web site is estimated in the loops L1 to L3.
[0043]
  First, in the loop L1, all servers are processed in turn, and the set of pages belonging to a server for which parent page estimation has not yet been performed is extracted (S3). As preprocessing for this step, the name of the Web server to which each page belongs is determined for the entire set of Web pages collected by a conventional search engine, and the server names are stored in the server name database as shown in FIG.
[0044]
Next, in the loop L2, for all pages belonging to the server extracted in step S3, the top page candidate set of the site is extracted using the Web site top page discovery method (S4); this continues until the parent page estimation process for that server is completed.
[0045]
This processing flow is shown in FIG. In FIG. 4, first, if the page database contains a page that belongs to the server being processed, has a directory hierarchy of 0, and has the file name “/” or “index.html”, that page is regarded as a top page candidate located at the highest hierarchy (S4A). For example, among the pages with server name XX in FIG. 9, a page whose directory hierarchy is “0” and whose file name is “/” or “index.html” is estimated to be a top page candidate for that server name.
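The test in step S4A can be sketched with the standard library. The helper names `split_url` and `is_root_top_candidate` are hypothetical, and the sketch assumes that the directory hierarchy of a page is the number of directory components in its URL path and that an empty file name is reported as "/". How the original system treats a path without a trailing slash (for example "http://a.com/dir1") is not stated; here it is treated as a file named "dir1" at hierarchy 0.

```python
from urllib.parse import urlsplit


def split_url(url: str):
    """Return (server name, directory hierarchy, file name) for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Everything before the last "/" is the directory, the rest is the file.
    directory, _, file_name = path.rpartition("/")
    depth = len([p for p in directory.split("/") if p])
    return parts.netloc, depth, file_name or "/"


def is_root_top_candidate(url: str) -> bool:
    """S4A-style check: hierarchy 0 and file name "/" or "index.html"."""
    _, depth, file_name = split_url(url)
    return depth == 0 and file_name in ("/", "index.html")
```

For example, `split_url("http://a.com/dir1/index.html")` gives the server name "a.com", hierarchy 1, and file name "index.html", so that page is not a hierarchy 0 candidate.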
[0046]
Next, the directory hierarchy in which the top page candidates of server name XX exist is determined (S4B). The directory hierarchy determination processing flow is shown in FIG.
[0047]
In FIG. 5, from the pages stored in the page database that have server name XX, the pages belonging to the first directory hierarchy (hierarchy 1) are extracted. Among these, the group of pages whose file name is “index.html” or “/”, or is the same as the directory name, is referenced, and the average of the classification likelihoods of this page group for the top page class is computed for hierarchy 1 (S4B1). It is then determined whether the average obtained in step S4B1 is equal to or greater than a certain value (for example, 0.6) (S4B2). If it is, the directory hierarchy in which the top page exists is determined to be “hierarchy 1” (S4B3). If the average is less than that value, it is assumed that no top page exists in hierarchy 1; the pages with server name XX that belong to the second directory hierarchy (hierarchy 2) are extracted, and the average top page class classification likelihood for hierarchy 2 is computed in the same way from the group of pages whose file name is “index.html” or “/”, or is the same as the directory name (S4B4). It is then determined whether the average obtained in step S4B4 is equal to or greater than a certain value (for example, 0.55) (S4B5). If it is, the directory hierarchy in which the top page exists is determined to be “hierarchy 2” (S4B6). If the average is less than that value, the top page directory hierarchy of server name XX is set to “0” (S4B7).
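The decision sequence S4B1 to S4B7 can be sketched as a small function. The thresholds 0.6 and 0.55 are the example values from the text; everything else is an assumption made for illustration: pages are represented as dictionaries with the hypothetical keys `depth`, `file_name`, `dir_name`, and `top` (the top page class likelihood), "same file name as the directory name" is taken as plain string equality, and a hierarchy with no matching pages is treated as failing the threshold.

```python
from statistics import mean

INDEX_NAMES = ("/", "index.html")


def is_index_like(page: dict) -> bool:
    """File name is "/" or "index.html", or equals the directory name."""
    return page["file_name"] in INDEX_NAMES or page["file_name"] == page["dir_name"]


def top_page_directory_level(pages, thresholds=((1, 0.6), (2, 0.55))) -> int:
    """Return the directory hierarchy in which top pages are assumed to exist."""
    for level, threshold in thresholds:
        # Top page likelihoods of the index-like pages at this hierarchy.
        scores = [p["top"] for p in pages
                  if p["depth"] == level and is_index_like(p)]
        # Assumption: an empty group counts as "below the threshold".
        if scores and mean(scores) >= threshold:
            return level
    # Neither hierarchy 1 nor hierarchy 2 qualified (S4B7).
    return 0
```

Passing the thresholds as a parameter keeps the two example values visible in one place; the text does not say whether deeper hierarchies are ever examined, so the sketch stops at hierarchy 2.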
[0048]
Returning to FIG. 4, after the top page directory hierarchy has been determined in step S4B, a top page is determined for each directory belonging to this top page directory hierarchy, giving priority first to a page whose file name is “index.html” or “/” and otherwise to the page with the maximum sum of the top page classification likelihood, the index page classification likelihood, and the menu page classification likelihood (S4C).
[0049]
Details of the process S4C will be described with reference to FIG. First, among all files belonging to the top page directory hierarchy, those whose file name is “/” or “index.html”, or whose sum of the top page classification likelihood, the index page classification likelihood, and the menu page classification likelihood is equal to or greater than a certain value, are extracted from the page database and sorted (S4C1). Next, in the loop SL4, the processes S4C2 to S4C7 are repeated for all directories to determine the top page of each directory. These processes are described below.
[0050]
The first URL of the URL list sorted in the process S4C1 is taken out (S4C2), and the directory name corresponding to the top page directory hierarchy is extracted from that URL (S4C3). For example, when the top page directory hierarchy is “1” and the URL is “http://a.com/dir1/index.html”, the character string “http://a.com/dir1/” is the directory name to be extracted. Next, the URLs whose leading part matches the directory name extracted in the process S4C3 are extracted from the URL list sorted in the process S4C1 (S4C4). For example, the URLs in the sorted list that begin with the character string “http://a.com/dir1/” are extracted. Next, it is determined whether a page with the file name “/” or “index.htm(l)” exists among the pages in that directory of the top page directory hierarchy (S4C5). If such a page exists, it is judged to be the highest-level page among the pages belonging to that directory and is set as the top page (S4C6). If no such page exists, the page with the maximum sum of the top page classification likelihood, the index page classification likelihood, and the menu page classification likelihood among all the pages in the directory is set as the top page (S4C7). If several pages share the maximum sum, they are sorted in URL string order and the first one is set as the top page.
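Steps S4C3 and S4C5 to S4C7 can be sketched as follows. The function names and the dictionary keys `url`, `top`, `index`, and `menu` are hypothetical. The sketch also assumes that when several pages in a directory are named "/" or "index.htm(l)", the first in URL string order is taken; the text does not cover that case.

```python
from urllib.parse import urlsplit


def directory_prefix(url: str, level: int) -> str:
    """S4C3: directory name of `url` at the given top page directory hierarchy."""
    parts = urlsplit(url)
    # Directory components of the path, without the file name.
    dirs = [d for d in parts.path.split("/")[:-1] if d]
    return f"{parts.scheme}://{parts.netloc}/" + "".join(d + "/" for d in dirs[:level])


def file_name(url: str) -> str:
    """Last path component, with an empty name reported as "/"."""
    return url.rsplit("/", 1)[1] or "/"


def choose_top_page(pages) -> str:
    """S4C5 to S4C7: pick the top page URL among the pages of one directory."""
    named = sorted(p["url"] for p in pages
                   if file_name(p["url"]) in ("/", "index.htm", "index.html"))
    if named:
        return named[0]  # S4C6 (first in URL order is an assumption)

    def score(p):
        return p["top"] + p["index"] + p["menu"]

    best = max(score(p) for p in pages)
    # S4C7: ties are broken by URL string order.
    return min(p["url"] for p in pages if score(p) == best)
```

With the example from the text, `directory_prefix("http://a.com/dir1/index.html", 1)` yields "http://a.com/dir1/", which can then be used to group the sorted URLs by directory before calling `choose_top_page` on each group.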
[0051]
Returning to FIG. 4 again, following step S4C, pages that belong to the directory hierarchy one level below the top page directory hierarchy and whose top page classification likelihood is equal to or greater than a threshold are also set as top pages (S4D). As shown in FIG. 7, in the details of this process S4D, the pages that belong to the directories one level below the top page directory hierarchy and whose top page class classification likelihood is equal to or greater than the threshold are extracted from the database (S4D1) and set as top pages (S4D2).
[0052]
Returning to FIG. 1, in the loop L3, for each of the top page candidates acquired in the process S4, the set of pages whose parent page has not yet been estimated is extracted (S5), and the parent pages of that set are determined (S6).
[0053]
This parent page determination processing flow is shown in FIG. In the figure, the extracted top page is put in a queue (S6A). Until the queue becomes empty, one page is taken out of the queue (S6B), and the list of its link destination pages that are included in the collected data is obtained (S6C). For each link destination page, it is determined whether its parent page has already been determined (S6D). If the parent page has not been determined, the page currently being processed is set as the parent page (S6E). If the parent page has already been determined, the link from the current parent page and the link from the page currently being processed (the link source page) are compared, and the link source with the larger trunk link likelihood is set as the parent page (S6F). After the parent page has been determined in S6E or S6F, the link destination page is put in the queue (S6G), and the same processing is repeated for the next page in the queue, so that the parent pages of the pages reachable from each top page candidate are estimated.
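The queue-based flow S6A to S6G can be sketched as a breadth-first traversal. The function name `assign_parents` and the data layout (a dictionary of link lists and a dictionary of trunk link likelihoods keyed by `(source, destination)`) are hypothetical. Two details are assumptions added so that the sketch terminates on cyclic link graphs, and they are not stated in the text: each page is enqueued at most once, and links pointing back to the top page or to the page itself are skipped.

```python
from collections import deque


def assign_parents(top_page, links, trunk_likelihood):
    """Estimate a parent for every page reachable from `top_page`.

    links: dict mapping a page to its link destination pages (collected data only).
    trunk_likelihood: dict mapping (source, destination) to the trunk link likelihood.
    """
    parent = {}
    queue = deque([top_page])          # S6A
    queued = {top_page}                # assumption: enqueue each page once
    while queue:
        page = queue.popleft()         # S6B
        for dst in links.get(page, []):  # S6C
            if dst == top_page or dst == page:
                continue               # assumption: ignore back/self links
            current = parent.get(dst)  # S6D
            if current is None:
                parent[dst] = page     # S6E
            elif trunk_likelihood[(page, dst)] > trunk_likelihood[(current, dst)]:
                parent[dst] = page     # S6F: stronger trunk link wins
            if dst not in queued:
                queued.add(dst)
                queue.append(dst)      # S6G
    return parent
```

For instance, if both "a" and "b" link to "c" and the link from "b" has the larger trunk link likelihood, "c" first receives "a" as its parent and is then reassigned to "b" when "b" is processed.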
[0054]
Returning to FIG. 1 again, from the set of pages that were not reached from the top page candidate set and whose parent page is still undetermined, the page that exists in the shallowest directory hierarchy and has the largest sum of the top page, index page, and menu page classification likelihoods is extracted as a top page candidate (S7), and parent pages are determined starting from each such top page candidate by the same process as the process S6 (S8).
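The selection in step S7 can be sketched as below. The function name `pick_candidate` and the dictionary keys `depth`, `top`, `index`, and `menu` are hypothetical, and the tie-breaking (the first page encountered wins) is an assumption, since the text does not specify it for this step.

```python
def pick_candidate(undecided):
    """S7: among pages without a parent, choose the next top page candidate."""
    if not undecided:
        return None
    # Restrict to the shallowest directory hierarchy present.
    shallowest = min(p["depth"] for p in undecided)
    pool = [p for p in undecided if p["depth"] == shallowest]
    # Take the page with the largest top + index + menu likelihood sum.
    return max(pool, key=lambda p: p["top"] + p["index"] + p["menu"])
```

The returned page would then be used as the starting point for the same parent determination as in S6, and the selection repeated until no undetermined pages remain.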
[0055]
In the present invention, it is possible to distribute a program that causes a computer to execute part or all of the method shown in FIG. The program can also be recorded on a computer-readable recording medium such as an FD (floppy disk: registered trademark), MO, ROM, memory card, CD, DVD, or removable disk, and provided and distributed in that form.
[0056]
[Effect of the Invention]
As described above, according to the present invention, for each page and link of the retrieved Web page set, the classification likelihood of each link for each link type is acquired using the meta information and the link classification tree; a page classification tree in which page classes have been learned from meta information and page types is acquired, and the classification likelihood of each page for each page type is extracted; and, for all servers to which the Web pages belong, the top page candidate set of each Web site is obtained on the basis of these classification likelihoods and the parent pages are estimated. In this estimation, a page that exists in the shallowest directory hierarchy among the pages whose parent page is undetermined and that has the largest sum of the top page, index page, and menu page likelihoods is also taken as a top page candidate, and parent pages are estimated from it. Because the estimated parent pages and their linked pages are obtained as the internal structure of each Web site, the internal structure of a Web site can be estimated accurately, and information retrieval in units of Web sites that matches the search purpose becomes possible on the basis of that internal structure.
[Brief description of the drawings]
FIG. 1 is an internal structure estimation flow showing an embodiment of the present invention.
FIG. 2 is a flow of link classification tree acquisition and classification likelihood acquisition in the embodiment.
FIG. 3 is a flowchart of page classification tree acquisition and classification likelihood acquisition in the embodiment.
FIG. 4 is a flow of acquiring a top page candidate in the embodiment.
FIG. 5 is a flowchart of determining a top page directory hierarchy in the embodiment.
FIG. 6 is a top page determination flow for each directory in the embodiment.
FIG. 7 is a top page determination flow for each directory in the embodiment.
FIG. 8 is a parent page determination flow in the embodiment.
FIG. 9 shows an example of a database data structure, a page classification tree, and a directory hierarchy in the embodiment.

Claims

A device that collects a set of Web pages on the WWW and estimates a parent page and a linked page from the set of Web pages as an internal structure for each Web site.
Link type classification likelihood extraction means for extracting meta information based on surface knowledge for each page and link of the page set and extracting the classification likelihood of each link for each link type using a link classification tree;
A page type classification likelihood extracting unit that obtains a page classification tree of each page of the page set and extracts a classification likelihood to a page type of each page of the page set based on the page classification tree;
For all servers to which the page set belongs, take out the page set belonging to the server that has not been subjected to parent page estimation processing, and the following first to third means:
A first means for acquiring a top page candidate set of a website;
A second means for determining a parent page of each page based on the link type classification likelihood, starting from each top page candidate;
The page whose parent page belongs to the shallowest hierarchy of the directory from the set of undecided pages and has the largest sum of the top page likelihood, the index page likelihood, and the menu page likelihood of the page type classification likelihood is set as a top page candidate. A third means for taking out and determining a parent page from the top page candidate based on the link type classification likelihood;
An internal structure estimating device for a Web site, wherein the parent page is estimated by repeating the above.

The link type classification likelihood extracting means prepares a web page set to which a link type is assigned, extracts link meta information from each page of the page set, and learns a set of link meta information and link type. The link classification tree having learned the link class is obtained, all the meta information of each link of the page set, the URL and the directory hierarchy in which the page exists are extracted, and the link to the link information extracted from each link of the page set 2. The internal structure estimation apparatus for a Web site according to claim 1, wherein the meta likelihood of each link type is obtained from a set of meta information based on a link classification tree.

The page classification likelihood extracting means obtains a page classification tree obtained by learning a page class for each page of the page set by using a set of a page type and its meta information as learning data, and based on the page classification tree, 3. The internal structure estimation apparatus for a Web site according to claim 1, wherein meta information of the page is input and a classification likelihood for each page type is extracted.

The first means includes the following means for each server to which each page of the page set belongs:
A means A for estimating a page belonging to the server and having a specified file name located in the hierarchy and having a directory hierarchy of 0 as a top page candidate in the server name;
Means B for determining the directory hierarchy in which the top page candidates exist by sequentially lowering the directory hierarchy in which the top page candidates exist based on the top page type classification likelihood;
A page belonging to the directory hierarchy determined by the means B and having a file name in which a file exists in a lower hierarchy and having the maximum sum of classification likelihoods for the page type is set as a top page candidate for each directory hierarchy. Means C for determining;
Means D for setting, as a top page candidate, a page that belongs to the directory hierarchy one level below that directory hierarchy and whose top page classification likelihood for the page type is equal to or greater than a threshold;
4. The internal structure estimation apparatus for a Web site according to claim 1, wherein:

The second means obtains a linked page list included in the collected data among the linked pages of the top page candidates, determines whether or not the parent page of the linked page is determined, and the parent page is not determined. When the page currently being processed is the parent page and the parent page of the server to which the linked page belongs is determined, the trunk link likelihood of the link from the current parent page and the link from the link source page is large 5. The internal structure estimation apparatus for a Web site according to claim 1, wherein the process of setting the link source as a parent page is repeated.

A method for collecting a set of Web pages on the WWW in a database and estimating, by information processing performed by computer software, a parent page and its linked pages from the set of Web pages as the internal structure of each Web site,
wherein the computer software performs:
A link type classification likelihood extraction process in which meta information based on surface knowledge is extracted for each page and link of the page set, and the classification likelihood of each link for each link type is extracted using a link classification tree;
A page type classification likelihood extraction process for obtaining a page classification tree for each page of the page set and extracting a classification likelihood to a page type of each page of the page set based on the page classification tree;
For all servers to which the page set belongs, a page set belonging to a server that has not been subjected to parent page estimation processing is extracted, and the following first to third steps:
A first process of acquiring a top page candidate set for a website;
Starting from each of the top page candidates and determining a parent page of each page based on the link type classification likelihood;
The page whose parent page belongs to the shallowest hierarchy of the directory from the set of undecided pages and has the largest sum of the top page likelihood, the index page likelihood, and the menu page likelihood of the page type classification likelihood is set as a top page candidate. A third step of taking out and determining a parent page from the top page candidate based on the link type classification likelihood;
A method for estimating the internal structure of a Web site, wherein the parent page is estimated by repeating.

The link type classification likelihood extraction process prepares a Web page set to which a link type is assigned, extracts link meta information from each page of the page set, and learns a set of link meta information and link type. The link classification tree having learned the link class is obtained, all the meta information of each link of the page set, the URL and the directory hierarchy in which the page exists are extracted, and the link to the link information extracted from each link of the page set 7. The method for estimating an internal structure of a Web site according to claim 6, wherein the meta likelihood of each link type is obtained from a set of meta information based on a link classification tree.

8. The method for estimating the internal structure of a Web site according to claim 6 or 7, wherein the page type classification likelihood extraction process obtains, for each page of the page set, a page classification tree that has learned page classes using pairs of page type and page meta information as learning data, inputs the meta information of each page into the page classification tree, and extracts the classification likelihood for each page type.
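
The claims do not state how a classification likelihood is read from a classification tree. One common approach, shown here purely as an assumption, is to descend the tree to a leaf and use the relative frequency of each class among the training samples that reached that leaf. The tree layout (nested dictionaries with `feature`, `threshold`, `yes`, `no`, and `counts` keys) and the function name `classify_likelihoods` are hypothetical.

```python
def classify_likelihoods(tree, meta):
    """Return {type: likelihood} for one page or link described by `meta`.

    `tree` is a hypothetical nested-dict decision tree. Internal nodes test
    one meta-information feature against a threshold; leaves hold the class
    counts observed in the learning data.
    """
    node = tree
    while "counts" not in node:
        value = meta.get(node["feature"], 0)
        node = node["yes"] if value >= node["threshold"] else node["no"]
    total = sum(node["counts"].values())
    # Assumption: likelihood = share of each class at the reached leaf.
    return {label: n / total for label, n in node["counts"].items()}


# Example tree with a single test on a hypothetical "directory_depth" feature.
example_tree = {
    "feature": "directory_depth",
    "threshold": 1,
    "no": {"counts": {"top": 8, "content": 2}},
    "yes": {"counts": {"top": 1, "content": 9}},
}
```

The same function can serve for both link types and page types, since only the learning data and the class labels differ.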

The method for estimating the internal structure of a Web site according to any one of claims 6 to 8, wherein the first process includes the following processes for each server to which the pages of the page set belong:
A process A in which a page that belongs to the server, is located at directory hierarchy 0, and has a specified file name is estimated as a top page candidate for the server name;
A process B in which the directory hierarchy where top page candidates exist is determined by descending the directory hierarchy one level at a time based on the top page type classification likelihood;
A process C in which a page that belongs to the directory hierarchy determined in process B, has a file name for which a file exists in a lower hierarchy, and has the maximum sum of classification likelihoods for the page types is determined as the top page candidate for each directory hierarchy; and
A process D in which a page that belongs to the directory hierarchy one level below that directory hierarchy and whose top page classification likelihood among the page types is equal to or greater than a threshold value is set as a top page candidate.
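
Processes A and D above can be sketched as follows. Processes B and C are omitted because their exact criteria are not recoverable from the claim wording alone. The set `SPECIFIED_FILE_NAMES`, the helper `split_path`, and the functions `process_a` and `process_d` are hypothetical, and the concrete file names are an assumption, since the claim only says "a specified file name".

```python
from urllib.parse import urlparse

# Assumption: which file names count as "specified"; "" stands for a bare "/".
SPECIFIED_FILE_NAMES = {"", "index.html", "index.htm"}


def split_path(url):
    """Split a URL path into (list of directories, file name)."""
    path = urlparse(url).path
    dirs = [s for s in path.split("/") if s]
    file_name = ""
    if dirs and not path.endswith("/"):
        file_name = dirs.pop()
    return dirs, file_name


def process_a(urls, server):
    """Pages on `server` at directory hierarchy 0 with a specified file name."""
    candidates = []
    for url in urls:
        if urlparse(url).netloc != server:
            continue
        dirs, file_name = split_path(url)
        if not dirs and file_name in SPECIFIED_FILE_NAMES:
            candidates.append(url)
    return candidates


def process_d(urls, top_likelihood, base_depth, threshold):
    """Pages one level below `base_depth` whose top page likelihood >= threshold.

    `urls` is expected to hold pages of a single server only.
    """
    return [
        url for url in urls
        if len(split_path(url)[0]) == base_depth + 1
        and top_likelihood.get(url, 0.0) >= threshold
    ]
```

Process D lets a site hosted in a subdirectory of a shared server be found even when the server root has its own top page.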

The method for estimating the internal structure of a Web site according to any one of claims 6 to 9, wherein the second process repeats the following: obtaining, from among the pages linked from a top page candidate, the list of linked pages included in the collected data; determining whether the parent page of each linked page has been determined; when the parent page has not been determined, setting the page currently being processed as the parent page; and when the linked page belongs to the server and its parent page has already been determined, comparing the trunk link likelihood of the link from the current parent page with that of the link from the link source page, and setting the link source with the larger likelihood as the parent page.
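
The parent determination rule of the second process can be sketched as below. The breadth-first traversal order, the function name `assign_parents`, and the data layout (a `links` dictionary, a `trunk_likelihood` dictionary keyed by `(source, target)`, and a `collected` set) are assumptions for illustration. The sketch does not guard against reassignments that would form a cycle, which a complete implementation would have to handle.

```python
from collections import deque


def assign_parents(top_candidate, links, trunk_likelihood, collected, parents=None):
    """Assign a parent page to every page reachable from `top_candidate`.

    links:            {source_url: [target_url, ...]}
    trunk_likelihood: {(source_url, target_url): trunk link likelihood}
    collected:        set of URLs present in the collected data
    parents:          existing {page: parent} map to extend (optional)
    """
    parents = {} if parents is None else parents
    queue = deque([top_candidate])
    visited = {top_candidate}
    while queue:
        current = queue.popleft()
        for target in links.get(current, []):
            # Only linked pages inside the collected data are considered.
            if target not in collected or target in (current, top_candidate):
                continue
            if target not in parents:
                # Parent undetermined: the page being processed becomes the parent.
                parents[target] = current
            else:
                # Parent already determined: keep the link source whose link
                # has the larger trunk link likelihood.
                old = parents[target]
                new_score = trunk_likelihood.get((current, target), 0.0)
                old_score = trunk_likelihood.get((old, target), 0.0)
                if new_score > old_score:
                    parents[target] = current
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return parents
```

Passing the `parents` map from one top page candidate to the next lets later candidates take over a page only when their link to it has a larger trunk link likelihood.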

A program that enables a computer to execute a method of collecting a set of Web pages on the WWW and estimating, from the set of Web pages, parent pages and their linked pages as the internal structure of each Web site, the program comprising:
A link type classification likelihood extraction procedure that prepares a Web page set to which link types have been assigned, extracts link meta information from each page of that page set, creates a link classification tree that has learned link classes using pairs of link meta information and link type as learning data, acquires all the meta information of each link in the page set together with the URL and the directory hierarchy in which the page exists, obtains link meta information from the link information extracted from each link in the page set, and obtains the classification likelihood for each link type from the set of meta information based on the link classification tree;
A page type classification likelihood extraction procedure that acquires, for each page of the page set, a page classification tree that has learned page classes using pairs of page type and page meta information as learning data, inputs the meta information of each page into the page classification tree, and extracts the classification likelihood for each page type; and
A procedure that, for all servers to which the page set belongs, takes out the page set belonging to a server that has not yet undergone parent page estimation processing, and estimates parent pages by repeating: a first procedure for acquiring a set of top page candidates for a Web site; a second procedure for determining, starting from each top page candidate, the parent page of each page based on the link type classification likelihood; and a third procedure for taking out, as a top page candidate, the page that, among the set of pages whose parent page is undetermined, belongs to the shallowest directory hierarchy and has the maximum sum of the top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages starting from that top page candidate based on the link type classification likelihood,
The first procedure repeats the following for each server to which the pages of the page set belong:
A procedure A for estimating, as a top page candidate for the server name, a page that belongs to the server, is located at directory hierarchy 0, and has a specified file name;
A procedure B for determining the directory hierarchy in which top page candidates exist by descending the directory hierarchy one level at a time based on the top page type classification likelihood;
A procedure C for determining, as the top page candidate for each directory hierarchy, a page that belongs to the directory hierarchy determined in procedure B, has a file name for which a file exists in a lower hierarchy, and has the maximum sum of classification likelihoods for the page types; and
A procedure D for setting, as a top page candidate, a page that belongs to the directory hierarchy one level below that directory hierarchy and whose top page classification likelihood among the page types is equal to or greater than a threshold value,
The second procedure repeats the following: obtaining, from among the pages linked from a top page candidate, the list of linked pages included in the collected data; determining whether the parent page of each linked page has been determined; when the parent page has not been determined, setting the page currently being processed as the parent page; and when the linked page belongs to the server and its parent page has already been determined, comparing the trunk link likelihood of the link from the current parent page with that of the link from the link source page, and setting the link source with the larger likelihood as the parent page,
The third procedure takes out, as a top page candidate, the page that, among the set of pages whose parent page is undetermined, belongs to the shallowest directory hierarchy and has the maximum sum of the top page likelihood, the index page likelihood, and the menu page likelihood among the page type classification likelihoods, and estimates parent pages by repeatedly determining parent pages starting from that top page candidate based on the link type classification likelihood.
A program that causes a computer to execute each of the above procedures.

A recording medium on which is recorded a program that collects a set of Web pages on the WWW and estimates, from the set of Web pages, parent pages and their linked pages as the internal structure of each Web site, the program comprising:
A link type classification likelihood extraction procedure that prepares a Web page set to which link types have been assigned, extracts link meta information from each page of that page set, creates a link classification tree that has learned link classes using pairs of link meta information and link type as learning data, acquires all the meta information of each link in the page set together with the URL and the directory hierarchy in which the page exists, obtains link meta information from the link information extracted from each link in the page set, and obtains the classification likelihood for each link type from the set of meta information based on the link classification tree;
A page type classification likelihood extraction procedure that acquires, for each page of the page set, a page classification tree that has learned page classes using pairs of page type and page meta information as learning data, inputs the meta information of each page into the page classification tree, and extracts the classification likelihood for each page type; and
A procedure that, for all servers to which the page set belongs, takes out the page set belonging to a server that has not yet undergone parent page estimation processing, and estimates parent pages by repeating: a first procedure for acquiring a set of top page candidates for a Web site; a second procedure for determining, starting from each top page candidate, the parent page of each page based on the link type classification likelihood; and a third procedure for taking out, as a top page candidate, the page that, among the set of pages whose parent page is undetermined, belongs to the shallowest directory hierarchy and has the maximum sum of the top page likelihood, index page likelihood, and menu page likelihood among the page type classification likelihoods, and determining parent pages starting from that top page candidate based on the link type classification likelihood,
The first procedure repeats the following for each server to which the pages of the page set belong:
A procedure A for estimating, as a top page candidate for the server name, a page that belongs to the server, is located at directory hierarchy 0, and has a specified file name;
A procedure B for determining the directory hierarchy in which top page candidates exist by descending the directory hierarchy one level at a time based on the top page type classification likelihood;
A procedure C for determining, as the top page candidate for each directory hierarchy, a page that belongs to the directory hierarchy determined in procedure B, has a file name for which a file exists in a lower hierarchy, and has the maximum sum of classification likelihoods for the page types; and
A procedure D for setting, as a top page candidate, a page that belongs to the directory hierarchy one level below that directory hierarchy and whose top page classification likelihood among the page types is equal to or greater than a threshold value,
The second procedure repeats the following: obtaining, from among the pages linked from a top page candidate, the list of linked pages included in the collected data; determining whether the parent page of each linked page has been determined; when the parent page has not been determined, setting the page currently being processed as the parent page; and when the linked page belongs to the server and its parent page has already been determined, comparing the trunk link likelihood of the link from the current parent page with that of the link from the link source page, and setting the link source with the larger likelihood as the parent page,
The third procedure takes out, as a top page candidate, the page that, among the set of pages whose parent page is undetermined, belongs to the shallowest directory hierarchy and has the maximum sum of the top page likelihood, the index page likelihood, and the menu page likelihood among the page type classification likelihoods, and estimates parent pages by repeatedly determining parent pages starting from that top page candidate based on the link type classification likelihood.
A recording medium on which a program for causing a computer to execute each procedure is recorded.