JP2004264927A

JP2004264927A - Web site retrieval method and device, web site retrieval program, and storage medium recording the program

Info

Publication number: JP2004264927A
Application number: JP2003052315A
Authority: JP
Inventors: Kenichi Mori; 憲一森
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-02-28
Filing date: 2003-02-28
Publication date: 2004-09-24

Abstract

PROBLEM TO BE SOLVED: To enable a searcher to know whether the information in a Web site has been significantly updated and to select a result suited to the purpose.

SOLUTION: A site retrieval engine 1 collects Web pages and divides them into sites. A site retrieval device, which has the site retrieval engine 1 and a site retrieval result generation part 2, or a site manager information management server 4 calculates the update amount of each site and stores it in a site DB 3 in advance. In response to a retrieval request from the searcher, the site retrieval engine 1 performs a site retrieval to acquire each hit page and the ID of the site to which it belongs, calculates the retrieval score of each site, sorts the sites in descending order of score, and delivers the site retrieval information to the site retrieval result generation part 2. The site retrieval result generation part 2 acquires from the site DB 3 the update amounts of the sites included in the site retrieval information, sorts the sites in descending order of update amount to generate a site retrieval result, and presents the result to the searcher.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ＷＷＷ（ワールド・ワイド・ウェブ）上のＷｅｂ情報をサイト単位で検索者に提供するためのＷｅｂ情報の検索方法および装置に係り、特に更新量に基づきサイト検索結果のランキングを行う機能を持つＷｅｂサイト検索方法および装置に関する。
【０００２】
【従来の技術】
ＷＷＷは、Ｗｅｂページ集合をリンク関係を持たせたハイパーメディア情報として蓄積しており、検索者はＵＲＬ（ユニバーサル・リソース・ロケータ）で規定されるＷｅｂサーバにたどり着き、Ｗｅｂページの検索とそのリンク関係を辿ることで目的とするＷｅｂページ集合の情報を取得することができる。
【０００３】
上記のＵＲＬが不明な場合や検索範囲を拡張する場合のＷｅｂ情報の提供システムとして、検索エンジンを利用したサービスがある。検索エンジンは、ＷＷＷページを自動収集するロボット型検索エンジンにより、ハイパーテキストのリンクを辿りながら多数のＷｅｂページを収集しサイトに分けてデータベースを構築しておき、Ｗｅｂサイトの情報を検索者にサービス提供している。
【０００４】
上記のＷｅｂ情報提供システムが提供するＷｅｂサイトにおいて、その内容が更新されたか否かの表示がない場合、検索者は、一度検索し、参照したサイトの内容が前回の訪問から更新されているかが分からず、再度訪問する意欲を喪失してしまう。これでは、サイト運営上で情報の提供効果が上がらないし、検索者は有効な情報取得機会を逸してしまう問題があった。
【０００５】
このようなシステムに対して、Ｗｅｂ情報の更新時期を提示しようとするシステムがある。その例として、下記の非特許文献１に示すｆｒｅｓｈｅｙｅがある。ｆｒｅｓｈｅｙｅでは、ファイルのタイムスタンプを見て、その順に結果を表示する機能を持つ。また、更新が比較的最近発生した情報のみを検索する新着検索機能を持つ。
【０００６】
【非特許文献１】
ｆｒｅｓｈｅｙｅ（インターネットＵＲＬ：ｈｔｔｐ：／／ｗｗｗ．ｆｒｅｓｈｅｙｅ．ｃｏ．ｊｐ／）、平成１５年１月７日インターネット検索
【０００７】
【発明が解決しようとする課題】
しかしながら、ｆｒｅｓｈｅｙｅが持つ、それらの機能は、ファイルのタイムスタンプ情報のみを見ているため、そのファイルに対する更新がどの程度大きいものであるかを示すことはできないという問題がある。また、それらの機能による方法では、サイト全体でどの程度の更新があったかを知ることはできないという問題がある。
【０００８】
本発明は、上記従来技術の問題を解決するためになされたものであり、サイト検索において、検索者が、サイトの情報に大きな更新があったかどうかを検索結果から知り、目的にあった結果を選択できるようにした、サイト検索の技術を提供することが課題である。
【０００９】
【課題を解決するための手段】
上記の課題を解決するため、本発明は、事前にＷｅｂページを収集してサイトに分け、サイト毎に当該サイトの更新量を計算し、サイトデータベースに保存しておく過程と、検索者の検索要求に基づくサイト検索端末装置からの検索要求に応じてページ検索を行い、ヒットしたページについてサイトデータベースに対してサイト検索を行い、複数のサイトを含むサイト検索情報を取得する過程と、前記サイト検索情報に含まれる複数のサイトの更新量を前記サイトデータベースから取得する過程と、前記サイト検索情報を前記取得したサイトの更新量の大きい順にソートする過程と、前記ソートしたサイト検索情報を基にサイト検索結果を生成し前記サイト検索端末装置に送信する過程とを、備えることを特徴とするＷｅｂサイト検索方法を、その解決の手段とする。
【００１０】
あるいは、上記のＷｅｂサイト検索方法において、前記サイトの更新量を計算し、サイトデータベースに保存しておく過程では、更新前、更新後のサイト内のファイル名リストを比較し、両方のリストに出現するファイルの数を更新後のファイル数で割った数により、または更新されたサイト内のファイルのうち、更新フラグが１のファイル数を全ファイル数で割った数により、サイトの更新量を計算することを特徴とするＷｅｂサイト検索方法を、その解決の手段とする。
【００１１】
あるいは、上記のＷｅｂサイト検索方法において、サイト検索情報を取得する過程では、サイト検索への適合度合いを示す検索スコアを計算し、サイト検索情報に含めることとし、前記ソートする過程では、更新量が同一の複数のサイトが存在する場合、前記検索スコアを基にソートするか、または既に前記検索スコアを基にソートされたサイト検索情報を出力することを特徴とするＷｅｂサイト検索方法を、その解決の手段とする。
【００１２】
あるいは、上記のＷｅｂサイト検索方法において、前記ソートする過程では、サイト検索結果に各サイトの更新量を併せて提示することを特徴とするＷｅｂサイト検索方法を、その解決の手段とする。
【００１３】
あるいは、事前に収集されたＷｅｂページがサイト分けされ、サイト毎に計算された当該サイトの更新量を保存するサイトデータベースと、検索者の検索要求に基づくサイト検索端末装置からの検索要求に応じてページ検索を行い、ヒットしたページについて前記サイトデータベースに対してサイト検索を行い、複数のサイトを含むサイト検索情報を取得するサイト検索エンジンと、前記サイト検索情報に含まれる複数のサイトの更新量を前記サイトデータベースから取得する手段と、前記サイト検索情報を前記取得したサイトの更新量の大きい順にソートする手段と、前記ソートされたサイト検索情報を基にサイト検索結果を生成し、前記サイト検索端末装置に出力する手段と、を備えることを特徴とするＷｅｂサイト検索装置を、その解決の手段とする。
【００１４】
あるいは、上記のＷｅｂサイト検索装置において、前記サイトの更新量は、更新前、更新後のサイト内のファイル名リストを比較し、両方のリストに出現するファイルの数を更新後のファイル数で割った数により、または更新されたサイト内のファイルのうち、更新フラグが１のファイル数を全ファイル数で割った数により、計算された値であることを特徴とするＷｅｂサイト検索装置を、その解決の手段とする。
【００１５】
あるいは、上記のＷｅｂサイト検索装置において、前記ソートする手段は、更新量が同一の複数のサイトが存在する場合、サイト検索エンジンで計算されたサイト検索への適合度合いを示す検索スコア、またはサイト検索端末装置で計算されたサイト検索への適合度合いを示す検索スコアもとにソートするか、または、サイト検索エンジンで計算されたサイト検索への適合度合いを示す検索スコアによりソートされたサイト検索情報を出力するものであることを特徴とするＷｅｂサイト検索装置を、その解決の手段とする。
【００１６】
あるいは、上記のＷｅｂサイト検索装置において、前記ソートする手段は、ソートされた検索結果に各サイトの更新量を併せて提示するものであることを特徴とするＷｅｂサイト検索装置を、その解決の手段とする。
【００１７】
あるいは、上記のＷｅｂサイト検索方法における過程を、コンピュータに実行させるためのプログラムとしたことを特徴とするＷｅｂサイト検索プログラムを、その解決の手段とする。
【００１８】
あるいは、上記のＷｅｂサイト検索方法における過程を、コンピュータに実行させるためのプログラムとし、前記プログラムを、前記コンピュータが読み取りできる記録媒体に記録したことを特徴とするＷｅｂサイト検索プログラムを記録した記録媒体を、その解決の手段とする。
【００１９】
本発明では、事前に収集されたＷｅｂページをサイト分けし、各サイトの更新量を計算して保存しておき、サイト検索装置がサイト検索結果に含まれるサイトの更新量を取得してサイト検索結果を更新量の大きいサイト順にソートして検索者に提示することにより、検索者が、サイトの情報に大きな更新があったかどうかをサイト検索結果から知ることができるようにし、検索者が目的にあった結果を選択できるようにする。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態について図を用いて詳細に説明する。
【００２１】
図１に、本実施形態例によるＷｅｂサイト検索システムの構成を示す。本システムは、サイト検索装置の主要部としてサイト検索エンジン１と、サイト検索結果生成部２と、サイトデータベース（以下、データベースはＤＢと記す）３とを有するとともに、その他にサイト運営者情報管理サーバ４とを備える。
【００２２】
サイト検索エンジン１は、図略のサイト検索端末装置からの検索要求に応じてページ検索を行い、ヒットページについてサイトＤＢ３に対してサイト検索を行い、ヒットページＩＤとそのページのサイトＩＤを含む複数のサイト検索情報をサイト検索結果生成部２に渡す。
【００２３】
サイト検索結果生成部２は、サイト検索エンジン１からサイト検索した複数のサイトＩＤを含むサイト検索情報を受け取る手段と、事前にサイトＩＤ毎に更新量を計算し保存されているサイトＤＢ３からサイト検索情報に含まれる複数のサイトＩＤの更新量を取得する手段と、サイト検索情報を取得したサイトＩＤの更新量の大きい順にソートする手段と、ソートされたサイト検索情報からサイト検索結果を生成する手段と、生成した検索結果を当該サイト検索要求を発したサイト検索端末装置に送信する手段とを備える。
【００２４】
なお、サイト検索端末装置としては、検索者の検索要求に基づいてサイト検索エンジン１に対して検索要求を行う手段と、サイト検索結果生成部２で生成されたサイト検索結果を受信して表示装置等の出力装置に出力する出力手段とを備える。
【００２５】
サイトＤＢ３は、事前に収集したＷｅｂページをサイト分けし、サイトＩＤ毎に計算された当該サイトの更新量を保存しておき、サイト検索結果生成部２からの求めに応じてそのサイトの更新量の情報を提供する。この更新量の計算と保存は、必要なデータを取得することにより、サイト検索エンジン１や、サイト運営者情報管理サーバ４などが行う例が考えられるが、その他の装置が行っても構わない。
【００２６】
サイト運営者情報管理サーバ４は、サイト運営者が提供するＷｅｂページの作成や更新を管理し、サイト検索エンジン１にＷｅｂページの更新の有無を通知したり、必要によりサイトＤＢ３にサイトの更新量を計算して格納したりする。
【００２７】
次に、本システムで実行される本実施形態例によるＷｅｂサイト検索方法の処理手順を示す。
【００２８】
本方法は、事前処理と、サイト検索時の処理とからなる。以下、図１を参照しながら各処理手順を説明する。
【００２９】
＜事前処理＞
図２は、本実施形態例によるＷｅｂサイト検索方法における事前処理全体の処理手順を示すフローチャートである。Ｓ１１〜Ｓ１４は処理のステップを表す。事前処理は、サイト検索時の処理の前に独立に行われる。
【００３０】
まず、事前処理を行うサーバ等の装置により、大量のＷｅｂページを収集する（Ｓ１１）。本例では、事前処理を行うサーバ等の装置として、サイト検索装置が行うこととする。
【００３１】
サイト検索装置は、収集したＷｅｂページをサイト別に分ける（Ｓ１２）。このサイト分けでは、大量のＷｅｂページ集合から、Ｗｅｂサイトのトップページを推定し、この推定トップページと、それにリンクしたページからなるページ集合をサイトと決定する。
【００３２】
これらトップページ推定とサイト決定には、例えば、本願出願人が既に提案している技術（例えば、特願２００１−３８９４４７、特願２００１−３８９４４８）を利用することができる。
【００３３】
サイトのトップページ推定は、検索したＷｅｂページをサーバ別に分類しておき、メタ情報とページタイプからページクラスを抽出したページ分類木を獲得しておき、この分類木を基に各ページのページタイプの分類尤度を抽出しておき、同じサーバに属するページについて、ディレクトリ階層が０に位置するページを最優先でトップページとして推定し、階層０にトップページが存在しない場合にはトップページタイプの分類尤度を基にトップページが存在するディレクトリ階層を決定し、このディレクトリ階層に所属して下位階層にファイルが存在し、かつページタイプへの分類尤度が最大さらには閾値以上のページをトップページとして推定する。
【００３４】
また、サイトの決定は、検索したＷｅｂページ集合の各ページ、リンクについてメタ情報とリンク分類木を利用してリンクの各リンクタイプへの分類尤度を獲得しておき、さらにメタ情報とページタイプからページクラスを抽出したページ分類木を獲得して各ページのページタイプの分類尤度を抽出しておき、その後、Ｗｅｂページが属する全てのサーバについて、上記の分類尤度を基にしてＷｅｂサイトのトップページ候補集合を得てそれらの親ページを推定し、この推定でも親ページが未決のページ集合の中からディレクトリの最も浅い階層に存在しかつトップページ、インデックスページ、メニューページ尤度の和が最大のページをトップページ候補としてそれらの親ページを推定し、これら推定した親ページとこれにリンクするページ集合をサイトとする。
【００３５】
サイト木構造は、トップページからのリンクにより、各サイトのページのリンクの親子関係から推定されるので、この推定された各サイト木構造を、例えばサイトＤＢ３等に保存しておく。このサイト木構造は、検索スコアの計算に用いられる。
【００３６】
サイトＤＢ３には、サイト更新量を計算するために、サイト毎に更新前後のファイル名リストを保存しておく。さらに、ファイル毎にその更新の有無を示す更新フラグを設けておく。
【００３７】
あるいはサイト更新量を計算するために、さらにはサイト検索結果を生成するために、図略のページＤＢを設ける。ページＤＢには、収集した全ページについて、動画／画像等のマルチメディアファイルへのリンク数、ファイルサイズ（テキストの量、Ｂｙｔｅ数）、またはタグを除いたファイルサイズ、文書ベクトルを保存しておく。さらに、ページ毎に、その更新の有無を示す更新フラグを設けておき、一定周期または適当な時期に、各サイト毎に全ページの更新の有無チェックを行う（Ｓ１３）。この更新有無チェックの詳細を図３に示す。なお、文書ベクトルとは、あらかじめ定めた複数の単語のそれぞれをベクトルの次元として、文書に各単語が出現した場合にその出現数等で対応する各次元に大きさを与え、当該文書をベクトル表現したものである。
【００３８】
図３において、サイトＤＢ３に収集した全ページについて、まず、画像等のマルチメディアファイルへのリンク数、ファイルサイズ（テキストの量、Ｂｙｔｅ数）、文書ベクトルをそれぞれ変数に保存する（Ｓ２１）。
【００３９】
次に、当該ページ毎に、ページＤＢに保存しておいた画像等のマルチメディアファイルのリンク数、ファイルサイズ、文書ベクトルを得、これら値と処理Ｓ２１で得た変数の値との差分を別の変数に格納する（Ｓ２２）。なお、差分は、リンク数とファイルサイズについては整数型とし、文書ベクトルについては非類似度（２つの文書ベクトルのコサイン値の逆数）とする。
【００４０】
次に、ページ毎に、上記の差分を基にページ更新量Ｕを計算する（Ｓ２３）。このページ更新量Ｕは、リンク数の増分とファイルサイズの増分及び文書ベクトルの非類似度から以下の式で求める。
【００４１】
【数１】
Ｕ＝αＬ／Ａ＋βＭ／Ｓ＋γＮ
ただし、Ｌはリンク数の増分、Ａは更新後のリンク数、Ｍはファイルサイズの増分、Ｓは更新後のファイルサイズ、Ｎは文書ベクトルの非類似度、α、β、γは０以上１以下の実数かつα＋β＋γ＝１を満たす値とする。なお、新規作成されたページの更新量Ｕは１としておく。
【００４２】
次に、上記式で求めたページ更新量Ｕがある閾値を越えているか否かをチェックし（Ｓ２４）、越えていればページＤＢの当該ページの更新フラグに「１」を格納し（Ｓ２５）、越えていなければ当該ページの更新フラグに「０」を格納する（Ｓ２６）。
【００４３】
以上までの処理を全ページに亙って行い、各ページ毎に大幅な更新があったか否かをページＤＢの各ページのフラグとして保存しておく。
【００４４】
続いて、各サイト毎にサイトの更新量の計算と保存を行う（Ｓ１４）。このサイト更新チェックの処理の詳細を図４に示す。
【００４５】
図４において、サイト分けされた全サイトについて、まず、サイトの更新量を計算する（Ｓ３１）。この計算は、例えば、サイトＤＢ３に保存された更新前後のサイト内のファイル名リストを比較し、両方のリストに出現するファイル数を更新後のファイル数で割った値（ファイル数の比率）として求める。この計算方法では、ページ毎に更新量を計算する手間を省くことができる。他の計算方法の例としては、サイト内のファイルのうち、更新フラグが１のファイル数を更新後のファイル数で割った値として求める。なお、新規サイトの更新量は１とする。計算された更新量は、更新なしのサイトを０とし、新規サイトを１とする間の値となる。計算された更新量は、サイトＩＤ毎にサイトＤＢ３に格納される（Ｓ４２）。
【００４６】
次に、上記の処理Ｓ３１で求めたサイト更新量がある閾値を越えているか否かをチェックし（Ｓ３３）、越えていればサイトＤＢ３の当該サイトの更新フラグに「１」を格納し（Ｓ３４）、越えていなければ当該サイトの更新フラグに「０」を格納する（Ｓ３５）。なお、この処理は、サイト検索結果の生成に必要な処理であり、サイト検索結果の内容によっては不要である。
【００４７】
以上までの処理を全サイトに亙って行い、各サイトＩＤ毎に、更新量と、必要に応じて大幅な更新があったか否かを表すフラグとを、サイトＤＢ３に保存しておく。
【００４８】
サイトＤＢ３に登録されているページに対して更新要求がある場合、サイト運営者情報管理サーバ４は、その都度、更新情報のページＵＲＬをサイト検索エンジン１等に登録し、サイト検索エンジン１等によって更新情報がサイトＤＢ３に上書きされる。このページの更新により、ページＤＢには更新前のファイルサイズデータ等が残っており、サイトＤＢ３の更新ページのファイルサイズ等は変更されており、この差分がページ更新チェック（Ｓ１３）やサイト更新チェック（Ｓ１４）に際して更新の有無判定やサイトの更新量の計算に利用される。
【００４９】
なお、ページ更新チェック（Ｓ１３）やサイト更新チェック（Ｓ１４）に代えて、ページ更新要求があったときに、サイト運営者情報管理サーバ４が前記のファイルサイズの増分、ファイル数の増分、リンク数の増分等からページの更新の有無をチェックし、ページのフラグ設定をしたり、サイトの更新量や更新の有無をチェックしたりしておくことでもよい。また、文書ベクトルの非類似度をサイト運営者情報管理サーバ４で管理し、更新の有無は定期的にサイト検索エンジン１に通知するか、特定ファイルに更新情報を記述し、そのファイルをロボット型の検索エンジンに収集させることでもよい。
【００５０】
＜サイト検索時の処理＞
図５は、本実施形態例によるＷｅｂサイト検索方法におけるサイト検索時の処理を説明するフローチャートである。図中、Ｓ４１〜Ｓ４９は処理のステップを表す。
【００５１】
まず、サイト検索エンジン１は、サイト検索端末装置からの検索要求に応じて、ヒットしたＷｅｂページを収集し、各ＷｅｂページについてサイトＤＢ３に対するサイト検索を行うことで（Ｓ４１）、検索要求に適合しているサイト単位の検索情報を取得する（Ｓ４２）。このサイト検索には、例えば、本願出願人が既に提案している技術（例えば、特願２００１−３８９４４６等）を利用することができる。
【００５２】
このサイト検索では、検索にヒットしたページ情報を基にサイトＤＢ３を検索し、ページ単位にサイトＩＤとそのトップページのＵＲＬを得るとともに、検索要求への適合度合いを表すものとして、検索スコアを求める。この検索スコアの高い順にサイトをソートしてランキングされたサイト検索情報をサイト検索結果生成部２に渡す。ここでの検索スコアの計算例では、ページ毎に検索キーワードの重みを考慮して得られる原スコアとサイト木構造を基にして計算される得点をサイト毎に加算して検索スコアとする。
【００５３】
具体的には、まず、そのページに含まれるキーワードの数と、検索対象となるページ集合のうち、キーワードを含むページ数の逆数との積により原スコアｓ（ｐ）を計算する（「情報検索と言語処理」、徳永健伸著、東京大学出版会ＩＳＢＮ４−１３−０６５４０５−５、ｐｐ．２６−３２）。次に、ページ毎にサイトＤＢ３等から各ページの属するサイトＩＤとサイト木構造におけるルート（トップページ）からの深さｄ（ｐ）を得る。次に、ページ毎に、次式によりキーワードへの適合度合いとして得点ｓｃｏｒｅ（ｐ）を計算し、サイト毎に加算して検索スコアとする。
【００５４】
【数２】
ｓｃｏｒｅ（ｐ）＝ｆ（ｓ（ｐ））×（１／（α＋（ｄ（ｐ））^２×β））
ただし、α，βは０より大で１より小の定数を表し、ｆ（ｓ（ｐ））はｓ（ｐ）が高い場合にはｓｃｏｒｅ（ｐ）も高くなるような任意の関数を表す。なお、本例では、サイト検索エンジン１が検索スコアを計算する場合を示したが、サイト検索エンジン１からは原スコアを取得してサイト検索結果生成部２やその他の装置で計算してもよい。
【００５５】
Ｓ４３以降は、サイト検索結果生成部２において、更新量に基づきサイト検索情報をソートする処理の手順を示している。
【００５６】
まず、サイト検索結果生成部２は、サイト検索エンジン１から受け取ったサイト検索情報をハッシュ変数Ｒに保存する（Ｓ４３）。ここで、Ｒは、サイトＩＤから検索スコアを引くことが可能な変数あり、Ｒ｛サイトＩＤ｝はサイト検索エンジン１から検索情報として与えられるそのサイトの検索スコアである。
【００５７】
次に、各サイトＩＤをキーとしてそのサイトの更新量をサイトＤＢ３より検索し、得られた更新量をサイトＩＤとともに配列変数Ａに保存する（Ｓ４４）。
【００５８】
以下、配列変数Ａをサイトの更新量でループ１，２を繰り返してソートする（Ｓ４５）。
【００５９】
ループ１は、配列変数Ａの各要素Ａｎについて処理を繰り返す。ループ２は、配列変数Ａｎにおいてｍ＞ｎの各要素Ａｍについて処理を繰り返す。
【００６０】
まず、Ａｎ［０］をサイトＩＤとし、Ａｎ［１］をそのサイトの更新量として、ループ２では、まずＡｎ［１］＝Ａｍ［１］か否かの第１の判定を行う（Ｓ４６）。第１の判定がＹｅｓであれば、Ｒ｛Ａｎ［０］｝＜Ｒ｛Ａｍ［０］｝か否か、すなわち、サイトＡｎの検索スコアがサイトＡｍの検索スコアより小さいか否かの第２の判定を行う（Ｓ４７）。第２の判定がＹｅｓであれば、スワップ処理によりＡｎとＡｍを入れ替える（Ｓ４８）。
【００６１】
第１の判定がＮｏの場合、または第２の判定がＮｏの場合は、ループ２の処理を繰り返す。あるサイトＡｎについて全てのＡｍについてループ２が終了すると、次のサイトＡｎ＋１についてループ２の処理を繰り返すというようにしてループ１の処理を最後の１つ前のサイトＡｎまで繰り返す。以上のループ１，２の繰り返しにより、更新量でソートされたサイトの配列変数Ｂを生成する。但し、配列変数Ｂは、配列変数Ａの各要素へのポインタを格納している。
【００６２】
こうして更新量でソートされた配列変数Ｂをサイト検索結果を生成する処理に渡す（Ｓ４９）。
【００６３】
続いて、図６にサイト検索結果生成部２でのサイト検索結果を生成する処理のフローチャートを示す。図中、Ｓ５１〜Ｓ５６は処理のステップである。
【００６４】
この処理で生成するサイト検索結果には、以下を併せて示す。
【００６５】
１）サイトを代表するトップページ情報、単数の検索キーワードまたは複数の検索キーワードのａｎｄ／ｏｒ論理によりヒットしたサイト内ページ情報、および検索スコア、
２）更新情報ページへのリンク（見た目はボタン等いろいろあり得る）、ただし、更新情報がない場合はリンクは表示されない。また、更新情報ページが登録されていなければ、「更新あり」を意味する情報を表示する（テキスト、画像など）。
【００６６】
ここでは、前述のページＤＢが生成されている場合のサイト検索結果を生成する処理について説明するが、サイト検索結果としては、少なくとも、更新量でソートされてランキングされたサイト情報と各サイトでのトップページＵＲＬやヒットしたサイト内ページＵＲＬが存在するものであれば任意であり、それに応じて不要になる処理もあるので以下の処理手順は変化する。
【００６７】
なお、サイト検索結果生成部２には、サイト検索エンジン１から検索スコアでソートされたサイト検索情報が渡されるので、更新量が同じサイトについては、更新量を加味しない検索スコアによるランキングの順でサイト検索結果が生成される。もし、検索スコアでランキングされていなければ、サイト検索エンジン１で計算された検索スコア、もしくはサイト検索結果生成部２で計算した検索スコアにより、ソートの処理において、更新量が同一のサイトについては検索スコアの高いサイトが上位になるように入れ替えを行えばよい。
【００６８】
図６において、検索結果として表示する各サイトに対して、まず、検索されたサイトのＩＤをキーとし、サイトＤＢ３を検索し、そのサイトの代表となるトップページＵＲＬと、更新情報の有無を意味するサイトとヒットページのフラグと、更新情報ページのＵＲＬを得る（Ｓ５１）。
【００６９】
次に、サイトのトップページＵＲＬと、これに関連するサイト内のヒットページＵＲＬへのリンクをＨＴＭＬとして出力する（Ｓ５２）。
【００７０】
次に、当該サイトのフラグをチェックし（Ｓ５３）、フラグが「１」であれば当該サイト内ページフラグから更新情報ページがあるか否かをチェックし（Ｓ５４）、更新情報ページがなければ当該サイトには更新有りを意味する画像などをＨＴＭＬとして出力し（Ｓ５５）、更新情報ページがあるときは当該更新情報ページへのリンク（ボタン等で表す）をＨＴＭＬとして出力する（Ｓ５６）。
【００７１】
図１でのサイト検索結果は、サイト検索時の表示例を示し、検索条件を「紳士靴」とし、この条件に合ったサイトとして「Ａシューズショップ」と「Ｂ靴店」が検索されて表示され、「Ａシューズショップ」にはヒットページが表示され、さらに関連する更新情報ページがある場合として「Ｎｅｗ」等のボタン表示を行う。検索者はそのボタンをクリックすることで、「Ａシューズショップ」があらかじめ選択したページが表示される。
【００７２】
したがって、通常の検索結果としてはヒットページを表示するが、本実施形態における検索結果には、それに加えて、更新度合いが大きいページについては更新情報ページへのリンクを表示し、更新度合いが小さいページについては更新があったことを表示する。同様に、更新度合いが大きいサイトと小さいサイトについても表示形態を変えることも可能である。これにより、サイト運営者には検索者の関心を惹く自サイト情報の自動生成処理ができ、検索者には更新度合いの高いサイトをサイト検索結果から容易に知ることが可能となる。
【００７３】
なお、図１で示した各部の機能実現手段をコンピュータのプログラムで構成したり、あるいは図２〜図６で示した処理のステップをコンピュータのプログラムで構成したりして、そのプログラムをコンピュータに実行させることができることは言うまでもなく、コンピュータでその機能を実現するためのプログラム、あるいは、コンピュータにその処理のステップを実行させるためのプログラムを、そのコンピュータが読み取りできる記録媒体、例えば、フレキシブルディスクや、ＭＯ、ＲＯＭ、メモリカード、ＣＤ、ＤＶＤ、リムーバブルディスクなどに記録して、保存したり、配布したりすることが可能である。また、上記のプログラムをインターネットや電子メールなど、ネットワークを通して提供することも可能である。これらの記録媒体からコンピュータに前記のプログラムをインストールすることにより、あるいはネットワークからダウンロードしてコンピュータに前記のプログラムをインストールすることにより、本発明を実施することが可能となる。
【００７４】
【発明の効果】
以上で明らかなように、本発明によれば、検索者は、サイトの情報に大きな更新があったかどうかを検索結果から知り、目的にあった結果を選択することが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態例によるＷｅｂサイト検索システムの構成図である。
【図２】本発明の一実施形態例によるＷｅｂサイト検索方法における事前処理全体の処理手順を示すフローチャートである。
【図３】ページ更新チェックの詳細を示すフローチャートである。
【図４】サイト更新チェックの詳細を示すフローチャートである。
【図５】本発明の一実施形態例によるＷｅｂサイト検索方法におけるサイト検索時の処理を説明するフローチャートである。
【図６】本発明の一実施形態例によるＷｅｂサイト検索方法におけるサイト検索結果を生成する処理を示すフローチャートである。
【符号の説明】
１…サイト検索エンジン
２…サイト検索結果生成部
３…サイトＤＢ
４…サイト運営者情報管理サーバ
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a Web information search method and apparatus for providing Web information on the WWW (World Wide Web) to searchers on a site-by-site basis, and in particular to a Web site search method and apparatus having a function of ranking site search results based on an update amount.
[0002]
[Prior art]
The WWW stores sets of Web pages as hypermedia information connected by links. A searcher reaches a Web server specified by a URL (Universal Resource Locator) and can obtain the information of the target set of Web pages by searching Web pages and following their links.
[0003]
Services based on search engines exist as systems for providing Web information when the URL is unknown or when the search range is to be broadened. Such a service uses a robot-type search engine that automatically collects WWW pages: it gathers a large number of Web pages by following hypertext links, divides them into sites, builds a database, and provides Web site information to searchers.
[0004]
If a Web site provided by the above Web information providing system gives no indication of whether its content has been updated, a searcher who has searched for and viewed the site once cannot tell whether the content has changed since the previous visit, and loses the motivation to visit again. As a result, the site operator gains little from providing the information, and the searcher misses opportunities to obtain useful information.
[0005]
In contrast, some systems attempt to present when Web information was updated. One example is fresheye, described in Non-Patent Document 1 below. fresheye has a function that looks at file time stamps and displays results in time-stamp order. It also has a new-arrival search function that searches only information updated relatively recently.
[0006]
[Non-patent document 1]
fresheye (Internet URL: http://www.fresheye.co.jp/), retrieved from the Internet on January 7, 2003
[0007]
[Problems to be solved by the invention]
However, because these fresheye functions look only at the time stamp of a file, they cannot indicate how large the update to that file was. Nor can methods based on these functions reveal how much the site as a whole has been updated.
[0008]
The present invention has been made to solve the above problems of the related art. Its object is to provide a site search technique that, in a site search, lets a searcher learn from the search results whether a site's information has been substantially updated and select results suited to the purpose.
[0009]
[Means for Solving the Problems]
To solve the above problem, the present invention provides, as its means of solution, a Web site search method comprising: a step of collecting Web pages in advance, dividing them into sites, calculating an update amount for each site, and storing it in a site database; a step of performing a page search in response to a search request from a site search terminal device based on a searcher's search request, performing a site search on the site database for the hit pages, and acquiring site search information including a plurality of sites; a step of acquiring, from the site database, the update amounts of the plurality of sites included in the site search information; a step of sorting the site search information in descending order of the acquired update amounts; and a step of generating a site search result based on the sorted site search information and transmitting it to the site search terminal device.
[0010]
Alternatively, in the above Web site search method, in the step of calculating the update amount of the site and storing it in the site database, the update amount of the site is calculated either by comparing the file name lists of the site before and after the update and dividing the number of files appearing in both lists by the number of files after the update, or by dividing the number of files in the updated site whose update flag is 1 by the total number of files.
[0011]
Alternatively, in the above Web site search method, in the step of acquiring the site search information, a search score indicating the degree of match to the site search is calculated and included in the site search information, and in the sorting step, when a plurality of sites have the same update amount, those sites are sorted based on the search score, or site search information already sorted based on the search score is output.
[0012]
Alternatively, in the above Web site search method, the update amount of each site is presented together with the site search result in the sorting step.
[0013]
Alternatively, the means of solution is a Web site search device comprising: a site database that stores, for each site into which Web pages collected in advance have been divided, the update amount calculated for that site; a site search engine that performs a page search in response to a search request from a site search terminal device based on a searcher's search request, performs a site search on the site database for the hit pages, and acquires site search information including a plurality of sites; means for acquiring, from the site database, the update amounts of the plurality of sites included in the site search information; means for sorting the site search information in descending order of the acquired update amounts; and means for generating a site search result based on the sorted site search information and outputting it to the site search terminal device.
[0014]
Alternatively, in the above Web site search device, the update amount of the site is a value calculated either by comparing the file name lists of the site before and after the update and dividing the number of files appearing in both lists by the number of files after the update, or by dividing the number of files in the updated site whose update flag is 1 by the total number of files.
[0015]
Alternatively, in the above Web site search device, when a plurality of sites have the same update amount, the sorting means either sorts them based on a search score indicating the degree of match to the site search, calculated by the site search engine or by the site search terminal device, or outputs site search information already sorted by the search score calculated by the site search engine.
[0016]
Alternatively, in the above Web site search device, the sorting means presents the update amount of each site together with the sorted search results.
[0017]
Alternatively, the means of solution is a Web site search program that causes a computer to execute the steps of the above Web site search method.
[0018]
Alternatively, the means of solution is a recording medium on which a Web site search program is recorded, the program causing a computer to execute the steps of the above Web site search method and being recorded on a recording medium readable by that computer.
[0019]
In the present invention, Web pages collected in advance are divided into sites, and the update amount of each site is calculated and stored. The site search device acquires the update amounts of the sites included in a site search result, sorts the result in descending order of update amount, and presents it to the searcher. The searcher can thus learn from the site search result whether a site's information has been substantially updated and can select results suited to the purpose.
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0021]
FIG. 1 shows the configuration of a Web site search system according to this embodiment. The system includes, as the main components of the site search device, a site search engine 1, a site search result generation unit 2, and a site database (hereinafter, database is abbreviated as DB) 3, and additionally includes a site operator information management server 4.
[0022]
The site search engine 1 performs a page search in response to a search request from a site search terminal device (not shown), performs a site search on the site DB 3 for the hit pages, and passes a plurality of pieces of site search information, each including a hit page ID and the site ID of that page, to the site search result generation unit 2.
[0023]
The site search result generation unit 2 comprises: means for receiving, from the site search engine 1, site search information including a plurality of site IDs found by the site search; means for acquiring the update amounts of those site IDs from the site DB 3, in which an update amount has been calculated and stored in advance for each site ID; means for sorting the site search information in descending order of the acquired update amounts; means for generating a site search result from the sorted site search information; and means for transmitting the generated search result to the site search terminal device that issued the site search request.
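The sorting behaviour described for the site search result generation unit 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `SiteHit` record, its field names, and the sample values are hypothetical. Ties in update amount are broken by search score, following paragraph [0011].

```python
from dataclasses import dataclass


@dataclass
class SiteHit:
    site_id: str          # hypothetical site identifier
    search_score: float   # relevance score supplied by the site search engine
    update_amount: float  # value read from the site DB, between 0 and 1


def rank_by_update_amount(hits):
    """Sort by update amount (descending); break ties by search score (descending)."""
    return sorted(hits, key=lambda h: (-h.update_amount, -h.search_score))


hits = [
    SiteHit("site-a", search_score=0.9, update_amount=0.2),
    SiteHit("site-b", search_score=0.4, update_amount=0.7),
    SiteHit("site-c", search_score=0.6, update_amount=0.2),
]
ranked = rank_by_update_amount(hits)
print([h.site_id for h in ranked])  # ['site-b', 'site-a', 'site-c']
```

Because Python's `sorted` is stable, passing hits that are already ordered by search score and sorting on update amount alone would give the same tie order.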
[0024]
The site search terminal device comprises means for issuing a search request to the site search engine 1 based on a searcher's search request, and output means for receiving the site search result generated by the site search result generation unit 2 and outputting it to an output device such as a display device.
[0025]
The site DB 3 stores, for each site ID, the update amount calculated for the corresponding site after the Web pages collected in advance have been divided into sites, and provides that update amount in response to a request from the site search result generation unit 2. The calculation and storage of the update amount may be performed, for example, by the site search engine 1 or the site operator information management server 4 after acquiring the necessary data, but may also be performed by another device.
[0026]
The site operator information management server 4 manages the creation and updating of the Web pages provided by the site operator, notifies the site search engine 1 of whether Web pages have been updated, and, when necessary, calculates the update amount of a site and stores it in the site DB 3.
[0027]
Next, a processing procedure of the Web site search method according to the present embodiment, which is executed by the present system, will be described.
[0028]
The method consists of pre-processing and processing at the time of site search. Each processing procedure is described below with reference to FIG. 1.
[0029]
<Pre-processing>
FIG. 2 is a flowchart showing a processing procedure of the entire pre-processing in the Web site search method according to the embodiment. S11 to S14 represent processing steps. The pre-processing is performed independently before the processing at the time of site search.
[0030]
First, a large number of Web pages are collected by a device, such as a server, that performs the pre-processing (S11). In this example, the site search device is assumed to be that device.
[0031]
The site search device divides the collected Web pages into sites (S12). In this site division, the top page of a Web site is estimated from the large set of Web pages, and the set of pages consisting of the estimated top page and the pages linked to it is determined to be a site.
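As a rough illustration of the site division step, the sketch below gathers the pages reachable from an already estimated top page. It is a simplification under stated assumptions: the `links` mapping is a hypothetical input format, the example URLs are invented, and restricting the site to the top page's host is an assumption of this sketch rather than something the text specifies.

```python
from collections import deque
from urllib.parse import urlsplit


def collect_site(top_page, links):
    """Return the set of pages reachable from top_page by following links.

    links maps a page URL to the list of URLs it links to (hypothetical format).
    Only pages on the same host as top_page are kept (assumption of this sketch).
    """
    host = urlsplit(top_page).netloc
    site = {top_page}
    queue = deque([top_page])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in site and urlsplit(target).netloc == host:
                site.add(target)
                queue.append(target)
    return site


links = {
    "http://shop.example/": ["http://shop.example/shoes.html", "http://other.example/"],
    "http://shop.example/shoes.html": ["http://shop.example/new.html"],
}
site = collect_site("http://shop.example/", links)
print(sorted(site))
```

The top page estimation itself relies on the classification likelihoods described in the following paragraphs and is not modelled here.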
[0032]
For the top page estimation and site determination, techniques already proposed by the present applicant (for example, Japanese Patent Application Nos. 2001-389447 and 2001-389448) can be used.
[0033]
To estimate the top page of a site, the retrieved Web pages are first classified by server; a page classification tree, in which page classes are extracted from meta information and page types, is obtained; and the classification likelihood of each page for each page type is extracted based on this tree. Then, among the pages belonging to the same server, a page located at directory level 0 is estimated to be the top page with the highest priority. If no top page exists at level 0, the directory level at which the top page exists is determined based on the classification likelihood for the top page type, and a page that belongs to that directory level, has files in lower levels, and whose classification likelihood for the page type is the maximum and, further, at or above a threshold is estimated to be the top page.
[0034]
A site is determined as follows. For each page and link of the retrieved set of Web pages, the classification likelihood of each link for each link type is obtained using meta information and a link classification tree; in addition, a page classification tree, in which page classes are extracted from meta information and page types, is obtained and the classification likelihood of each page for each page type is extracted. Then, for every server to which the Web pages belong, a set of top page candidates of Web sites is obtained based on these classification likelihoods, and their parent pages are estimated. From the set of pages whose parent page is still undetermined after this estimation, the page that lies at the shallowest directory level and has the largest sum of top page, index page, and menu page likelihoods is taken as a top page candidate, and its parent page is estimated. Each estimated parent page, together with the set of pages linked to it, is treated as a site.
[0035]
The site tree structure is estimated from the parent-child relationships of the links among the pages of each site, starting from the links out of the top page. Each estimated site tree structure is stored in, for example, the site DB 3. This site tree structure is used for calculating the search score.
[0036]
In the site DB 3, a list of file names before and after updating is stored for each site in order to calculate the amount of site updating. Further, an update flag indicating whether or not the file has been updated is provided for each file.
[0037]
Alternatively, a page DB (not shown) is provided in order to calculate the site update amount and, further, to generate site search results. For every collected page, the page DB stores the number of links to multimedia files such as moving images and still images, the file size (the amount of text, in bytes) or the file size excluding tags, and the document vector. Further, an update flag indicating whether the page has been updated is provided for each page, and whether each page has been updated is checked for all pages of each site at a fixed cycle or at an appropriate time (S13). The details of this update check are shown in FIG. 3. Note that a document vector is a vector representation of a document in which each of a plurality of predetermined words is one dimension of the vector and, when a word appears in the document, the corresponding dimension is given a magnitude such as the number of occurrences of that word.
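The document vector defined above can be illustrated with a short sketch. The vocabulary, the sample sentence, and the whitespace tokenisation are assumptions made for this example; Japanese text would need a proper word segmenter, which is outside the scope of this sketch.

```python
from collections import Counter


def document_vector(text, vocabulary):
    """Return one occurrence count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenisation (assumption)
    return [counts[word] for word in vocabulary]


vocabulary = ["shoes", "sale", "leather"]  # hypothetical predetermined word list
vec = document_vector("Leather shoes on sale and more shoes", vocabulary)
print(vec)  # [2, 1, 1]
```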
[0038]
In FIG. 3, for all pages collected in the site DB 3, first, the number of links to a multimedia file such as an image, the file size (the amount of text, the number of bytes), and the document vector are stored in variables (S21).
[0039]
Next, for each page, the number of links to multimedia files such as images, the file size, and the document vector stored in the page DB are obtained, and the differences between these values and the values of the variables obtained in step S21 are stored in separate variables (S22). The difference is an integer for the number of links and for the file size, and a dissimilarity (the reciprocal of the cosine of the two document vectors) for the document vector.
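A minimal sketch of step S22 follows, assuming each page record is a dictionary with hypothetical keys `links`, `size`, and `vector`. The dissimilarity follows the definition in the text, the reciprocal of the cosine; returning infinity when the cosine is zero is a choice made for this sketch, since the text does not cover that case.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dissimilarity(u, v):
    """Reciprocal of the cosine, as defined in the text; infinity when the cosine is 0."""
    c = cosine(u, v)
    return math.inf if c == 0 else 1.0 / c


def page_differences(old, new):
    """Differences between the stored (old) and current (new) values of one page."""
    return {
        "links": new["links"] - old["links"],  # integer increment of link count
        "size": new["size"] - old["size"],     # integer increment of file size
        "dissimilarity": dissimilarity(old["vector"], new["vector"]),
    }


old = {"links": 3, "size": 1000, "vector": [1, 0]}
new = {"links": 5, "size": 1500, "vector": [1, 1]}
diff = page_differences(old, new)
print(diff["links"], diff["size"], round(diff["dissimilarity"], 4))  # 2 500 1.4142
```

Note that with this definition two identical vectors give a dissimilarity of 1, not 0, so the value is not confined to the range 0 to 1.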
[0040]
Next, a page update amount U is calculated for each page based on the above difference (S23). The page update amount U is obtained from the following equation from the increase in the number of links, the increase in the file size, and the dissimilarity of the document vector.
[0041]
(Equation 1)
U = αL / A + βM / S + γN
Here, L is the increment in the number of links, A is the number of links after the update, M is the increment in the file size, S is the file size after the update, N is the dissimilarity of the document vectors, and α, β, and γ are real numbers between 0 and 1 inclusive that satisfy α + β + γ = 1. The update amount U of a newly created page is set to 1.
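Equation 1 can be evaluated as in the sketch below. The function name, the parameter names, the weight values, and the sample inputs are assumptions for illustration; treating a term as 0 when its denominator is 0 is also an assumption, since the text does not say how empty pages are handled.

```python
def page_update_amount(link_inc, links_after, size_inc, size_after,
                       dissim, alpha, beta, gamma, is_new=False):
    """U = alpha*L/A + beta*M/S + gamma*N; a newly created page gets U = 1."""
    if is_new:
        return 1.0
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    # Guard against division by zero (assumption: such a term contributes 0).
    link_term = link_inc / links_after if links_after else 0.0
    size_term = size_inc / size_after if size_after else 0.0
    return alpha * link_term + beta * size_term + gamma * dissim


# Hypothetical values: L=2, A=5, M=500, S=1500, N=0.5, weights 0.3/0.3/0.4.
u = page_update_amount(2, 5, 500, 1500, 0.5, alpha=0.3, beta=0.3, gamma=0.4)
print(round(u, 2))  # 0.42
```

With these inputs the three terms are 0.3 × 0.4 = 0.12, 0.3 × (1/3) = 0.10, and 0.4 × 0.5 = 0.20, which sum to 0.42.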
[0042]
Next, it is checked whether or not the page update amount U obtained by the above equation exceeds a certain threshold value (S24), and if it does, "1" is stored in the update flag of the page in the page DB (S25). If not, "0" is stored in the update flag of the page (S26).
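The threshold check of steps S24 to S26 reduces to a single comparison. In the sketch below the threshold value and the page names are hypothetical; the text does not specify a concrete threshold.

```python
def update_flag(update_amount, threshold):
    """Return 1 when the update amount exceeds the threshold, otherwise 0."""
    return 1 if update_amount > threshold else 0


THRESHOLD = 0.3  # hypothetical value; not given in the text
page_amounts = {"index.html": 0.42, "about.html": 0.05, "new.html": 1.0}
flags = {name: update_flag(u, THRESHOLD) for name, u in page_amounts.items()}
print(flags)  # {'index.html': 1, 'about.html': 0, 'new.html': 1}
```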
[0043]
The above processing is performed for all pages, and whether or not there is a significant update for each page is stored as a flag for each page in the page DB.
[0044]
Subsequently, the update amount of the site is calculated and stored for each site (S14). FIG. 4 shows the details of the site update check process.
[0045]
In FIG. 4, the update amount of each site is first calculated for all the divided sites (S31). This calculation is performed, for example, by comparing the lists of file names in the site before and after the update stored in the site DB 3, and dividing the number of files appearing in both lists by the number of files after the update (a ratio of file counts). This calculation method saves the trouble of calculating the update amount for each page. As another example of the calculation method, the number of files in the site whose update flag is 1 is divided by the number of files in the site after the update. Note that the update amount of a new site is 1. The calculated update amount is a value between 0, for a site without any update, and 1, for a new site. The calculated update amount is stored in the site DB 3 for each site ID (S32).
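The flag-based variant of the site update amount can be sketched as follows. Only that second calculation method is shown; the function name, the dictionary input, and the treatment of an empty site are assumptions made for the example.

```python
def site_update_amount(page_flags, is_new_site=False):
    """Fraction of files in the site whose update flag is 1.

    page_flags: dict mapping file name -> update flag (0 or 1) for the
    files present in the site after the update (hypothetical structure).
    """
    if is_new_site:
        return 1.0  # a new site always has update amount 1
    if not page_flags:
        return 0.0  # assumption: an empty site counts as not updated
    return sum(page_flags.values()) / len(page_flags)
```

A site where two of four files are flagged yields 0.5, a site with no flagged files yields 0, and a new site yields 1, which matches the stated range of the update amount.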
[0046]
Next, it is checked whether or not the site update amount calculated in step S31 exceeds a certain threshold value (S33). If it exceeds the threshold, "1" is stored in the update flag of the site in the site DB 3 (S34); if not, "0" is stored in the update flag of the site (S35). This processing is needed only for generating the site search result, and may be unnecessary depending on the content of the site search result.
[0047]
The above processing is performed for all sites, and the update amount and, where needed, a flag indicating whether or not a large update has occurred are stored in the site DB 3 for each site ID.
[0048]
When there is an update request for a page registered in the site DB 3, the site operator information management server 4 registers the page URL of the update information in the site search engine 1 or the like, and the update information is overwritten in the site DB 3. As a result of this page update, the pre-update file size and other data remain in the page DB, while the file size and other data of the updated page in the site DB 3 are changed. This difference is used in the page update check (S13) and the site update check (S14) to determine whether an update has occurred and to calculate the update amount of the site.
[0049]
Instead of the page update check (S13) and the site update check (S14), the site operator information management server 4 may, when a page update request is issued, check whether the page has been updated from the file size increment, the file count increment, the link count increment, and the like described above, and then set the flag of the page or determine the update amount of the site or whether the site has been updated. Further, the dissimilarity of the document vector may be managed by the site operator information management server 4, which periodically notifies the site search engine 1 of whether an update has occurred; alternatively, the update information may be described in a specific file, and that file may be collected by the robot-type search engine.
[0050]
<Process at the time of site search>
FIG. 5 is a flowchart for explaining processing at the time of site search in the Web site search method according to the embodiment. In the figure, S41 to S49 represent processing steps.
[0051]
First, the site search engine 1 collects hit Web pages in response to a search request from the site search terminal device and performs a site search on the site DB 3 for each Web page (S41), thereby obtaining search information for each site that satisfies the search request (S42). For this site search, a technology already proposed by the present applicant (for example, Japanese Patent Application No. 2001-389446) can be used.
[0052]
In this site search, the site DB 3 is searched based on the page information hit in the search, a site ID and the URL of the top page are obtained for each page, and a search score is obtained as a value indicating the degree of conformity to the search request. The site search information, sorted and ranked in descending order of search score, is passed to the site search result generation unit 2. In the calculation example given here, an original score that takes into account the weight of the search keyword for each page is combined with a score calculated based on the site tree structure, and the results are summed for each site to obtain the search score.
[0053]
Specifically, the original score s(p) is first calculated as the product of the number of keywords included in the page and the reciprocal of the number of pages that include the keyword in the set of pages to be searched (Takenobu Tokunaga, "Information Search and Linguistic Processing," University of Tokyo Press, ISBN 4-13-064055-5, pp. 26-32). Next, for each page, the site ID to which the page belongs and its depth d(p) from the root (top page) in the site tree structure are obtained from the site DB 3 or the like. Then, for each page, a score score(p) is calculated by the following formula as the degree of matching with the keyword, and the scores are summed for each site to obtain the search score.
[0054]
(Equation 2)
score (p) = f (s (p)) × (1 / (α + (d (p))) ² × β))
Here, α and β are constants greater than 0 and less than 1, and f(s(p)) is an arbitrary function such that score(p) becomes higher as s(p) becomes higher. Although this example describes the case where the site search engine 1 calculates the search score, the original score may instead be acquired from the site search engine 1 and the search score calculated by the site search result generation unit 2 or another device.
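The search score calculation can be sketched as follows. The parenthesization of Equation 2 is ambiguous in this text, so the sketch assumes the reading score(p) = f(s(p)) × 1 / (α + β·d(p)²); the identity function for f, the default constants, the function names, and the `(site_id, s, depth)` tuple layout are likewise assumptions.

```python
from collections import defaultdict

def original_score(keyword_count, pages_with_keyword):
    """s(p): keyword count in the page times 1 / (pages containing the keyword)."""
    return keyword_count * (1.0 / pages_with_keyword)

def page_score(s, depth, alpha=0.5, beta=0.5, f=lambda x: x):
    """Assumed reading of Equation 2: f(s) / (alpha + beta * depth**2)."""
    return f(s) * (1.0 / (alpha + depth ** 2 * beta))

def site_search_scores(hits, alpha=0.5, beta=0.5):
    """Sum page scores per site; hits is an iterable of (site_id, s, depth)."""
    totals = defaultdict(float)
    for site_id, s, depth in hits:
        totals[site_id] += page_score(s, depth, alpha, beta)
    return dict(totals)
```

Under this reading, a hit on the top page (depth 0) is weighted most heavily, and the weight falls off with the square of the depth in the site tree.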
[0055]
S43 and subsequent steps show a procedure of a process of sorting the site search information based on the update amount in the site search result generation unit 2.
[0056]
First, the site search result generation unit 2 stores the site search information received from the site search engine 1 in a hash variable R (S43). Here, R is a variable from which a search score can be looked up by site ID, and R{site ID} is the search score of the site given as search information by the site search engine 1.
[0057]
Next, the update amount of the site is searched from the site DB3 using each site ID as a key, and the obtained update amount is stored in the array variable A together with the site ID (S44).
[0058]
Next, the array variable A is sorted by the update amount of the site by repeating loops 1 and 2 (S45).
[0059]
Loop 1 repeats the process for each element An of array variable A. Loop 2 repeats the process for each element Am with m > n in array variable A.
[0060]
First, in loop 2, with An[0] being the site ID and An[1] being the update amount of the site, a first determination is made as to whether An[1] = Am[1] (S46). If the first determination is Yes, a second determination is made as to whether R{An[0]} < R{Am[0]}, that is, whether the search score of site An is smaller than the search score of site Am (S47). If the second determination is Yes, An and Am are exchanged by swap processing (S48).
[0061]
When the first determination is No or the second determination is No, the processing of loop 2 is repeated. When loop 2 has been completed for all Am of a given site An, loop 1 advances to the next site An+1 and loop 2 is repeated for that site. By repeating loops 1 and 2 in this way, an array variable B of the sites sorted by update amount is generated. Note that array variable B stores pointers to the elements of array variable A.
[0062]
The array variable B sorted by the update amount is passed to the processing for generating the site search result (S49).
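The outcome of steps S43 to S49 can be sketched as below. This does not reproduce the nested loop and swap procedure; it substitutes Python's built-in `sorted` with a composite key to reach the ordering the text describes: descending update amount, with ties broken by descending search score. The function name and data layout are assumptions, and the result is a new list rather than an array of pointers into A.

```python
def sort_sites(R, A):
    """Order sites by update amount, then by search score, both descending.

    R: dict mapping site ID -> search score (the hash variable R).
    A: list of (site_id, update_amount) pairs (the array variable A).
    """
    return sorted(A, key=lambda entry: (-entry[1], -R[entry[0]]))

# Usage with hypothetical site IDs and values:
R = {"s1": 0.9, "s2": 0.4, "s3": 0.7}
A = [("s3", 0.2), ("s2", 0.8), ("s1", 0.2)]
B = sort_sites(R, A)
```

Here `s2` comes first because it has the largest update amount, and `s1` precedes `s3` because the two share an update amount and `s1` has the higher search score.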
[0063]
Next, FIG. 6 shows a flowchart of a process of generating a site search result in the site search result generation unit 2. In the figure, S51 to S56 are processing steps.
[0064]
The site search result generated by this process also shows the following.
[0065]
1) Top page information representing the site, page information within the site hit by the AND / OR logic of a single search keyword or a plurality of search keywords, and a search score;
2) A link to the update information page (which may take various forms, such as a button); if there is no update information, the link is not displayed. If no update information page has been registered, information indicating "updated" (text, an image, etc.) is displayed.
[0066]
Here, the process of generating a site search result when the above-described page DB has been generated will be described. The site search result displays at least the sites sorted and ranked by update amount, together with the site information of each site. What else is displayed is arbitrary as long as the top page URL and the hit site URL are present; because some processing becomes unnecessary depending on what is displayed, the following procedure changes accordingly.
[0067]
In addition, because the site search information passed from the site search engine 1 to the site search result generation unit 2 is already sorted by search score, a site search result in which sites having the same update amount are ranked in order of search score is generated without any special handling. If the sites are not already ranked by search score, the search score calculated by the site search engine 1 or by the site search result generation unit 2 is used in the sorting process to reorder sites having the same update amount so that sites with higher search scores are ranked higher.
[0068]
In FIG. 6, for each site to be displayed as a search result, the site DB 3 is first searched using the ID of the retrieved site as a key, and the top page URL representing the site, the flags of the site and of the hit pages indicating the presence or absence of update information, and the URL of the update information page are obtained (S51).
[0069]
Next, links to the top page URL of the site and to the related hit page URLs within the site are output as HTML (S52).
[0070]
Next, the flag of the site is checked (S53). If the flag is "1", whether or not there is an update information page is checked from the page flags in the site (S54). If there is no update information page, an image or the like indicating that the site has been updated is output as HTML (S55); if there is an update information page, a link to it (represented by a button or the like) is output as HTML (S56).
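Steps S52 to S56 can be sketched as a small HTML generator. The function name, the markup, the class names, and the "New" and "updated" labels are illustrative assumptions; the sketch also assumes that the "updated" indication is output only when no update information page exists.

```python
from html import escape

def render_site_result(top_url, hit_urls, site_flag, update_info_url=None):
    """Build the HTML fragment for one site in the search result."""
    # S52: links to the top page and to each hit page in the site.
    parts = [f'<a href="{escape(top_url)}">{escape(top_url)}</a>']
    for url in hit_urls:
        parts.append(f'<a href="{escape(url)}">{escape(url)}</a>')
    # S53: only sites whose update flag is 1 get an update indication.
    if site_flag == 1:
        if update_info_url:
            # S56: link to the update information page.
            parts.append(
                f'<a class="update-link" href="{escape(update_info_url)}">New</a>'
            )
        else:
            # S55: plain indication that the site has been updated.
            parts.append('<span class="updated">updated</span>')
    return "\n".join(parts)
```

`html.escape` is applied to every URL so that characters such as `&` or quotes cannot break the generated attributes.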
[0071]
The site search result in FIG. 1 shows a display example at the time of a site search. The search condition is "men's shoes", and "A shoe shop" and "B shoe shop" are retrieved and displayed as sites meeting this condition. A hit page is displayed under "A shoe shop", and a button such as "New" is displayed when there is a related update information page. When the searcher clicks the button, a page selected in advance by "A shoe shop" is displayed.
[0072]
Thus, hit pages are displayed as in a normal search result, but in the present embodiment the search result additionally displays a link to an update information page for a page with a large degree of update, and an indication that there has been an update for a page with a small degree of update. Similarly, the display mode can be changed between a site with a large degree of update and a site with a small degree of update. As a result, the site operator can automatically generate information about its own site that attracts the searcher's interest, and the searcher can easily identify sites with a high degree of update from the site search result.
[0073]
Needless to say, the function-realizing means of each unit shown in FIG. 1 may be implemented as a computer program, or the processing steps shown in FIGS. 2 to 6 may be implemented as a computer program, and the program may be executed by a computer. The program for realizing those functions on a computer, or for causing a computer to execute those processing steps, can be recorded on a computer-readable recording medium such as a flexible disk, an MO, a ROM, a memory card, a CD, a DVD, or a removable disk, and stored or distributed in that form. The program can also be provided through a network such as the Internet or by e-mail. The present invention can be implemented by installing the program on a computer from such a recording medium, or by downloading it from a network and installing it on a computer.
[0074]
[Effect of the Invention]
As apparent from the above, according to the present invention, a searcher can know from a search result whether or not there has been a large update in site information, and can select a desired result.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a Web site search system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a processing procedure of the entire pre-processing in a Web site search method according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating details of a page update check.
FIG. 4 is a flowchart showing details of a site update check.
FIG. 5 is a flowchart illustrating processing at the time of a site search in a Web site search method according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a process of generating a site search result in a Web site search method according to an embodiment of the present invention.
[Explanation of symbols]
1. Site search engine
2. Site search result generation unit
3. Site DB
4. Site operator information management server

Claims

Collecting web pages in advance, dividing them into sites, calculating the update amount of the sites for each site, and storing the updated amount in the site database;
A step of performing a page search in response to a search request from a site search terminal device based on a search request of a searcher, performing a site search on the site database for the hit pages, and obtaining site search information including a plurality of sites;
A step of acquiring the update amounts of the plurality of sites included in the site search information from the site database,
Sorting the site search information in descending order of the amount of update of the acquired site;
Generating a site search result based on the sorted site search information and transmitting the result to the site search terminal device.

2. The Web site search method according to claim 1, wherein, in the step of calculating the update amount of the site and storing it in the site database, the update amount of the site is calculated either as the value obtained by comparing the lists of file names in the site before and after the update and dividing the number of files appearing in both lists by the number of files after the update, or as the value obtained by dividing the number of files in the updated site whose update flag is 1 by the total number of files.

3. The Web site search method, wherein, in the step of obtaining site search information, a search score indicating the degree of conformity to the site search is calculated and included in the site search information, and, in the sorting step, when there are a plurality of sites having the same update amount, the sites are sorted based on the search score, or site search information already sorted based on the search score is output.

4. The Web site search method according to claim 1, wherein, in the sorting step, the update amount of each site is presented together with the site search result.

A site database in which Web pages collected in advance are divided into sites, and an update amount of the site calculated for each site is stored;
A site search engine that performs a page search in response to a search request from a site search terminal device based on a search request of a searcher, performs a site search on the site database for the hit pages, and obtains site search information including a plurality of sites,
Means for obtaining the update amount of the plurality of sites included in the site search information from the site database,
Means for sorting the site search information in the descending order of the update amount of the acquired site,
Means for generating a site search result based on the sorted site search information and outputting the result to the site search terminal device.

6. The Web site search device according to claim 5, wherein the update amount of the site is a value calculated either by comparing the lists of file names in the site before and after the update and dividing the number of files appearing in both lists by the number of files after the update, or by dividing the number of files in the updated site whose update flag is 1 by the total number of files.

7. The Web site search device according to claim 5, wherein, when there are a plurality of sites having the same update amount, the sorting means sorts the site search information based on a search score indicating the degree of conformity to the site search calculated by the site search engine or by the site search terminal device, or uses site search information already sorted by the search score indicating the degree of conformity to the site search calculated by the site search engine.

The Web site search device according to any one of claims 5 to 7, wherein the sorting means presents the sorted search result together with the update amount of each site.

A Web site search program that causes a computer to execute the steps of the Web site search method according to any one of claims 1 to 4.

A recording medium recording a Web site search program, wherein a program for causing a computer to execute the steps of the Web site search method according to any one of claims 1 to 4 is recorded on a recording medium readable by the computer.