JP5174385B2

JP5174385B2 - Duplicate Web site dynamic detection device

Info

Publication number: JP5174385B2
Application number: JP2007177285A
Authority: JP
Inventors: 孝之田村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-07-05
Filing date: 2007-07-05
Publication date: 2013-04-03
Anticipated expiration: 2027-07-05
Also published as: JP2009015636A

Description

The present invention relates to a duplicate Web site dynamic detection apparatus that determines whether Web sites are duplicates dynamically, keeping pace with Web crawling.

A duplicate Web site (mirror site) is a set of Web sites that serve Web pages with identical content and whose URLs (Uniform Resource Locators) differ only in the site name (host name) portion. Duplicate Web sites arise for various reasons: physical copies made for load distribution or backup, or the registration of multiple host names in DNS (Domain Name System) so that one site logically appears to be many, with the aim of improving search engine ranking.

Detecting duplicate Web sites is expected to reduce duplicates in search engine results and to improve both the efficiency of information collection by Web crawling and the cache hit rate in proxy servers and browsers.

Some conventional duplicate Web site detection apparatuses analyze, in a single batch, the set of Web pages collected by Web crawling, and determine whether two or more sites are duplicate Web sites based on the commonality of the path name portion of the URL string and of the hash value of the Web page content (see, for example, Patent Document 1).

Others analyze in a batch the similarity of URL strings, the similarity of IP addresses, the commonality of link destination URLs, and so on, and determine that two sites are duplicate Web sites when these indices are higher than reference values (see, for example, Patent Document 2).

Patent Document 1: JP 2006-215735 A
Patent Document 2: JP 2004-264926 A

However, the prior art has the following problems.
Conventional duplicate Web site detection methods assume that information about all Web pages is supplied in a batch, so they cannot immediately judge whether an unknown Web site discovered during Web crawling is a duplicate Web site. If the site is in fact a duplicate, the crawler acquires many Web pages whose content is identical to that of other Web sites, which lowers crawling efficiency and causes the statistical properties of the collected results to deviate from reality.

One conceivable countermeasure is to acquire only a small number of Web pages from a newly discovered Web site, suspend Web crawling, and run duplicate Web site detection. However, duplicate Web site detection requires analysis of the entire body of Web page information, so the execution time (that is, the time during which Web crawling is stopped) becomes long, and it becomes difficult to capture rapidly changing Web information in a timely manner.

The present invention has been made to solve the problems described above. Its object is to obtain a duplicate Web site dynamic detection apparatus that determines Web site duplication dynamically, keeping pace with Web crawling, and thereby enables efficient collection of Web information and removal of unnecessary information.

The duplicate Web site dynamic detection apparatus according to the present invention comprises an accepting unit and a determination unit. The accepting unit receives the URL and content of a Web page discovered during Web crawling; stores in a storage unit, as Web page state information, the site name and path name extracted from the URL in association with a content feature value calculated from the content; and stores in the storage unit, as duplicate Web site candidate information, the set of site names whose pairs of content feature value and path name match, in association with that content feature value and path name. The determination unit obtains, for the site names included in the set of site names, from all path names and content feature values corresponding to each site name, the number of path names that have only one content feature value across a plurality of site names (the number of hits) and the number of path names that have a plurality of content feature values across a plurality of site names (the number of misses), and detects the set of site names as a duplicate Web site set when the number of hits and the number of misses are within predetermined ranges. The accepting unit further stores in the storage unit, as Web site state information, each site name in association with a number of paths, which corresponds to the number of distinct path names having that site name, and a number of duplicate paths, which corresponds to the number of path names, among those counted in the number of paths, whose pair of path name and content feature value also exists under a different site name. The determination unit performs the detection of a duplicate Web site set when, for each site name included in the set of site names, the number of paths and the ratio of the number of paths to the number of duplicate paths are both within predetermined ranges.

According to the present invention, duplicate Web site candidate information, in which each set of site names whose pairs of content feature value and path name match is associated with that content feature value and path name, is maintained so that sets of sites that may be duplicate Web sites can be identified at any time, and duplication is determined based on the correspondence between content feature values and path names. This yields a duplicate Web site dynamic detection apparatus that determines Web site duplication dynamically, keeping pace with Web crawling, and enables efficient collection of Web information and removal of unnecessary information.

Preferred embodiments of the duplicate Web site dynamic detection apparatus of the present invention are described below with reference to the drawings.

Embodiment 1.
FIG. 1 is a configuration diagram of the duplicate Web site dynamic detection apparatus according to Embodiment 1 of the present invention. The apparatus of Embodiment 1 comprises an accepting unit 1, a storage unit 2, a determination unit 6, and an inquiry unit 7.

The storage unit 2 stores Web page state information 3, duplicate Web site candidate information 4, and Web site state information 5. The apparatus configured in this way is connected to a Web crawler 8.

First, the function of each component is described.
The accepting unit 1 receives the URL and content of a Web page from the Web crawler 8 and updates the Web page state information 3, the duplicate Web site candidate information 4, and the Web site state information 5 stored in the storage unit 2. When the Web site state information 5 satisfies certain conditions, the accepting unit 1 starts the determination unit 6 by passing it a set of Web site names.

The determination unit 6 receives a set of Web site names from the accepting unit 1 and, referring to the Web page state information 3 stored in the storage unit 2, determines whether that set of Web sites constitutes duplicate Web sites. When it determines that the set does, the determination unit 6 updates the Web site state information 5 stored in the storage unit 2.

The inquiry unit 7 receives a Web site name from the Web crawler 8 and refers to the Web site state information 5 stored in the storage unit 2. If the Web site name is an alias (non-canonical name) of a duplicate Web site, the inquiry unit 7 converts it to the canonical name and returns the canonical Web site name to the Web crawler 8.

Next, the various kinds of information stored in the storage unit 2 are described with reference to FIGS. 2 to 4.
FIG. 2 shows the details of the Web page state information 3 stored in the storage unit 2 in Embodiment 1 of the present invention. The Web page state information 3 consists of a site name 31, a path name 32, and a content hash value 33. The site name 31 and the path name 32 represent the site name (host name) portion and the path name portion of the URL string, respectively.

Each entry of the Web page state information 3 corresponds one-to-one to a Web page and is uniquely identified by the pair of site name 31 and path name 32 corresponding to the URL of the Web page. When a Web page with the same URL is received from the Web crawler 8 more than once, the same entry in the Web page state information 3 is overwritten.

The Web page state information 3 is assumed to be organized so that the multiple entries whose site name 31 has a specified value can be retrieved efficiently. This can be achieved, for example, by using a well-known B-tree to keep the entries sorted by the pair of site name 31 and path name 32.

The content hash value 33 is the result of applying a hash function to the entire content data of the Web page. A suitable hash function is one for which the probability that different content data map to the same hash value is low enough to be ignored in practice; for example, the well-known MD5 or SHA-1 can be used.
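As a rough illustration of how one entry of the Web page state information 3 might be derived, the following Python sketch splits a URL into site name and path name and hashes the content with SHA-1. The function name `page_key` is hypothetical, and ignoring the query string is an assumption of this sketch; the description does not say how query strings are handled.

```python
import hashlib
from urllib.parse import urlsplit


def page_key(url: str, content: bytes) -> tuple[str, str, str]:
    """Return (site name, path name, content hash) for one crawled page."""
    parts = urlsplit(url)
    site_name = parts.hostname or ""   # host name portion of the URL
    path_name = parts.path or "/"      # assumption: the query string is ignored
    # Hash the entire content data; SHA-1 is one of the functions named in the text.
    content_hash = hashlib.sha1(content).hexdigest()
    return site_name, path_name, content_hash
```

Two pages with the same path and the same bytes on different hosts then differ only in the first element of the returned tuple, which is the property the later tables rely on.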

Next, FIG. 3 shows the details of the duplicate Web site candidate information 4 stored in the storage unit 2 in Embodiment 1 of the present invention. The duplicate Web site candidate information 4 consists of a content hash value 41, a path name 42, and a candidate site set 43. Each entry of the duplicate Web site candidate information 4 is uniquely identified by the pair of content hash value 41 and path name 42.

The duplicate Web site candidate information 4 shown in FIG. 3 is obtained by re-sorting the entries of the Web page state information 3 by the pair of content hash value 33 and path name 32 shown in FIG. 2, which become the content hash value 41 and the path name 42, respectively. In addition, the site names 31 of the multiple entries in FIG. 2 that share the same pair of content hash value 33 and path name 32 are gathered together and stored in the candidate site set 43.

Next, FIG. 4 shows the details of the Web site state information 5 stored in the storage unit 2 in Embodiment 1 of the present invention. The Web site state information 5 consists of a site name 51, a number of paths 52, a number of duplicate paths 53, and a canonical name 54. Each entry of the Web site state information 5 is uniquely identified by the site name 51. The number of paths 52 is the number of entries of the Web page state information 3 in FIG. 2 whose site name 31 matches the site name 51 in FIG. 4; it represents the number of Web pages acquired from that Web site.

The number of duplicate paths 53 is the number of entries of the Web page state information 3 in FIG. 2 whose site name 31 matches the site name 51 in FIG. 4 and whose pair of path name 32 and content hash value 33 is not unique within the Web page state information 3 as a whole. In other words, the number of duplicate paths 53 corresponds to the number of path names, among those counted in the number of paths, whose pair of path name and content hash value also exists under a different site name. It equals the number of entries of the duplicate Web site candidate information 4 whose candidate site set 43 contains the site name 51 and also contains at least one site name other than the site name 51.
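The relationship between the three tables can be checked with a small batch computation. The sketch below, in Python, rebuilds the candidate site sets, the number of paths, and the number of duplicate paths from a dictionary of page entries. The name `build_tables` and the in-memory dictionaries are assumptions for illustration; the apparatus itself maintains these values incrementally rather than recomputing them.

```python
def build_tables(pages):
    """pages maps (site name, path name) -> content hash value.

    Returns (candidates, path_count, dup_count), where candidates maps
    (content hash, path name) -> set of site names.
    """
    candidates = {}
    for (site, path), content_hash in pages.items():
        candidates.setdefault((content_hash, path), set()).add(site)

    path_count, dup_count = {}, {}
    for (site, path), content_hash in pages.items():
        path_count[site] = path_count.get(site, 0) + 1
        # A path is a duplicate path if another site has the same (hash, path) pair.
        shared = len(candidates[(content_hash, path)]) >= 2
        dup_count[site] = dup_count.get(site, 0) + (1 if shared else 0)
    return candidates, path_count, dup_count
```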

For a Web site that belongs to a duplicate Web site set, the canonical name 54 is set to the site name of the representative of that set. For the representative Web site itself, and for Web sites that do not belong to any duplicate Web site set, the canonical name 54 is an empty string.

The units and information denoted by reference numerals 1 to 7 can be implemented on a general-purpose computer equipped with a CPU, memory, a magnetic disk device, and a communication interface. In that case, the accepting unit 1, the determination unit 6, and the inquiry unit 7 are implemented as programs executed by the CPU, and the storage unit 2 is implemented as the magnetic disk device.

Next, the sequence of operations of each unit is described using flowcharts.
First, the operation of the accepting unit 1 is described. FIG. 5 is a flowchart showing the details of the operation of the accepting unit 1 in Embodiment 1 of the present invention. In step S51, the accepting unit 1 receives the URL string and content data of a Web page from the Web crawler 8 and extracts the site name and path name from the URL string as the input site name and input path name. The accepting unit 1 also applies a hash function to the entire content data to produce the input content hash value.

Next, in step S52, the accepting unit 1 checks whether an entry of the Web site state information 5 exists for the input site name. If not, the accepting unit 1 inserts a new entry into the Web site state information 5, setting the site name 51 to the input site name, the number of paths 52 and the number of duplicate paths 53 to 0, and the canonical name 54 to an empty string.

Next, in step S53, the accepting unit 1 checks whether an entry of the Web page state information 3 exists for the input site name and input path name. If it exists, the process proceeds to step S54; if not, it proceeds to step S55.

In step S54, the accepting unit 1 compares the content hash value 33 in the entry of the Web page state information 3 for the input site name and input path name with the input content hash value. If the two match, the operation ends; if they do not match, the process proceeds to step S56.

In step S55, reached when no entry was found in step S53, the accepting unit 1 inserts a new entry into the Web page state information 3, setting the site name 31, the path name 32, and the content hash value 33 to the input site name, the input path name, and the input content hash value, respectively. The accepting unit 1 also adds 1 to the number of paths 52 in the entry of the Web site state information 5 for the input site name, and then proceeds to step S58.

In step S56, reached when the hash values differed in step S54, the accepting unit 1 performs the content hash value deletion process described later on the content hash value 33 of the existing entry of the Web page state information 3. In the following step S57, the accepting unit 1 updates the content hash value 33 of that entry to the input content hash value, and then proceeds to step S58.

Next, in step S58, the accepting unit 1 performs the content hash value insertion process described later on the input site name, input path name, and input content hash value, and ends the operation.
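The control flow of steps S52 to S58 can be sketched as follows in Python. This is a minimal sketch under stated assumptions: the tables are plain dictionaries, `SiteState` and `accept` are hypothetical names, and the deletion and insertion processes are passed in as callables (`delete_hash`, `insert_hash`) so that the block stays self-contained. Step S51 (URL splitting and hashing) is assumed to have been done by the caller.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    path_count: int = 0      # number of paths 52
    dup_path_count: int = 0  # number of duplicate paths 53
    canonical: str = ""      # canonical name 54 (empty = none)


def accept(pages, sites, site, path, new_hash, delete_hash, insert_hash):
    """pages: dict (site, path) -> hash; sites: dict site -> SiteState."""
    sites.setdefault(site, SiteState())        # S52: create site entry if missing
    key = (site, path)
    if key in pages:                           # S53: page already known?
        old_hash = pages[key]
        if old_hash == new_hash:               # S54: content unchanged, nothing to do
            return
        delete_hash(site, path, old_hash)      # S56: undo the old hash's bookkeeping
        pages[key] = new_hash                  # S57: overwrite the stored hash
    else:
        pages[key] = new_hash                  # S55: new page entry
        sites[site].path_count += 1
    insert_hash(site, path, new_hash)          # S58: register the new hash
```

Passing the two processes as parameters is only a convenience for this sketch; in the apparatus they are part of the accepting unit itself.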

Next, the detailed operation of the content hash value deletion process in step S56 of FIG. 5 is described. FIG. 6 is a flowchart showing the details of the content hash value deletion process in Embodiment 1 of the present invention.

First, in step S61, the accepting unit 1 retrieves the entry of the duplicate Web site candidate information 4 corresponding to the content hash value to be deleted and the path name to be deleted (equal to the input path name), and branches according to the number of element sites in the candidate site set 43 of that entry. If the number of element sites equals 1, the process proceeds to step S62; if it equals 2, to step S63; otherwise, to step S64.

In step S62, the accepting unit 1 deletes the entry retrieved in step S61 from the duplicate Web site candidate information 4 and ends.

In step S63, the accepting unit 1 decreases by 1 the number of duplicate paths 53 in the entries of the Web site state information 5 for each of the two site names stored in the candidate site set 43 of the entry retrieved in step S61, and proceeds to step S65.

In step S64, the accepting unit 1 decreases by 1 the number of duplicate paths 53 in the entry of the Web site state information 5 for the site name to be deleted (equal to the input site name), and proceeds to step S65.

Then, in step S65, the accepting unit 1 removes the site name to be deleted from the candidate site set 43 of the entry retrieved in step S61 and ends.

Through the operations of steps S61 to S65 described above, the duplicate Web site candidate information 4 and the Web site state information 5 are returned to the state they were in before the content hash value to be deleted was inserted, which completes the content hash value deletion process.
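The three-way branch of steps S61 to S65 can be written compactly. The Python sketch below assumes the candidate information is a dictionary from (content hash, path name) to a set of site names and that the number of duplicate paths is kept in a separate dictionary; `delete_hash` and both parameter layouts are hypothetical.

```python
def delete_hash(candidates, dup_paths, site, path, old_hash):
    """candidates: dict (hash, path) -> set of sites; dup_paths: dict site -> int."""
    key = (old_hash, path)
    members = candidates[key]              # S61: look up the candidate entry
    if len(members) == 1:                  # S62: only this site had the pair
        del candidates[key]
        return
    if len(members) == 2:                  # S63: the remaining site stops being a duplicate too
        for s in members:
            dup_paths[s] -= 1
    else:                                  # S64: others still share the pair with each other
        dup_paths[site] -= 1
    members.discard(site)                  # S65: remove this site from the set
```

The two-site case decrements both counters because, once one of the two sites leaves, the pair is unique again for the site that remains.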

Next, the detailed operation of the content hash value insertion process in step S58 of FIG. 5 is described. FIG. 7 is a flowchart showing the details of the content hash value insertion process in Embodiment 1 of the present invention.

First, in step S71, the accepting unit 1 searches for the entry of the duplicate Web site candidate information 4 corresponding to the content hash value to be inserted (equal to the input content hash value) and the path name to be inserted (equal to the input path name), and branches according to the result. If no entry exists, the process proceeds to step S72; if an entry exists and the number of element sites in its candidate site set 43 equals 1, to step S73; otherwise, to step S74.

In step S72, the accepting unit 1 inserts a new entry into the duplicate Web site candidate information 4, setting the content hash value 41 to the content hash value to be inserted, the path name 42 to the path name to be inserted, and the candidate site set 43 to the site name to be inserted (equal to the input site name), and ends.

In step S73, the accepting unit 1 retrieves the entry of the Web site state information 5 corresponding to the single element site name in the candidate site set 43 of the entry found in step S71, adds 1 to the number of duplicate paths 53 of that entry, and proceeds to step S74.

Next, in step S74, the accepting unit 1 retrieves the entry of the Web site state information 5 corresponding to the site name to be inserted, adds 1 to the number of duplicate paths 53 of that entry, and adds the site name to be inserted to the candidate site set 43 of the entry found in step S71.

Then, in step S75, the accepting unit 1 retrieves the entries of the Web site state information 5 corresponding to each element site name of the candidate site set 43, obtains for each entry the number of duplicate paths 53 and the ratio of the number of duplicate paths 53 to the number of paths 52, and compares each with its threshold. If both values are at or above their thresholds in every entry, the process proceeds to step S76; otherwise it ends. In this determination, however, entries whose canonical name 54 is not an empty string are ignored, and the process ends if at most one entry has an empty canonical name 54.

In step S76, the accepting unit 1 starts the determination unit 6 by passing it those site names in the candidate site set 43 whose corresponding canonical name 54 in the Web site state information 5 is an empty string, and then ends.

The thresholds in step S75 are, for example, 3 for the number of duplicate paths 53 and 0.4 for the ratio of the number of duplicate paths 53 to the number of paths 52. The purpose of step S75 is to avoid running the determination on candidate site sets that the determination unit 6 would clearly judge not to be duplicate Web sites.
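Steps S71 to S76 can be sketched in Python as a single function that updates the tables and returns the site names to hand to the determination unit, or `None` when no determination should be started. The function name `insert_hash`, the dictionary layouts, and the return convention are assumptions of this sketch; the default thresholds 3 and 0.4 are the example values given in the text.

```python
def insert_hash(candidates, path_counts, dup_paths, canonical,
                site, path, new_hash, min_dup=3, min_ratio=0.4):
    """Return the set of sites to pass to the determination step, or None."""
    key = (new_hash, path)
    members = candidates.get(key)                  # S71: look up the candidate entry
    if members is None:                            # S72: first site with this pair
        candidates[key] = {site}
        return None
    if len(members) == 1:                          # S73: the lone site becomes a duplicate
        (only,) = members
        dup_paths[only] += 1
    dup_paths[site] += 1                           # S74: count and register this site
    members.add(site)

    # S75: ignore sites that already have a canonical name
    pending = [s for s in members if not canonical.get(s)]
    if len(pending) <= 1:
        return None
    for s in pending:
        if dup_paths[s] < min_dup or dup_paths[s] / path_counts[s] < min_ratio:
            return None
    return set(pending)                            # S76: trigger the determination
```

Because the pre-check only reads per-site counters, its cost is proportional to the size of one candidate site set, which is what lets it run on every crawled page.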

Next, the operation of the determination unit 6 is described. FIG. 8 is a flowchart showing the details of the operation of the determination unit 6 in Embodiment 1 of the present invention. This flowchart is started from step S76 of FIG. 7. In step S81, the determination unit 6 takes the entries of the Web page state information 3 whose site name 31 matches any of the site names in the received candidate site set, and counts the frequency of occurrence of each pair of path name 32 and content hash value 33.

In the following step S82, the determination unit 6 counts the number of distinct path names and the number of distinct path names to which two or more different content hash values correspond (misses), and calculates the ratio of the latter to the former as the miss rate. If the miss rate is at or above a first predetermined threshold, the determination unit 6 sets the determination result to false and ends the process. If the miss rate is below the first predetermined threshold, the process proceeds to step S83.

Next, in step S83, the determination unit 6 counts, among the path names to which exactly one content hash value corresponds, those whose frequency of occurrence (number of sites) is 2 or more and is also at least a fixed proportion of the number of elements in the candidate site set (hits), and takes the total as the number of hits. If the number of hits is below a second predetermined threshold, the determination unit 6 sets the determination result to false and ends the process.

In the determination of step S83, the determination unit 6 may alternatively treat a path name to which one content hash value corresponds as a hit on the sole condition that its frequency of occurrence (number of sites) is at least a predetermined number.

Likewise, if the ratio of the number of hits to the number of distinct path names (the hit rate) is below a third predetermined threshold, the determination unit 6 sets the determination result to false and ends the process. If the number of hits is at or above the second predetermined threshold and the hit rate is at or above the third predetermined threshold, the determination unit 6 sets the determination result to true and proceeds to step S84.

The second predetermined threshold for the number of hits and the third predetermined threshold for the hit rate correspond to the threshold for the number of duplicate paths and the threshold for the ratio of the number of duplicate paths to the number of paths in step S75 of FIG. 7, and the same values are used for each (for example, 3 and 0.4, as in the description of step S75).
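The hit and miss counting of steps S81 to S83 can be sketched in Python as below. The function name `is_duplicate_set` is hypothetical. The values 3 and 0.4 are the example thresholds from the text; the first threshold (`max_miss_rate=0.2`) and the fixed proportion of sites (`min_site_share=0.5`) are not specified in the description and are placeholders chosen only for this sketch.

```python
from collections import defaultdict


def is_duplicate_set(pages, site_set, max_miss_rate=0.2,
                     min_hits=3, min_hit_rate=0.4, min_site_share=0.5):
    """pages: dict (site, path) -> hash. True if site_set looks like mirrors."""
    # S81: per path, count how many sites in the set carry each hash value.
    freq = defaultdict(lambda: defaultdict(int))
    for (site, path), content_hash in pages.items():
        if site in site_set:
            freq[path][content_hash] += 1
    n_paths = len(freq)
    if n_paths == 0:
        return False

    # S82: a miss is a path with two or more different hash values.
    misses = sum(1 for hashes in freq.values() if len(hashes) >= 2)
    if misses / n_paths >= max_miss_rate:
        return False

    # S83: a hit is a single-hash path seen on enough of the sites.
    need = max(2, min_site_share * len(site_set))
    hits = sum(1 for hashes in freq.values()
               if len(hashes) == 1 and next(iter(hashes.values())) >= need)
    return hits >= min_hits and hits / n_paths >= min_hit_rate
```

Step S84, which records the canonical name once the result is true, is left to the caller in this sketch.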

Next, in step S84, the determination means 6 selects one site name from the candidate site set as the representative, sets the representative's site name in the canonical name 54 of each entry of the Web site state information 5 that corresponds to a site name other than the representative, and ends the process.

When the representative is selected from the candidate site set, a score is assigned to the site name of each element, and the site name that ranks highest (that is, has the smallest value) is selected. FIG. 9 shows an example of scores for site names in Embodiment 1 of the present invention. As shown in FIG. 9, a site name score formula, defined in advance in terms of the site name string length and the number of domain levels, is prepared for each string pattern of site name, and these formulas are used to select the canonical site name.

More specifically, the whole site name is matched against the string patterns in order of application priority, and the site name score formula of the first matching row is used. The number of domain levels is the number of domain name elements separated by ".". Among site names with the same score, the one that comes first in string order takes precedence.
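The selection rule above can be sketched as follows. The actual rows of FIG. 9 are not reproduced in this text, so the patterns and formulas in `RULES` are purely hypothetical placeholders; only the mechanics (first matching pattern in priority order, smallest score wins, ties broken by string order) follow the description.

```python
import re

# Hypothetical rule table in priority order: (pattern, score formula).
# These rows are illustrative assumptions, not the contents of FIG. 9.
RULES = [
    (re.compile(r"^www\.[^.]+\.[^.]+$"), lambda length, levels: levels),
    (re.compile(r"^www\."), lambda length, levels: 10 * levels + length),
    (re.compile(r""), lambda length, levels: 100 * levels + length),
]


def site_score(site):
    """Score a site name with the formula of the first matching pattern."""
    levels = site.count(".") + 1  # domain name elements separated by "."
    for pattern, formula in RULES:
        if pattern.search(site):
            return formula(len(site), levels)
    raise ValueError(f"no rule matches {site!r}")


def choose_representative(sites):
    """Smallest score wins; equal scores fall back to string order."""
    return min(sites, key=lambda s: (site_score(s), s))
```

Sorting on the pair `(score, site name)` expresses the tie-break directly, so no separate comparison step is needed.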

FIG. 10 is a conceptual diagram showing the meaning of the operation of the determination means 6 in Embodiment 1 of the present invention. As shown in FIG. 10, the series of operations in the flowchart of FIG. 8 is equivalent to arranging the content hash values 33 in a matrix whose rows correspond to the site names 31 and whose columns correspond to the path names 32, judging each column to be a hit or a miss, and counting the hits and the misses.

A cell whose content hash value is N/A indicates that the Web page corresponding to that site name and path name has not been received from the Web crawler 8. Because the Web crawler 8 generally collects Web pages by following links between them, it may never access a Web page even though the page exists.

In the example of FIG. 10, all content hash values for the path name "/" are equal, giving kinds = 1 and appearance frequency = 3, so this path name is a hit. For the path name "/links.html", no values exist for the sites aaa.bbb.ccc and zzz.www.aaa, giving kinds = 1 and appearance frequency = 1, so this path name is regarded as neither a hit nor a miss. For the path name "/news.html", the number of kinds of content hash value is 2, so this path name is a miss.
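The column-wise judgment of FIG. 10 can be illustrated with a minimal sketch. The function names, the `min_ratio` value, and the shape of the input tuples are assumptions made for illustration; the thresholds 3 and 0.4 are the example values given for step S83.

```python
from collections import defaultdict


def classify_paths(entries, n_sites, min_ratio=0.5):
    """Classify each path name (matrix column) as hit, miss, or neither.

    entries: (site, path, content_hash) tuples for the candidate site set;
    pages the crawler has not delivered (N/A cells) are simply absent.
    min_ratio is an assumed value for the "fixed proportion" of sites.
    """
    hashes_by_path = defaultdict(list)
    for _site, path, digest in entries:
        hashes_by_path[path].append(digest)

    result = {}
    for path, digests in hashes_by_path.items():
        kinds, frequency = len(set(digests)), len(digests)
        if kinds >= 2:
            result[path] = "miss"
        elif frequency >= 2 and frequency >= min_ratio * n_sites:
            result[path] = "hit"
        else:
            result[path] = "neither"
    return result


def passes_step_s83(result, min_hits=3, min_hit_rate=0.4):
    """Both the number of hits and the hit rate must reach their thresholds."""
    hits = sum(1 for verdict in result.values() if verdict == "hit")
    return bool(result) and hits >= min_hits and hits / len(result) >= min_hit_rate
```

With three sites where "/" has the same hash everywhere, "/links.html" was fetched from only one site, and "/news.html" has two different hashes, `classify_paths` reproduces the hit, neither, and miss verdicts described for FIG. 10.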

Next, the operation of the inquiry means 7 is described. FIG. 11 is a flowchart showing details of the operation of the inquiry means 7 in Embodiment 1 of the present invention. First, in step S111, the inquiry means 7 receives an inquiry from the Web crawler 8 and sets the queried site name as the result site name.

Next, in step S112, the inquiry means 7 searches for the entry of the Web site state information 5 that corresponds to the result site name. If no entry exists, or if the canonical name 54 of the entry is an empty string, the process proceeds to step S114. If an entry exists and its canonical name 54 is not an empty string, the inquiry means 7 determines that the result site name is a non-representative member of a duplicate Web site set and must be converted, and proceeds to step S113.

In step S113, the inquiry means 7 sets the canonical name 54 of the entry as the result site name and returns to step S112. Step S112 is repeated because the representative of one duplicate Web site set may later be determined to be a non-representative member of another duplicate Web site set.

If the determination in step S112 leads to step S114, the inquiry means 7 returns the value of the result site name to the Web crawler 8 and ends the process.
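Steps S111 to S114 amount to following canonical-name links until an entry with no canonical name is reached. The sketch below assumes the Web site state information is available as a plain dictionary from site name to canonical name (empty string when none is set); the cycle guard is an added safety measure that the description does not mention.

```python
def resolve_site_name(site, canonical_names):
    """Follow canonical-name links until a representative is reached.

    canonical_names: dict mapping site name -> canonical name ("" if unset),
    a simplified stand-in for the canonical name 54 field.
    """
    result = site          # S111: start from the queried site name
    seen = {result}
    while True:
        canonical = canonical_names.get(result, "")
        if not canonical:  # S112: no entry or empty canonical name
            return result  # S114: return the result site name
        if canonical in seen:
            return result  # assumed guard against accidental cycles
        seen.add(canonical)
        result = canonical  # S113: replace and look up again
```

A crawler would call this on the host part of each extracted URL and substitute the returned name before scheduling the download.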

When the Web crawler 8 extracts links from a Web page, or before it starts downloading a Web page, it takes the site name from each URL, passes it to the inquiry means 7, and replaces the original site name with the result. In this way it can avoid accessing non-representative members of duplicate Web sites.

As described above, Embodiment 1 provides reception means that sequentially receives Web pages from the Web crawler. It maintains the duplicate Web site candidate information so that sets of sites that may be duplicate Web sites can be identified at any time, and it maintains the Web site state information so that the timing at which the determination means performs duplicate determination can be controlled. As a result, dynamic duplicate Web site detection that keeps pace with Web crawling can be realized.

Embodiment 1 further provides inquiry means for querying the duplicate Web site detection results. As a result, the Web crawler is given a means of avoiding the collection of Web pages from sites other than the representative of a duplicate Web site set.

In Embodiment 1 described above, the content hash value, which is one example of a content feature amount, is the value obtained by applying a one-way hash function to the entire content data, but other calculation methods may be used. For example, when the content is written in HTML, the content hash value may be the value obtained by applying a one-way hash function to the text data that remains after HTML tags, comments, scripts, and styles are removed; this does not affect the overall configuration or operation. Using such a content hash value makes it possible to ignore variable elements in a Web page, such as advertisements, so that more duplicate Web sites can be detected.
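A minimal sketch of this alternative feature amount is shown below, using Python's standard `html.parser`. The choice of SHA-256 and the whitespace normalization are assumptions; the description only requires some one-way hash function applied to the text left after tags, comments, scripts, and styles are removed.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if self._skip_depth == 0:
            self.parts.append(data)


def text_only_hash(html):
    """Hash only the visible text of an HTML document (SHA-256 is an assumption)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # assumed whitespace normalization
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two pages that differ only in an embedded script or comment then yield the same value, while a change in the visible text yields a different one.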

Embodiment 1 described the case where the canonical name of a site name is stored and managed as Web site state information together with the number of paths and the number of duplicate paths. However, the canonical site name can also be obtained through the inquiry means when the association between the canonical site name and the remaining site names is managed separately from the path counts, as Web site name related information.

Embodiment 2.
In Embodiment 1, the candidate site set 43 of the duplicate Web site candidate information 4 is maintained for all input path names, so that duplicate Web sites are not missed even when the collected range is uneven across Web sites. On the other hand, the data volume of the duplicate Web site candidate information 4 becomes large and the update load becomes high.

Embodiment 2 therefore describes the case where only some input path names, rather than all of them, are reflected in the duplicate Web site candidate information 4 and the Web site state information 5. More specifically, when the input path name does not match a specific pattern, the reception means 1 does not execute the latter half of step S55, step S56, or step S58 of FIG. 5.

As the specific pattern for input path names, for example, path names that contain "index" as a substring are matched. The candidate site set 43 is then maintained only for URLs that commonly serve as the entrance to a Web site, which greatly reduces the data volume of the duplicate Web site candidate information 4. Because the Web page state information 3 still contains information on all path names, the operation of the determination means 6 is not affected.

As another example, the specific pattern may match path names that contain exactly one "/". Only path names at the top level of the directory hierarchy are then reflected. In general, links in Web pages tend to point to path names high in the directory hierarchy, so this prevents candidate site sets from being missed as a result of assuming a specific string.

As a way of reducing the update load, besides restricting input path names to a specific pattern, an implementation based on the value of the number of paths 52 in the Web site state information 5 is also possible. That is, once the number of paths 52 has reached a fixed value, the latter half of step S55, step S56, and step S58 are not executed, regardless of the input path name. This amounts to limiting input path names by order of arrival rather than by value, and it resolves the problem of missed detections that a pattern-based scheme cannot avoid.
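The three restriction schemes described in this embodiment can be expressed as a single predicate that decides whether an incoming path name should update the candidate information. The function name, the `mode` labels, and the default cap of 100 paths are hypothetical; the description does not give a concrete value for the cap.

```python
def should_update_candidates(path, path_count, mode="index", max_paths=100):
    """Decide whether a path name is reflected in the candidate information.

    path:       input path name, for example "/index.html"
    path_count: current number of paths 52 recorded for the site
    mode:       "index"      -> path contains the substring "index"
                "top"        -> path contains exactly one "/"
                "first_come" -> accept until path_count reaches max_paths
    max_paths is an assumed cap, not a value taken from the description.
    """
    if mode == "index":
        return "index" in path
    if mode == "top":
        return path.count("/") == 1
    if mode == "first_come":
        return path_count < max_paths
    raise ValueError(f"unknown mode: {mode}")
```

When the predicate returns false, the page would still be recorded in the Web page state information, so later duplicate determination continues to see every path name.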

As described above, according to Embodiment 2, only input path names that match a specific pattern are reflected in the duplicate Web site candidate information 4 and the Web site state information 5. As a result, the candidate site set is maintained only for common URLs, and the data volume of the duplicate Web site candidate information can be greatly reduced.

Furthermore, instead of using a specific pattern, input path names can be reflected in the duplicate Web site candidate information 4 and the Web site state information 5 only until the corresponding number of paths reaches a fixed value. As a result, input path names are limited by order of arrival rather than by value, which resolves the problem of missed detections that a scheme based on a specific pattern cannot avoid.

Embodiment 3.
Embodiment 3 describes the case where, in addition to Embodiment 1, the similarity of site names in the candidate site set is taken into account: for a candidate site set made up of similar site names, the tolerance for mismatched content hash values is widened so that the set is more easily detected as a duplicate Web site set.

FIG. 12 is a flowchart showing details of the operation of the determination means 6 in Embodiment 3 of the present invention. In Embodiment 3, the determination means 6 performs the operation shown in the flowchart of FIG. 12 before step S81 of FIG. 8 described in Embodiment 1.

First, in step S121, the determination means 6 obtains the number of domain levels (the number of components separated by ".") for each site name in the candidate site set and takes the minimum as the minimum number of domain levels. For example, when the site name is xxx.yyy.zzz, the number of domain levels is 3.

Next, in step S122, the determination means 6 determines, for each component of the site names (for example, each of "xxx", "yyy", and "zzz"), how many site names contain it, counts the components that are contained in at least a fixed proportion of the site names in the candidate site set, and takes that count as the number of frequent domain levels.

Next, in step S123, the determination means 6 compares the number of frequent domain levels with the minimum number of domain levels and ends the preprocessing if the number of frequent domain levels is smaller. If the number of frequent domain levels is equal to or greater than the minimum number of domain levels, the process proceeds to step S124. In step S124, the determination means 6 multiplies the miss rate threshold by a coefficient proportional to the logarithm of the number of elements in the candidate site set, and multiplies the thresholds for the number of hits and the hit rate by the reciprocal of that coefficient. The coefficient may be, for example, log(number of sites) × 4.5.
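Steps S121 to S124 can be sketched as a function that returns the adjusted thresholds. Several details are assumptions: the base of the logarithm (base 10 here), the value of the "fixed proportion" (`frequent_ratio`), and the guard that leaves the thresholds unchanged when the coefficient is not positive. Only the factor 4.5 is taken from the example in the description.

```python
import math
from collections import Counter


def relaxed_thresholds(sites, miss_rate_max, hit_count_min, hit_rate_min,
                       frequent_ratio=0.8, scale=4.5):
    """Return (miss_rate_max, hit_count_min, hit_rate_min), relaxed if site names are similar."""
    components = [site.split(".") for site in sites]

    # S121: minimum number of domain levels across the candidate site set.
    min_levels = min(len(parts) for parts in components)

    # S122: count components that occur in enough of the site names.
    occurrences = Counter(c for parts in components for c in set(parts))
    frequent_levels = sum(
        1 for n in occurrences.values() if n >= frequent_ratio * len(sites)
    )

    # S123: leave thresholds unchanged when the names are not similar enough.
    if frequent_levels < min_levels:
        return miss_rate_max, hit_count_min, hit_rate_min

    # S124: coefficient proportional to log(number of sites); base 10 is assumed.
    k = math.log10(len(sites)) * scale
    if k <= 0:
        return miss_rate_max, hit_count_min, hit_rate_min
    return miss_rate_max * k, hit_count_min / k, hit_rate_min / k
```

For a set such as example.com, www.example.com, and www2.example.com, both "example" and "com" occur in every name, which matches the minimum of two domain levels, so the miss rate threshold grows and the hit thresholds shrink.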

By performing the series of preprocessing steps shown in FIG. 12, the determination means 6 can also use information on the similarity of site names in the duplicate determination. In particular, when the site names in the candidate site set are highly similar, widening the threshold ranges so that duplication is more readily determined prevents large duplicate site sets from being missed.

As described above, according to Embodiment 3, information on the similarity of site names is also used, and the thresholds used for duplicate determination can be changed according to that similarity. As a result, large duplicate site sets with highly similar names, in particular, are less likely to be missed.

Embodiment 4.
Embodiment 4 describes the case where, in addition to Embodiment 1, detection is also possible when Web sites are duplicated at the directory level.

In Embodiment 4, when the reception means 1 generates the input site name and input path name from the received URL in step S51 of FIG. 5 described in Embodiment 1, it generates not only the original site name and path name but also a pseudo site name, formed by appending the upper directory name of the path name to the site name, and a pseudo path name, formed from the remainder of the path name after the upper directory name.

That is, for the URL "http://aaa.bbb.ccc/~user1/diary.html", it generates not only the site name "aaa.bbb.ccc" and the path name "/~user1/diary.html" but also the pseudo site name "aaa.bbb.ccc/~user1" and the pseudo path name "/diary.html". All other operations are the same as in Embodiment 1.
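The generation of pseudo names can be sketched with `urllib.parse.urlsplit`. The function name is hypothetical, and treating only the first directory component as the upper directory is an assumption based on the single-level example in the description.

```python
from urllib.parse import urlsplit


def site_and_path_pairs(url):
    """Return the (site, path) pair for a URL plus a pseudo pair if it has a directory.

    Only the first directory component is promoted into the pseudo site name;
    how deeper hierarchies are handled is not specified, so this is an assumption.
    """
    parts = urlsplit(url)
    site = parts.hostname or ""
    path = parts.path or "/"
    pairs = [(site, path)]

    segments = path.split("/")  # "/~user1/diary.html" -> ["", "~user1", "diary.html"]
    if len(segments) >= 3 and segments[1]:
        pseudo_site = f"{site}/{segments[1]}"
        pseudo_path = "/" + "/".join(segments[2:])
        pairs.append((pseudo_site, pseudo_path))
    return pairs
```

Each returned pair would then be fed through the same reception steps as an ordinary site name and path name.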

By also taking pseudo site names and pseudo path names into account in the duplicate determination, duplication can be detected even when only part of a Web site is duplicated, which prevents a loss of Web crawling efficiency.

As described above, according to Embodiment 4, when the input site name and input path name are generated from the received URL, a pseudo site name and a pseudo path name are generated as well and used in the duplicate determination. As a result, duplication can be detected even when only part of a Web site is duplicated.

Embodiment 5.
Embodiment 5 describes the case where path names that are likely to cause erroneous determinations are not used for duplicate determination.

FIG. 13 shows the excluded path name information 9 that is newly stored in the storage unit 2 in Embodiment 5 of the present invention. The excluded path name information 9 is a set of excluded path names 91.

When the determination means 6 searches the entries of the Web page state information 3 in step S81 of FIG. 8 described in Embodiment 1, it ignores entries whose path name 32 matches an excluded path name 91.

In addition, when the determination result is false in step S82 or step S83 of FIG. 8, the determination means 6 finds, among the path names that were hits, those that appear under at least a certain threshold number of site names (corresponding to the second predetermined range) and adds them to the excluded path name information 9.

Providing the excluded path name information 9 in this way eliminates the influence of path names that can be shared even by entirely unrelated Web sites (for example, the manual pages of Web server software) and prevents such sites from being erroneously determined to be duplicate sites.
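The maintenance and use of the excluded path names can be sketched as two small helpers. The function names, the data shapes, and the default threshold of 10 sites are hypothetical; the description only says that hit path names appearing under at least a certain threshold number of sites are added after a failed determination.

```python
def record_excluded_paths(hit_site_counts, excluded_paths, min_sites=10):
    """After a false determination, remember widely shared hit path names.

    hit_site_counts: dict mapping each hit path name -> number of sites it appears under
    excluded_paths:  set standing in for the excluded path name information 9
    min_sites:       assumed threshold; no concrete value is given in the description
    """
    for path, site_count in hit_site_counts.items():
        if site_count >= min_sites:
            excluded_paths.add(path)


def filter_entries(entries, excluded_paths):
    """Drop (site, path, content_hash) entries whose path name is excluded (step S81)."""
    return [entry for entry in entries if entry[1] not in excluded_paths]
```

A path such as a bundled server manual page would be recorded once and then ignored in every later determination.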

As described above, according to Embodiment 5, the duplicate determination process can take excluded path names into account. As a result, entirely unrelated Web sites that happen to share path names are not erroneously determined to be duplicate sites, and determination accuracy is improved.

FIG. 1 is a block diagram of the duplicate Web site dynamic detection apparatus in Embodiment 1 of the present invention.
FIG. 2 shows details of the Web page state information stored in the storage unit in Embodiment 1 of the present invention.
FIG. 3 shows details of the duplicate Web site candidate information stored in the storage unit in Embodiment 1 of the present invention.
FIG. 4 shows details of the Web site state information stored in the storage unit in Embodiment 1 of the present invention.
FIG. 5 is a flowchart showing details of the operation of the reception means in Embodiment 1 of the present invention.
FIG. 6 is a flowchart showing details of the content hash value deletion process in Embodiment 1 of the present invention.
FIG. 7 is a flowchart showing details of the content hash value insertion process in Embodiment 1 of the present invention.
FIG. 8 is a flowchart showing details of the operation of the determination means in Embodiment 1 of the present invention.
FIG. 9 shows an example of scores for site names in Embodiment 1 of the present invention.
FIG. 10 is a conceptual diagram showing the meaning of the operation of the determination means in Embodiment 1 of the present invention.
FIG. 11 is a flowchart showing details of the operation of the inquiry means in Embodiment 1 of the present invention.
FIG. 12 is a flowchart showing details of the operation of the determination means in Embodiment 3 of the present invention.
FIG. 13 shows the excluded path name information newly stored in the storage unit in Embodiment 5 of the present invention.

Explanation of symbols

1 reception means, 2 storage unit, 3 Web page state information, 31 site name, 32 path name, 33 content hash value (content feature amount), 4 duplicate Web site candidate information, 41 content hash value, 42 path name, 43 candidate site set, 5 Web site state information, 51 site name, 52 number of paths, 53 number of duplicate paths, 54 canonical name, 6 determination means, 7 inquiry means, 8 Web crawler, 9 excluded path name information, 91 excluded path name.

Claims

A duplicate Web site dynamic detection apparatus comprising: reception means that receives the URL and content of a Web page discovered during Web crawling, stores the site name and path name extracted from the URL in a storage unit as Web page state information in association with a content feature amount calculated from the content, and stores in the storage unit, as duplicate Web site candidate information, a set of site names for which the combination of the content feature amount and the path name matches, in association with that content feature amount and path name; and
determination means that, from all the path names and content feature amounts corresponding to each site name included in the set of site names, obtains the number of path names having only one content feature amount across a plurality of site names (number of hits) and the number of path names having a plurality of content feature amounts across a plurality of site names (number of misses), and detects the set of site names as a duplicate Web site set when the number of hits and the number of misses are within a predetermined range,
wherein the reception means further stores in the storage unit, as Web site state information, each site name in association with a number of paths, which is the number of different path names having that site name, and a number of duplicate paths, which is the number of path names, among those counted in the number of paths, whose combination of path name and content feature amount also exists under a different site name, and
the determination means performs the detection of the duplicate Web site set when, for each site name included in the set of site names, both the number of paths and the ratio of the number of duplicate paths to the number of paths are within a predetermined range.

In the duplicate Web site dynamic detection device according to claim 1,
The determination means selects one site name included in the detected duplicate Web site set as a regular site name and further stores the remaining site names in the storage unit as Web site name related information in association with the regular site name, and
the apparatus further comprises inquiry means that receives the site name of a Web page discovered during Web crawling and, when a regular site name associated with that site name is stored in the Web site name related information in the storage unit, outputs the regular site name corresponding to that site name.

In the duplicate Web site dynamic detection device according to claim 2,
As preprocessing for detecting the duplicate Web site set, when a regular site name corresponding to a site name included in the set of site names is stored in the Web site name related information in the storage unit, the determination means removes the site name corresponding to that regular site name from the set of site names.

In the duplicate Web site dynamic detection device according to claim 2 or 3,
The determination means selects, as the regular site name, the site name with the highest score among the site names included in the duplicate Web site set, according to a score calculation formula defined in advance in terms of the character string pattern of the site name and the number of domain levels.

The duplicate Web site dynamic detection device according to any one of claims 1 to 4,
The determination means counts, as the number of hits, the number of path names having only one content feature amount across a number of site names that is equal to or greater than a predetermined ratio of the number of all site names.

In the duplicate Web site dynamic detection device according to any one of claims 1 to 5,
The duplicate Web site dynamic detection apparatus, wherein the reception unit calculates the content feature amount by applying a one-way hash function to the received content data.

In the duplicate Web site dynamic detection device according to any one of claims 1 to 5,
The reception means calculates the content feature amount by applying a one-way hash function to the data that remains after HTML tags, HTML comments, scripts, and styles are removed from the received content data.

In the duplicate Web site dynamic detection device according to claim 1,
When the path name does not match a predetermined character string pattern, the reception means does not store the path name as the duplicate Web site candidate information in the storage unit and excludes it from the counting of the number of paths and the number of duplicate paths.

In the duplicate Web site dynamic detection device according to claim 1,
When the site name of a Web page discovered during Web crawling matches a site name whose number of paths has reached a predetermined value, the reception means does not store that site name as an element of the set of site names in the duplicate Web site candidate information for the corresponding path name, and excludes that path name from the counting of the number of paths and the number of duplicate paths for that site name.

In the duplicate Web site dynamic detection device according to any one of claims 1 to 9,
The determination means divides each site name included in the set of site names into components and determines the similarity of the site names, and, when it determines that the set includes site names whose similarity is equal to or greater than a predetermined value, changes the predetermined range for the number of hits and the number of misses used in detecting the duplicate Web site set so that site names with high similarity are more easily detected as duplicate Web sites.

In the duplicate Web site dynamic detection device according to any one of claims 1 to 10,
When the reception means extracts a site name and a path name from the received URL, it also generates a pseudo site name, obtained by concatenating an upper directory name of the path name to the site name, and a pseudo path name, consisting of the remaining part of the path name that is not included in the upper directory name, and includes the pseudo site name and the pseudo path name among the site names and path names, respectively.

In the duplicate Web site dynamic detection device according to any one of claims 1 to 11,
When the number of hits obtained for a set of site names that was not detected as a duplicate Web site set is within a second predetermined range, the determination means stores the path names included in that set of site names as excluded path names, and, when the path name of a Web page discovered during Web crawling matches a path name stored as an excluded path name, does not refer to that path name when detecting duplicate Web sites.