JP4610360B2

JP4610360B2 - Duplicate website detection device

Info

Publication number: JP4610360B2
Application number: JP2005026743A
Authority: JP
Inventors: 孝之田村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-02-02
Filing date: 2005-02-02
Publication date: 2011-01-12
Anticipated expiration: 2025-02-02
Also published as: JP2006215735A

Description

The present invention relates to a duplicate Web site detection apparatus used to detect duplicate Web sites in order to improve the efficiency of collecting Web information.

A duplicate Web site is a set of Web sites that have identical content and differ only in the site name (host name) portion of the URL (Uniform Resource Locator). Such sites range from physical copies made for load distribution or backup to sites made to look like many logically distinct sites by registering multiple host names in the DNS (Domain Name System) in order to manipulate search engine rankings.

Detecting duplicate sites is expected to reduce duplication in search engine results and to improve both the efficiency of information collection by Web crawling and the cache hit rate of proxy servers and browsers.

For two Web sites with different names, a conventional duplicate site (mirror site) detection apparatus numerically evaluates one or more of the following indicators and, when the similarity or commonality is higher than a reference value, determines that the sites are duplicate sites whose contents are equivalent to each other (see, for example, Patent Document 1 and Patent Document 2):
1) similarity of the URL character strings of the Web pages belonging to the Web sites
2) similarity of the IP addresses of the Web sites
3) commonality of the URLs linked from the Web pages belonging to the Web sites
4) commonality of the site name portions of the URLs linked from the Web pages belonging to the Web sites

In the conventional method, duplicate sites having three or more names are handled by dividing the Web sites into pairs and repeating the determination. For example, if site A and site B are duplicate sites, and site A and site C are duplicate sites, then site B and site C are also determined to be duplicate sites, and sites A, B, and C are detected as a single duplicate site set.

Patent Document 1: JP 2002-73607 A (page 1, FIG. 1)
Patent Document 2: US Pat. No. 6,487,555 B1 (FIG. 4)

However, the prior art has the following problem. The conventional duplicate Web site detection apparatus generalizes to three or more sites by applying transitivity to the determination results for pairs of sites, so errors tend to occur when the determination is made for a large number of sites, such as tens to hundreds. That is, duplicate sites are not required to be completely equivalent; a certain amount of error is tolerated so that changes to a Web site during information collection can be accommodated, and therefore transitivity does not strictly hold.

If a large tolerance is allowed so that many duplicate sites can be detected, the cumulative error when the method is applied to a large number of sites grows even larger, and sites that are not duplicates end up being regarded as duplicates. Conversely, if the tolerance is set small to avoid this problem, duplicate sites that have partially changed can no longer be detected. While large-scale duplicate sites created by multiple registration of host names exist, services that provide individual Web space to many users under a common design (non-duplicate sites) are also widespread, so determination over a large number of sites has become a very common problem.
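The failure of transitivity under a tolerance can be illustrated with a small sketch. The Jaccard similarity, the 0.6 threshold, and the page sets below are assumptions chosen purely for illustration; they are not the indicators used by the cited prior art.

```python
def similar(a: set, b: set, threshold: float = 0.6) -> bool:
    """Pairwise test: two sites count as duplicates if their page sets overlap enough."""
    return len(a & b) / len(a | b) >= threshold

# Hypothetical sites, each represented by the set of page identifiers it serves.
site_a = {1, 2, 3, 4, 5}
site_b = {2, 3, 4, 5, 6}
site_c = {3, 4, 5, 6, 7}

print(similar(site_a, site_b))  # True  (4 shared out of 6)
print(similar(site_b, site_c))  # True  (4 shared out of 6)
print(similar(site_a, site_c))  # False (3 shared out of 7)
```

Chaining the two positive results would place all three sites in one set even though the first and third fail the pairwise test, which is the kind of error accumulation described above.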

The present invention has been made to solve the problems described above, and its object is to provide a duplicate Web site detection apparatus that improves the accuracy of duplication determination for a large number of sites and enables Web information to be grasped and used accurately.

The duplicate Web site detection apparatus according to the present invention comprises: preprocessing means that extracts, from Web page information, the content hash value, site name, and intra-site path corresponding to the URL of each site and generates site information; and duplicate site set detection means that, based on the generated site information, constructs a matrix whose rows are indexed by site name, whose columns are indexed by intra-site path, and whose elements are content hash values, and detects as a duplicate site set a plurality of rows for which the number of columns with matching content hash values is equal to or greater than a first predetermined value and the number of columns with non-matching content hash values is less than a second predetermined value.

According to the present invention, a matrix is constructed for the URLs of the sites with site names as rows, intra-site paths as column indices, and content hash values as elements, and a plurality of rows whose degree of matching, based on the content hash values in the column direction, is higher than a predetermined value are detected as a duplicate site set. This yields a duplicate Web site detection apparatus that improves the accuracy of duplication determination for a large number of sites and enables Web information to be grasped and used accurately.

Preferred embodiments of the duplicate Web site detection apparatus of the present invention are described below with reference to the drawings. The apparatus is characterized in that it detects a duplicate site set in a single pass based on the content hash values corresponding to a large number of sites, which makes it possible to improve the accuracy of duplication determination for a large number of sites.

Embodiment 1.
FIG. 1 is a configuration diagram of a duplicate Web site detection apparatus according to Embodiment 1 of the present invention. In FIG. 1, the preprocessing means 1 converts the data format of the input Web page content 7 into the format required for processing by the subsequent means. The first sorting means 2 rearranges the results of the preprocessing means 1. The tag assigning means 3 assigns a tag to the data corresponding to each Web page based on the result of the first sorting means 2.

The second sorting means 4 sorts the tagged data. The counting means 5 performs aggregation on the data having the same tag. The determination means 6 then makes a determination based on the result of the counting means 5 and outputs the duplicate site name list 8. Here, the first sorting means 2, the tag assigning means 3, the second sorting means 4, the counting means 5, and the determination means 6 together constitute the duplicate site set detection means 10.

Each of these means can be realized by hardware having its own arithmetic unit and storage device, or they can be executed sequentially by a computer having a single arithmetic unit and storage device.

Next, the operation of each of these means is described in detail. First, the preprocessing means 1 receives pairs consisting of the URL character string of a Web page and the character string representing its content, extracts the site name and the intra-site path from the URL character string, applies a hash function to the entire character string representing the content of the Web page to convert it into a hash value, and outputs the site name, intra-site path, and content hash value for each Web page.

FIG. 2 is a conceptual diagram showing the processing of a URL character string by the preprocessing means 1 in Embodiment 1 of the present invention. In FIG. 2, within the URL character string 20, the site name 21 is the portion representing the host name of the site, and the intra-site path 22 is the remaining character string beginning with "/".

A suitable hash function for obtaining the hash value is one with a low probability of mapping different content character strings to the same hash value; well-known functions such as MD5 or SHA-1 can be used.

FIG. 3 is a diagram showing the output information of the preprocessing means 1 in Embodiment 1 of the present invention. In FIG. 3, the preprocessing means output information 30 has one row per Web page, and each row constitutes site information made up of three columns: site name 31, intra-site path 32, and content hash value 33.
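The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the use of Python's `urllib.parse`, the UTF-8 encoding, and the choice of MD5 (one of the two hash functions the text names) are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def preprocess(url: str, content: str) -> tuple[str, str, str]:
    """Return (site name, intra-site path, content hash value) for one Web page."""
    parts = urlsplit(url)
    site_name = parts.hostname or ""      # host name portion of the URL
    path = parts.path or "/"              # remaining string, beginning with "/"
    if parts.query:
        path += "?" + parts.query         # keep the query string as part of the path
    # Hash the entire content string; SHA-1 could be substituted here.
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    return site_name, path, digest

row = preprocess("http://xxx.yyy.zzz/news.html", "<html>news</html>")
print(row[0], row[1])  # xxx.yyy.zzz /news.html
```

Each returned tuple corresponds to one row of the three-column output described above.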

Next, the first sorting means 2 sorts the rows of the preprocessing means output information 30 in ascending order of site name and then intra-site path. FIG. 4 is a diagram showing the output information of the first sorting means 2 in Embodiment 1 of the present invention. In FIG. 4, the first sorting means output information 40 has the same format as the preprocessing means output information 30, but differs in that the rows are arranged in order of site name 41 and then intra-site path 42. If the preprocessing means output information 30 contains multiple rows with the same site name and the same intra-site path, one of them is kept and the others are removed.
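A minimal sketch of this sort-and-deduplicate step is shown below. The function name `first_sort` and the tuple layout are assumptions, and keeping the first row seen for each (site name, path) pair is an arbitrary choice, since the text only says that any one row is kept.

```python
def first_sort(rows):
    """Sort (site, path, hash) rows by site then path, keeping one row per (site, path)."""
    unique = {}
    for site, path, digest in rows:
        unique.setdefault((site, path), digest)   # keep the first row seen for this key
    return [(site, path, digest) for (site, path), digest in sorted(unique.items())]

rows = [
    ("b.example", "/", "h1"),
    ("a.example", "/z", "h2"),
    ("a.example", "/", "h3"),
    ("a.example", "/", "h4"),   # same site and path as the previous row: dropped
]
print(first_sort(rows))
```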

Next, the tag assigning means 3 assigns a tag to the data corresponding to each Web page based on the first sorting means output information 40. FIG. 5 is a flowchart showing the details of the operation of the tag assigning means 3 in Embodiment 1 of the present invention. In FIG. 5, first, in step S501, the current site name and the current tag are initialized to the empty string. Next, in step S502, one row is input from the first sorting means output information 40. Next, in step S503, the site name column of the input row is compared with the current site name; if they match, the process proceeds to step S506.

If, on the other hand, these values do not match, the process proceeds to step S504, where the current tag is set to the value obtained by concatenating, as character strings, the content hash value column and the intra-site path column of the input row. Then, in step S505, the current site name is set to the value of the site name column.

In step S506, the current tag, the intra-site path column, the content hash value column, and the site name column are output as the tagged result for the input row. Finally, in step S507, it is determined whether all rows of the first sorting means output information 40 have been processed. If rows remain, the process returns to step S502 and the series of steps is performed on the remaining rows; otherwise, the process ends.

FIG. 6 is a diagram showing the output information of the tag assigning means 3 in Embodiment 1 of the present invention, namely the tag assigning means output information 60 produced when the tag assigning means 3 executes the series of steps in FIG. 5. In FIG. 6, the tag assigning means output information 60 has one row per Web page, and each row is site information made up of four columns: tag 61, intra-site path 62, content hash value 63, and site name 64.

The purpose of the processing by the tag assigning means 3 is to assign to each site, as a tag, the pair consisting of the intra-site path that comes first in character string order and its content hash value; these tags make it possible to form candidate sets of duplicate site names. For example, the site names 64 in the fifth and sixth rows of FIG. 6 are both xxx.yyy.zzz, so both rows are given the same tag 61, formed from the content hash value 63 and the intra-site path 62 of the fifth row. In FIG. 6, as an example of the value of the tag 61, a character string in which the content hash value 63 and the intra-site path 62 are joined with "-" between them is shown.
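The tagging loop of FIG. 5 can be sketched as below, using the hash-and-path tag format that FIG. 6 is described as showing. The function name `assign_tags`, the tuple layouts, and the sample hash values other than 12349876 are assumptions for illustration.

```python
def assign_tags(sorted_rows):
    """Tag every row of a site with the hash and path of that site's first row.

    sorted_rows: (site, path, hash) tuples sorted by site name, then path.
    Returns (tag, path, hash, site) tuples.
    """
    current_site = None
    current_tag = ""
    tagged = []
    for site, path, digest in sorted_rows:
        if site != current_site:               # S503: first row of a new site
            current_tag = f"{digest}-{path}"   # S504: hash and path joined with "-"
            current_site = site                # S505
        tagged.append((current_tag, path, digest, site))   # S506
    return tagged

rows = [
    ("aaa.bbb.ccc", "/", "12349876"),
    ("aaa.bbb.ccc", "/news.html", "1111"),
    ("xxx.yyy.zzz", "/", "12349876"),
    ("xxx.yyy.zzz", "/links.html", "2222"),
]
for row in assign_tags(rows):
    print(row)
```

Both sites here receive the tag `12349876-/`, so they fall into the same candidate group in the later stages.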

Next, the second sorting means 4 sorts the rows of the tag assigning means output information 60 in ascending order of tag, intra-site path, and content hash value. FIG. 7 is a diagram showing the output information of the second sorting means 4 in Embodiment 1 of the present invention. In FIG. 7, the second sorting means output information 70 has the same format as the tag assigning means output information 60 and differs only in the order of the rows.

Next, the counting means 5 performs aggregation on the data having the same tag, based on the second sorting means output information 70. FIG. 8 is a flowchart showing an outline of the operation of the counting means 5 in Embodiment 1 of the present invention. In FIG. 8, first, in step S801, rows are input from the second sorting means output information 70 for as long as the same tag continues. Next, in step S802, the hit count and the miss rate are computed for the input rows by the method described later, and are output together with the tag and the site name list.

Then, in step S803, it is determined whether all input has been processed. If unprocessed input remains, the process returns to step S801 and the series of steps is performed on it; if everything has been processed, the process ends.
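Reading rows "for as long as the same tag continues" amounts to grouping consecutive rows of the sorted output by tag. A minimal sketch, assuming rows are (tag, path, hash, site) tuples and using the hypothetical name `groups_by_tag`:

```python
from itertools import groupby
from operator import itemgetter

def groups_by_tag(tagged_rows):
    """Yield (tag, rows) for each run of rows that share a tag."""
    ordered = sorted(tagged_rows)   # second sort: tag, then path, then hash
    for tag, run in groupby(ordered, key=itemgetter(0)):   # S801: one run per tag
        yield tag, list(run)

tagged = [
    ("t2", "/", "h", "b.example"),
    ("t1", "/x", "h", "a.example"),
    ("t1", "/", "h", "a.example"),
]
for tag, group in groups_by_tag(tagged):
    print(tag, len(group))
```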

FIG. 9 is a flowchart showing in detail the operation of step S802 of FIG. 8 performed by the counting means 5 in Embodiment 1 of the present invention. In FIG. 9, in step S901, the number of distinct site names corresponding to the same tag is counted. Then, in step S902, the number of distinct intra-site paths corresponding to the same tag is counted. Further, in step S903, the distinct content hash values and their frequencies of occurrence are counted for each intra-site path.

Then, in step S904, the proportion of intra-site paths to which two or more distinct content hash values correspond is obtained as the miss rate. Next, in step S905, among the intra-site paths to which exactly one distinct content hash value corresponds, those for which the frequency of that content hash value is 2 or more and is also at least a certain fraction of the number of distinct site names are counted, and the result is taken as the hit count. Finally, in step S906, the tag, hit count, miss rate, and site name list are output, and the process ends.

FIG. 10 is a conceptual diagram showing the meaning of the operation of the counting means 5 in Embodiment 1 of the present invention. FIG. 10 shows, for the group whose tag is 12349876-/, a matrix in which the site names 101 are taken in the row direction, the intra-site paths 102 are taken in the column direction, and the corresponding content hash values 103 are arranged as elements.

An entry whose content hash value is N/A indicates that the URL corresponding to that site name and intra-site path was not present in the input Web page content 7. Because large-scale collection of Web information is generally carried out by following links between Web pages, information may be missing for a URL that actually exists simply because it was never accessed.

The operation shown in FIG. 9 is equivalent to determining, for each column of the matrix shown in FIG. 10, whether the column is a hit or a miss, and counting the number of columns of each kind. Specifically, in the example of FIG. 10, for the intra-site path "/", all content hash values are equal, so the number of distinct values is 1 and the frequency is 3, and this column is a hit.

For the intra-site path "/links.html", no content hash values exist for the sites aaa.bbb.ccc and zzz.www.aaa, so the number of distinct values is 1 and the frequency is 1. Because there is one distinct value but the frequency is less than 2, this column is regarded as neither a hit nor a miss. For the intra-site path "/news.html", the number of distinct content hash values is 2, so this column is a miss.

In this case, the counting means 5 outputs 12349876-/ as the tag, 1 as the hit count, 1/3 as the miss rate, and the three names aaa.bbb.ccc, xxx.yyy.zzz, and zzz.www.aaa as the site name list. The hit count obtained here corresponds to the number of matching columns in FIG. 10, and the miss rate corresponds to the non-matching columns in FIG. 10, expressed as a fraction of all columns.
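The counting of steps S901 to S906 can be sketched on data shaped like the FIG. 10 example. The function name `count_group`, the `min_site_fraction` parameter (standing in for the unspecified "certain fraction"), and the hash values for /links.html and /news.html are hypothetical; only the tag hash 12349876, the site names, and the paths come from the text.

```python
from collections import defaultdict
from fractions import Fraction

def count_group(rows, min_site_fraction=0.5):
    """rows: (path, hash, site) tuples sharing one tag. Returns (hits, miss_rate, sites)."""
    sites = sorted({site for _, _, site in rows})             # S901: distinct site names
    freq_by_path = defaultdict(lambda: defaultdict(int))
    for path, digest, _ in rows:
        freq_by_path[path][digest] += 1                       # S902, S903
    misses = sum(1 for freq in freq_by_path.values() if len(freq) >= 2)
    miss_rate = Fraction(misses, len(freq_by_path))           # S904
    hits = 0
    for freq in freq_by_path.values():                        # S905
        if len(freq) == 1:
            (count,) = freq.values()
            if count >= 2 and count >= min_site_fraction * len(sites):
                hits += 1
    return hits, miss_rate, sites                             # S906

fig10_rows = [
    ("/", "12349876", "aaa.bbb.ccc"),
    ("/", "12349876", "xxx.yyy.zzz"),
    ("/", "12349876", "zzz.www.aaa"),
    ("/links.html", "hypothetical-1", "xxx.yyy.zzz"),
    ("/news.html", "hypothetical-2", "aaa.bbb.ccc"),
    ("/news.html", "hypothetical-3", "xxx.yyy.zzz"),
]
print(count_group(fig10_rows))
# (1, Fraction(1, 3), ['aaa.bbb.ccc', 'xxx.yyy.zzz', 'zzz.www.aaa'])
```

Missing matrix entries (N/A) are simply absent from the row list, so they affect neither the number of distinct hash values nor the frequency for a path.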

Next, the determination means 6 makes a determination based on the tag, hit count, miss rate, and site name list output by the counting means 5, and outputs the duplicate site name list 8. FIG. 11 is a flowchart showing the operation of the determination means 6 in Embodiment 1 of the present invention. In FIG. 11, in step S1101, the hit count, miss rate, and site name list are received for each tag.

Next, in step S1102, if the hit count is equal to or greater than a certain value and the miss rate is less than a certain proportion, all sites corresponding to the tag are determined to be duplicates, and the site name list is output to the duplicate site name list 8. Then, in step S1103, it is determined whether the determination has been made for all tags. If an unprocessed tag remains, the process returns to step S1101 and the series of steps is performed on it; if none remains, the process ends.
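The threshold test of step S1102 can be sketched as a simple filter. The function name `judge` and the default thresholds are placeholders; the text does not give concrete values for the "certain value" or the "certain proportion".

```python
def judge(counted, min_hits=1, max_miss_rate=0.5):
    """counted: iterable of (tag, hits, miss_rate, sites). Returns duplicate site name lists."""
    return [
        sites
        for _tag, hits, miss_rate, sites in counted
        if hits >= min_hits and miss_rate < max_miss_rate   # S1102
    ]

counted = [
    ("12349876-/", 1, 1 / 3, ["aaa.bbb.ccc", "xxx.yyy.zzz", "zzz.www.aaa"]),
    ("ffff-/", 0, 0.0, ["solo.example"]),
    ("eeee-/", 3, 0.8, ["p.example", "q.example"]),
]
print(judge(counted))  # [['aaa.bbb.ccc', 'xxx.yyy.zzz', 'zzz.www.aaa']]
```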

According to Embodiment 1, counting means is provided that compares intra-site paths and content hash values across an arbitrary number (one or more) of sites, so large-scale duplicate sites can be determined without accumulating error. In particular, only intra-site paths common to at least a certain fraction of the sites under determination are regarded as matches (hits), and a mismatch in even some of the many sites is regarded as a miss, so the criterion becomes stricter as the number of sites increases and the trade-off in threshold setting can be avoided. Furthermore, since tag assigning means is provided, all sites in which the content of some path matches can be efficiently extracted as duplicate candidates based on the tags.

Embodiment 2.
In Embodiment 1, duplication is determined based on the count of hits and misses of intra-site paths within the same tag. Next, an embodiment is described that takes the similarity of site names into account and increases the tolerance for mismatches for a duplicate site candidate set consisting of similar site names.

FIG. 12 is a configuration diagram of the duplicate Web site detection apparatus according to Embodiment 2 of the present invention. In FIG. 12, components bearing the same numbers as in FIG. 1 are means that operate in the same way. Compared with FIG. 1 of Embodiment 1, FIG. 12 differs in that domain counting means 5a, which performs counting on the second sorting means output information based on site names, is newly added, and the determination means 6 is replaced by determination means 6a, which determines duplicate sites based on the output information of both the counting means 5 and the domain counting means 5a.

FIG. 13 is a flowchart showing the operation of the domain counting means 5a in Embodiment 2 of the present invention. In FIG. 13, in step S1301, the number of components separated by "." is counted for each site name, and the minimum of these counts is obtained as the minimum domain level count. For example, when the site name is xxx.yyy.zzz, the number of components is 3.

Next, in step S1302, the frequency of occurrence is obtained for each component of the site names (for example, each of "xxx", "yyy", and "zzz"), and the components whose frequency is at least a certain fraction of the number of input sites are counted to give the frequent domain level count. Finally, in step S1303, the minimum domain level count and the frequent domain level count are output, and the process ends.
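A minimal sketch of the domain counting in steps S1301 to S1303 follows. The function name `count_domains` and the `min_fraction` parameter are hypothetical, and two interpretation choices are assumptions: components are compared as strings regardless of their position in the name, and each component is counted at most once per site.

```python
from collections import Counter

def count_domains(site_names, min_fraction=0.5):
    """Return (minimum domain level count, frequent domain level count) for one tag group."""
    labels_per_site = [name.split(".") for name in site_names]
    min_levels = min(len(labels) for labels in labels_per_site)            # S1301
    freq = Counter(label for labels in labels_per_site for label in set(labels))
    threshold = min_fraction * len(site_names)
    frequent = sum(1 for n in freq.values() if n >= threshold)             # S1302
    return min_levels, frequent                                            # S1303

print(count_domains(["example.com", "www.example.com",
                     "ftp.example.com", "mail.example.com"]))   # (2, 2)
print(count_domains(["a.example.com", "b.example.com", "c.example.com"]))  # (3, 2)
```

In the first call the two shared components cover the shortest name entirely, so the frequent count reaches the minimum level count; in the second it does not.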

FIG. 14 is a flowchart showing the operation of the determination means 6a in Embodiment 2 of the present invention. In FIG. 14, first, in step S1401, the hit count and miss rate are received from the counting means 5 together with the tag and the site name list, and the minimum domain level count and frequent domain level count for the same tag are received from the domain counting means 5a.

Next, in step S1402, the frequent domain level count is compared with the minimum domain level count. If the frequent domain level count is smaller than the minimum domain level count, the process proceeds to step S1404. If the frequent domain level count is equal to or greater than the minimum domain level count, the process proceeds to step S1403, where the hit count is multiplied by a coefficient proportional to the logarithm of the number of sites, and the miss rate is multiplied by the reciprocal of that coefficient. As the coefficient, for example, log(number of sites) × 4.5 is used.

In step S1404, if the hit count is equal to or greater than a certain value and the miss rate is less than a certain proportion, the site name list is output to the duplicate site name list 8. Then, in step S1405, it is determined whether all tags have been processed. If an unprocessed tag remains, the process returns to step S1401 and the series of steps is performed on it; if none remains, the process ends.
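The corrected determination of steps S1402 to S1404 can be sketched as follows. The function name `judge_with_domains`, the threshold defaults, and the use of the natural logarithm are assumptions (the text does not state the base of the logarithm); the factor 4.5 is the example value given in the text.

```python
import math

def judge_with_domains(hits, miss_rate, n_sites, min_levels, frequent_levels,
                       min_hits=10, max_miss_rate=0.1, scale=4.5):
    """Return True when the tag group is judged to be a duplicate site set."""
    if frequent_levels >= min_levels:              # S1402: site names are similar
        coeff = math.log(n_sites) * scale          # S1403: example coefficient
        if coeff > 0:                              # guard: log(1) is 0
            hits *= coeff
            miss_rate /= coeff
    return hits >= min_hits and miss_rate < max_miss_rate   # S1404

# 100 similarly named sites: the correction lets a weak raw count pass.
print(judge_with_domains(2, 0.2, 100, min_levels=2, frequent_levels=2))  # True
# Same counts with dissimilar names: no correction, so the group is rejected.
print(judge_with_domains(2, 0.2, 100, min_levels=3, frequent_levels=2))  # False
```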

In this way, the determination means 6a corrects the hit count and miss rate counted by the counting means 5 based on the domain occurrence frequencies counted by the domain counting means 5a, so that information on the similarity of site names can also be used in detecting duplicate sites.

According to Embodiment 2, since domain counting means is provided, information on the similarity of site names can also be used in the determination. In particular, when the site names in a duplicate site candidate set are highly similar, a bias is applied to the counting results for the intra-site paths so that the set is more readily determined to be duplicate, which prevents large-scale duplicate sites from being missed.

In the configuration of FIG. 12, the counting means 5 and the domain counting means 5a are used together, but it is also possible to detect duplicate sites in a simplified manner using only the domain counting means 5a, without the counting means 5.

Embodiment 3.
In Embodiment 1, each site is given a single tag corresponding to its first intra-site path. Next, an embodiment in which multiple tags are assigned to each site is described.

FIG. 15 is a configuration diagram of the duplicate Web site detection apparatus according to Embodiment 3 of the present invention. In FIG. 15, components bearing the same numbers as in FIG. 1 are means that operate in the same way. Compared with FIG. 1 of Embodiment 1, FIG. 15 differs in that the tag assigning means 3 is replaced by multiple-tag assigning means 3a, and merging means 9 is newly added after the determination means 6.

FIG. 16 is a flowchart showing the operation of the multiple-tag assigning means 3a in Embodiment 3 of the present invention. In FIG. 16, in step S1601, rows are input from the first sorting means output information 40 for as long as the same site name continues, up to a maximum of N rows, and the number of rows input is denoted M. Next, in step S1602, the string concatenation of the content hash value column and the intra-site path is computed for each of the M input rows, and the results are set as current tags 1, 2, ..., M.

Then, in step S1603, each row corresponding to the same site name is input from the first sorting means output information 40, and for each row, M rows are output, each combining one of the current tags 1 to M with the intra-site path column, content hash value column, and site name column. As a result, each site is assigned tags corresponding to multiple intra-site paths. Note that the multiple intra-site paths may be selected by a criterion other than taking the first N.

Finally, in step S1604, it is determined whether all rows have been processed. If unprocessed rows remain, the process returns to step S1601 and the series of steps is applied to them; if none remain, the process ends.
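Steps S1601 to S1604 can be illustrated with the following minimal Python sketch. It is not the patented implementation: the function name `assign_multiple_tags`, the tuple layout `(site, path, content_hash)`, and the choice to concatenate the hash before the path are assumptions made for illustration, and the input is assumed to be already sorted by site name and intra-site path, as the first sorting means output information 30 would be.

```python
from itertools import groupby
from operator import itemgetter


def assign_multiple_tags(sorted_rows, n):
    """Yield (tag, path, content_hash, site) rows.

    sorted_rows: iterable of (site, path, content_hash) tuples, assumed to be
    sorted by site and then by path (hypothetical layout).
    n: upper limit N on the number of tags generated per site.
    """
    for site, group in groupby(sorted_rows, key=itemgetter(0)):
        rows = list(group)
        # S1601/S1602: the first M (<= N) rows of the site define current tags 1..M.
        # Assumption: a tag is the content hash followed by the intra-site path.
        tags = [content_hash + path for (_, path, content_hash) in rows[:n]]
        # S1603: every row of the site is output once per current tag.
        for _, path, content_hash in rows:
            for tag in tags:
                yield (tag, path, content_hash, site)
        # S1604: groupby moves on to the next site until the input is exhausted.


rows = [
    ("a.example", "/", "h1"),
    ("a.example", "/x", "h2"),
    ("b.example", "/", "h1"),
]
tagged = list(assign_multiple_tags(rows, n=2))
```

In this example, site `a.example` receives two tags and therefore contributes four output rows, while `b.example` has only one row and receives a single tag.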

The subsequent counting means 5 and determination means 6 perform, for each tag, the same processing as described in Embodiment 1, so the same site may appear more than once in the output of the determination means 6. The merging means 9 therefore merges duplicate site sets that contain the same site into a single duplicate site set, and outputs a duplicate site list in which each site name appears at most once.
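One way to realize the merging described above is a union-find (disjoint-set) structure. The text does not say which algorithm the merging means 9 uses, so the sketch below is only an illustration; the function name `merge_duplicate_sets` and the representation of duplicate site sets as Python sets of site names are assumptions.

```python
def merge_duplicate_sets(site_sets):
    """Merge duplicate site sets that share at least one site name.

    site_sets: iterable of collections of site names (hypothetical input format).
    Returns a sorted list of sorted lists; each site name appears at most once.
    """
    parent = {}

    def find(x):
        # Register unseen sites as their own root, then follow parent links.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for sites in site_sets:
        sites = list(sites)
        for site in sites:
            find(site)
        for other in sites[1:]:
            union(sites[0], other)

    merged = {}
    for site in list(parent):
        merged.setdefault(find(site), set()).add(site)
    return sorted(sorted(group) for group in merged.values())


merged_sets = merge_duplicate_sets([{"a", "b"}, {"b", "c"}, {"d", "e"}])
```

Here the sets `{a, b}` and `{b, c}` share site `b`, so they collapse into one set, while `{d, e}` is left unchanged.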

According to Embodiment 3, because a multiple tag assigning means is provided, duplicate sites can be detected even when the first intra-site path is not common to all of the duplicate sites, which reduces detection omissions.

Embodiment 4.
In Embodiment 1, all Web pages in the Web page content 7 were processed. Next, an embodiment is described that provides means for selecting in advance only those Web pages whose content is likely to be duplicated.

FIG. 17 is a configuration diagram of the duplicate Web site detection apparatus according to Embodiment 4 of the present invention. In FIG. 17, components bearing the same reference numerals as in FIG. 1 operate in the same way. FIG. 17 differs from FIG. 1 of Embodiment 1 in that a Web page selection means 1a is added before the preprocessing means 1. The Web page selection means 1a examines, for each Web page, the appearance frequency of its intra-site path name and content length, removes pages whose combination appears only once, and passes only those that appear multiple times to the preprocessing means 1.

According to Embodiment 4, providing the Web page selection means makes it possible to remove non-duplicate content before the computationally expensive hash processing, which improves processing efficiency.
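The filtering performed by the Web page selection means 1a can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_candidate_pages` and the tuple layout `(site, path, content_length)` are hypothetical, and the intra-site path and content length are treated as a single combined key.

```python
from collections import Counter


def select_candidate_pages(pages):
    """Keep only pages whose (path, content_length) pair occurs more than once.

    pages: list of (site, path, content_length) tuples (hypothetical layout).
    A pair seen only once cannot belong to a duplicated page, so such pages are
    dropped before any content hash is computed.
    """
    counts = Counter((path, length) for _, path, length in pages)
    return [page for page in pages if counts[(page[1], page[2])] > 1]


pages = [
    ("a.example", "/index.html", 100),
    ("b.example", "/index.html", 100),
    ("c.example", "/index.html", 250),  # same path, different length
    ("a.example", "/only.html", 40),    # path appears on one site only
]
candidates = select_candidate_pages(pages)
```

Only the two `/index.html` pages of length 100 survive the filter, so only those two would need to be hashed.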

In FIG. 17, the Web page selection means 1a and the preprocessing means 1 are shown as separate components, but the function of the Web page selection means 1a may instead be incorporated into the preprocessing means 1.

Embodiment 5.
In Embodiment 1, all intra-site paths were taken into account. Next, an embodiment is described that provides means for removing intra-site paths that are likely to cause erroneous determinations.

FIG. 18 is a configuration diagram of the duplicate Web site detection apparatus according to Embodiment 5 of the present invention. In FIG. 18, the first duplicate Web site detection apparatus 181 corresponds to any one of Embodiments 1 to 4 above, but differs from them in that it outputs the list of sites determined not to be duplicates as a non-duplicate site set. The exclusion path extraction means 182 extracts, from the intra-site paths that produced hits within the non-duplicate site set, those whose appearance frequency is higher than a predetermined value, and outputs them as an exclusion path list 183.

The second duplicate Web site detection apparatus 184 also corresponds to any one of Embodiments 1 to 4 above, but differs from them in that, when it processes the Web page content 7 again, it ignores the intra-site paths stored in the exclusion path list 183.

In the configuration of FIG. 18, the first duplicate Web site detection apparatus 181 and the second duplicate Web site detection apparatus 184 are shown as separate apparatuses, but the configuration is not limited to this. By feeding the exclusion path list 183 back to the first duplicate Web site detection apparatus 181, an equivalent effect can be obtained with a single duplicate Web site detection apparatus. Furthermore, the extraction of the exclusion path list 183 may be repeated until no intra-site path remains whose appearance frequency is higher than the predetermined value.

According to Embodiment 5, because the exclusion path extraction means and the second duplicate Web site detection apparatus are provided, the influence of paths that can be common even to completely unrelated sites (for example, the manual pages of Web server software) is eliminated, and such sites are prevented from being erroneously determined to be duplicates.
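The exclusion path extraction and the second pass over the data can be sketched as below. The sketch assumes a particular input format that the text does not specify: the non-duplicate site set is represented as a list with one entry per rejected candidate set, each entry listing the intra-site paths that produced hits in that set. The function names `extract_excluded_paths` and `drop_excluded` and the threshold value are likewise hypothetical.

```python
from collections import Counter


def extract_excluded_paths(hit_paths_per_set, threshold):
    """Return intra-site paths that hit in more than `threshold` non-duplicate sets.

    hit_paths_per_set: list of path lists, one per candidate set that was judged
    not to be a duplicate (hypothetical representation).
    """
    counts = Counter()
    for paths in hit_paths_per_set:
        counts.update(set(paths))  # count each path at most once per set
    return sorted(path for path, count in counts.items() if count > threshold)


def drop_excluded(rows, excluded_paths):
    """Second pass: ignore rows whose intra-site path is on the exclusion list.

    rows: list of (site, path, content_hash) tuples (hypothetical layout).
    """
    excluded = set(excluded_paths)
    return [row for row in rows if row[1] not in excluded]


hits = [
    ["/manual/index.html", "/"],
    ["/manual/index.html"],
    ["/manual/index.html", "/about"],
]
excluded = extract_excluded_paths(hits, threshold=2)
rows = [
    ("a.example", "/manual/index.html", "h9"),
    ("a.example", "/news.html", "h3"),
]
remaining = drop_excluded(rows, excluded)
```

In this example the server manual page matches across three unrelated candidate sets, so it is placed on the exclusion list and ignored when the page data is processed again.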

The duplicate Web site detection apparatus described in Embodiments 1 to 5 can be applied as follows. It can be applied to a system provided with a database in which one representative site name is selected for each duplicate site set and which converts duplicate site names into the representative site name.

It can also be applied to a system, and to a Web document collection method, comprising a Web crawler that repeatedly acquires Web documents via a network while converting the URLs of the links contained in the acquired Web documents by referring to the database that converts site names into representative site names.

Furthermore, it can be applied to a system, and to a Web document collection method, comprising a Web crawler that repeatedly acquires Web documents via a network while converting the URLs of the links contained in the acquired Web documents by referring to the database that converts site names into representative site names, together with means for detecting duplicate sites from the acquired Web documents and updating that database.
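The conversion of link URLs to representative site names described above can be sketched as follows. The text does not say how the representative is chosen or how the database is stored, so the sketch makes assumptions: an in-memory dictionary stands in for the database, the lexicographically smallest site name is picked as the representative, and the function names `build_representative_map` and `canonicalize` are hypothetical. URL handling uses the standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def build_representative_map(duplicate_sets):
    """Map every site name in a duplicate site set to one representative.

    Assumption: the lexicographically smallest name is the representative.
    """
    mapping = {}
    for sites in duplicate_sets:
        representative = min(sites)
        for site in sites:
            mapping[site] = representative
    return mapping


def canonicalize(url, mapping):
    """Rewrite the host part of `url` to its representative site name.

    Hosts not in the mapping are kept (in lowercase). Any user:password part
    of the URL is dropped in this simplified sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    representative = mapping.get(host, host)
    netloc = representative if parts.port is None else f"{representative}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


site_map = build_representative_map([{"www.example.com", "mirror.example.net"}])
link = canonicalize("http://www.example.com/a/b.html?x=1", site_map)
```

A crawler that applies `canonicalize` to every extracted link before queueing it would fetch each duplicated page from the representative site only.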

FIG. 1 is a configuration diagram of the duplicate Web site detection apparatus in Embodiment 1 of the present invention.
FIG. 2 is a conceptual diagram showing the processing of a URL character string by the preprocessing means in Embodiment 1 of the present invention.
FIG. 3 is a diagram showing the output information of the preprocessing means in Embodiment 1 of the present invention.
FIG. 4 is a diagram showing the output information of the first sorting means in Embodiment 1 of the present invention.
FIG. 5 is a flowchart showing the details of the operation of the tag assigning means in Embodiment 1 of the present invention.
FIG. 6 is a diagram showing the output information of the tag assigning means in Embodiment 1 of the present invention.
FIG. 7 is a diagram showing the output information of the second sorting means in Embodiment 1 of the present invention.
FIG. 8 is a flowchart showing an outline of the operation of the counting means in Embodiment 1 of the present invention.
FIG. 9 is a flowchart showing in detail the operation of step S802 of FIG. 8 of the counting means in Embodiment 1 of the present invention.
FIG. 10 is a conceptual diagram showing the meaning of the operation of the counting means in Embodiment 1 of the present invention.
FIG. 11 is a flowchart showing the operation of the determination means in Embodiment 1 of the present invention.
FIG. 12 is a configuration diagram of the duplicate Web site detection apparatus in Embodiment 2 of the present invention.
FIG. 13 is a flowchart showing the operation of the domain counting means in Embodiment 2 of the present invention.
FIG. 14 is a flowchart showing the operation of the determination means in Embodiment 2 of the present invention.
FIG. 15 is a configuration diagram of the duplicate Web site detection apparatus in Embodiment 3 of the present invention.
FIG. 16 is a flowchart showing the operation of the multiple tag assigning means in Embodiment 3 of the present invention.
FIG. 17 is a configuration diagram of the duplicate Web site detection apparatus in Embodiment 4 of the present invention.
FIG. 18 is a configuration diagram of the duplicate Web site detection apparatus in Embodiment 5 of the present invention.

Explanation of symbols

1 preprocessing means, 1a Web page selection means, 2 first sorting means, 3 tag assigning means, 3a multiple tag assigning means, 4 second sorting means, 5 counting means, 5a domain counting means, 6, 6a determination means, 9 merging means, 10 duplicate site set detection means, 181 first duplicate Web site detection apparatus, 182 exclusion path extraction means, 184 second duplicate Web site detection apparatus.

Claims

A duplicate Web site detection apparatus comprising:
preprocessing means for generating site information by extracting, from Web page information, a content hash value, a site name, and an intra-site path corresponding to the URL of each site; and
duplicate site set detection means for constructing, based on the generated site information, a matrix whose rows are indexed by site name, whose columns are indexed by intra-site path, and whose elements are content hash values, and for detecting, as a duplicate site set, a plurality of rows for which the number of columns with matching content hash values is equal to or greater than a first predetermined value and the number of columns with mismatching content hash values is less than a second predetermined value.

The duplicate Web site detection apparatus according to claim 1,
wherein the duplicate site set detection means generates the same tag for URLs for which at least one of the content hash value or the site name in the site information is duplicated, treats the site names associated with the same tag as a duplicate site candidate set, constructs, based on the site information corresponding to the duplicate site candidate set, a matrix whose rows are indexed by site name, whose columns are indexed by intra-site path, and whose elements are content hash values, and detects, as a duplicate site set, a duplicate site candidate set in which the number of columns with matching content hash values is equal to or greater than the first predetermined value and the number of columns with mismatching content hash values is less than the second predetermined value.

The duplicate Web site detection apparatus according to claim 2,
wherein the duplicate site set detection means includes:
first sorting means for rearranging the site information extracted by the preprocessing means in character string order of the site name and the intra-site path;
tag assigning means for generating, for each piece of the rearranged site information, a tag consisting of a pair of an intra-site path and a content hash value, wherein, for a plurality of pieces of site information having the same site name, the same tag is generated from the intra-site path and the content hash value of the first piece of site information in character string order, and for attaching the generated tag to the site information;
second sorting means for rearranging the tagged site information in character string order of the tags and extracting, as a duplicate site candidate set, the site names included in the site information having the same tag;
counting means for constructing, based on the site information corresponding to each extracted duplicate site candidate set, a matrix whose rows are indexed by site name, whose columns are indexed by intra-site path, and whose elements are content hash values, for counting, for each column, the number of distinct content hash values and the appearance frequency of each, for counting as a matching column any column in which the number of distinct values is 1 and the appearance frequency is 2 or more and is at least a predetermined percentage of the number of distinct site names in the duplicate site candidate set, and for counting as a mismatching column any column in which the number of distinct values is 2 or more; and
determination means for detecting, as a duplicate site set, a duplicate site candidate set in which, based on the numbers of matching columns and mismatching columns counted by the counting means, the number of matching columns is equal to or greater than the first predetermined value and the number of mismatching columns is less than the second predetermined value.

The duplicate Web site detection apparatus according to claim 3,
wherein the duplicate site set detection means further includes domain counting means for counting, for the site names included in each extracted duplicate site candidate set, a domain appearance frequency for each domain name portion of each site name delimited by ".", and
the determination means corrects the numbers of matching columns and mismatching columns counted by the counting means when the domain appearance frequency counted by the domain counting means is equal to or greater than a predetermined ratio of the number of sites included in the duplicate site candidate set, and detects the site names included in the duplicate site candidate set as a duplicate site set when the corrected number of matching columns is equal to or greater than the first predetermined value and the corrected number of mismatching columns is less than the second predetermined value.

The duplicate Web site detection apparatus according to claim 3 or 4,
wherein, for a plurality of pieces of site information having the same site name, the tag assigning means divides the site information rearranged in character string order into a plurality of groups and generates a plurality of tags by generating, for each group, the same tag consisting of a pair of the intra-site path and the content hash value corresponding to the first piece of site information in that group,
the apparatus further comprising merging means for merging, among the duplicate site sets detected by the determination means, duplicate site sets that contain the same site name into a single duplicate site set.

The duplicate Web site detection apparatus according to any one of claims 1 to 5,
wherein the preprocessing means counts the appearance frequency of the intra-site path and content length corresponding to the URL of each site, and generates the site information after deleting URLs whose appearance frequency is one.

The duplicate Web site detection apparatus according to any one of claims 1 to 6,
further comprising exclusion path extraction means for extracting, as a non-duplicate site set, a set that was not detected as a duplicate site set by the duplicate site set detection means, and for extracting, as an exclusion path list, those intra-site paths included in the non-duplicate site set whose appearance frequency is equal to or higher than a predetermined value,
wherein, when an exclusion path list was extracted by the exclusion path extraction means in the preceding duplicate Web site detection process performed by the preprocessing means, the duplicate site set detection means, and the exclusion path extraction means,
the preprocessing means receives the exclusion path list as feedback, deletes URLs containing an intra-site path in the exclusion path list, and then generates the site information again, and
the duplicate site set detection means detects a duplicate site set again based on the site information regenerated after deleting the URLs containing an intra-site path in the exclusion path list.

The duplicate Web site detection apparatus according to claim 7,
wherein the series of processes performed by the preprocessing means and the duplicate site set detection means based on the exclusion path list is repeated until the exclusion path extraction means no longer extracts any intra-site path whose appearance frequency is higher than the predetermined value.