JP2010123000A

JP2010123000A - Web page group extraction method, device and program

Info

Publication number: JP2010123000A
Application number: JP2008297242A
Authority: JP
Inventors: Yukako Kitagawa; 結香子北川; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-11-20
Filing date: 2008-11-20
Publication date: 2010-06-03

Abstract

PROBLEM TO BE SOLVED: To extract, as groups and in page units, Web pages having similar content from a large number of URLs, using only URL information.

SOLUTION: URLs are extracted from an input access log. Each URL is regarded as a character string and divided into partial character strings, one set per part of the URL. A feature vector is generated from the partial character strings that appear, similarities between the feature vectors are obtained, and clustering is performed based on those similarities. The URLs contained in each generated cluster are extracted and output as a Web page group.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a Web page group extraction method, apparatus, and program, and in particular to a method, apparatus, and program that, given a large number of URLs such as those contained in an access log, extract Web pages that have similar content but different URLs as a single group.

When an access log is analyzed, the URLs are used to determine which pages are accessed and how. Some URLs, although different, should be treated as accesses to the same Web page or to the same type of Web page.

Different URLs that should be treated as the same Web page include Web pages with identical content, such as mirror pages set up for load distribution. To a Web viewer these are the same page, and analysis accuracy drops unless accesses to them are processed as accesses to a single URL.

Web pages of the same type arise, for example, when a separate Web page exists for each of many items of the same kind.

For example, an EC (electronic commerce) site may provide an individual product detail page for each of a large number of products. Depending on the purpose of the log analysis, treating such pages as accesses to one type of Web page, rather than as accesses to individual Web pages, improves analysis accuracy.

For these reasons, groups of Web pages that have different URLs but are the same Web page, or carry the same type of information, need to be extracted and analyzed as accesses to a single Web page.

Extracting such groups by hand from the URLs contained in an access log is difficult.

Meanwhile, methods for extracting Web sites with the same content include a method for discovering mirror site groups. That method estimates which pages in a large set of Web pages are site top pages, determines a set of sites from the estimated top pages and the pages linked from them, narrows the set to sites of at least a certain size, and creates files of the link strings, anchor strings, and internal/external link information of each site. Site pairs with the same features are selected as mirror site candidates, and mirror site pairs are detected from the similarity of the candidate pairs (see, for example, Patent Document 1).

Patent Document 1: JP 2004-264926 A

However, the site extraction method described above has the following problems.

(1) The method finds identical sites using the URLs contained in the access log together with information from the Web pages those URLs point to, such as their link structure. When an access log is analyzed, however, the Web page as it existed when the URL was accessed cannot always be obtained, so the information contained in the page is often unavailable. In such cases, duplicate Web pages cannot be extracted.

(2) The method aims to find identical sites on a site-by-site basis. Web pages with the same content, however, do not necessarily belong to a mirror of an entire site; duplicate pages are often prepared only for specific pages on which access is concentrated. In such cases the whole site is not duplicated, and the duplicate Web pages cannot be extracted.

(3) The method extracts only completely identical Web pages; Web pages that can be regarded as accesses to the same type of information cannot be extracted.

In short, the conventional method requires that the Web pages themselves be available when the access log is analyzed, and it extracts in units of sites rather than pages.

The present invention has been made in view of the above points, and its object is to provide a Web page group extraction method, apparatus, and program capable of extracting, as groups and in page units, Web pages having similar content from a large number of URLs using only URL information.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a method for extracting, from an access log, Web pages that have different URLs but are the same Web page or the same type of Web page as a group, in which the following steps are performed:
a URL extraction step (step 1) in which access log input means extracts URLs from an input access log and stores them in URL storage means;
a character string dividing step (step 2) in which character string dividing means reads a URL from the URL storage means, regards the URL as a character string, and divides it into partial character strings for each part;
a feature vector calculation step (step 3) in which feature vector calculation means generates a feature vector based on the partial character strings that appear and stores it in feature vector storage means;
a similarity calculation step (step 4) in which similarity calculation means reads the feature vectors from the feature vector storage means, obtains the similarity between the feature vectors, and stores it in similarity storage means; and
a clustering step (step 5) in which clustering means reads the similarities between the feature vectors from the similarity storage means, performs clustering, extracts the URLs contained in each generated cluster as a Web page group, and outputs them to URL classification storage means.

Further, in the present invention (claim 2),
in the character string dividing step (step 2), the URL is divided into partial character strings for each of the host part, domain part, directory part, and query part, and
in the feature vector calculation step (step 3), the feature vector is obtained from the appearance frequency of the partial character strings.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 3) is a Web page group extraction apparatus that extracts, from an access log, Web pages that have different URLs but are the same Web page or the same type of Web page as a group, the apparatus comprising:
URL extraction means 2 for extracting URLs from an input access log and storing them in URL storage means 3;
character string dividing means 41 for reading a URL from the URL storage means 3, regarding the URL as a character string, and dividing it into partial character strings for each part;
feature vector calculation means 42 for generating a feature vector based on the partial character strings that appear and storing it in feature vector storage means 5;
similarity calculation means 43 for reading the feature vectors from the feature vector storage means 5, obtaining the similarity between the feature vectors, and storing it in similarity storage means 6; and
clustering means 44 for reading the similarities between the feature vectors from the similarity storage means 6, performing clustering, extracting the URLs contained in each generated cluster as a Web page group, and outputting them to URL classification storage means 7.

Further, in the present invention (claim 4),
the character string dividing means 41 includes means for dividing the URL into partial character strings for each of the host part, domain part, directory part, and query part, and
the feature vector calculation means 42 includes means for obtaining the feature vector from the appearance frequency of the partial character strings.

The present invention (claim 5) is a Web page group extraction program for causing a computer to function as each of the means constituting the Web page group extraction apparatus according to claim 3 or 4.

As described above, according to the present invention, Web pages that have different URLs but are the same Web page and/or the same type of Web page can be extracted as a group from a large number of URLs using only URL information. Once a group of Web pages has been extracted, accesses to those pages can be handled as accesses to a single Web page, which can be expected to improve the accuracy of access log analysis.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of the URL classification apparatus in one embodiment of the present invention.

The URL classification apparatus shown in FIG. 3 comprises an access log input unit 2, a URL storage unit 3, a URL classification unit 4, a feature vector storage unit 5, a similarity storage unit 6, and a URL classification storage unit 7.

When the collected access log 1 is input, the access log input unit 2 extracts the URLs from the access log and stores them in the URL storage unit 3.
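The behavior of the access log input unit 2 can be sketched as follows. This is only an illustration: the text does not specify the log format, so the sketch assumes that each log line contains at most one absolute `http`/`https` URL, and the names `URL_PATTERN` and `extract_urls` are hypothetical.

```python
import re

# Assumption: each log line carries at most one absolute http(s) URL.
# The real log layout (see FIG. 5) is not given in the text.
URL_PATTERN = re.compile(r"https?://[^\s\"']+")


def extract_urls(log_lines):
    """Return the URLs found in the log lines, in order of appearance."""
    urls = []
    for line in log_lines:
        match = URL_PATTERN.search(line)
        if match:  # lines without a URL are skipped
            urls.append(match.group(0))
    return urls


if __name__ == "__main__":
    sample = [
        "2008-11-20 10:00:01 GET http://www.xxx.yyy/path1?param1=value1 200",
        "line without any URL",
    ]
    print(extract_urls(sample))
```

The returned list plays the role of the URL storage unit 3 in this sketch; a real deployment would parse the actual log format rather than rely on a regular expression.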

The URL classification unit 4 comprises a character string dividing unit 41, a feature vector calculation unit 42, a similarity calculation unit 43, and a clustering unit 44. The character string dividing unit 41 reads a URL from the URL storage unit 3 and divides the character string contained in the URL into its parts. The feature vector calculation unit 42 counts the partial character strings and obtains a URL feature vector. The similarity calculation unit 43 calculates the similarity between URL feature vectors. The clustering unit 44 performs clustering and stores the result in the URL classification storage unit 7.

The operation of the above configuration is described below.

FIG. 4 is a flowchart of the operation of the URL classification unit in one embodiment of the present invention.

Step 101) First, the character string dividing unit 41 reads the URLs from the URL storage unit 3 and, for each URL contained in the access log, such as the one shown in FIG. 5, divides the character string into its parts. The directory part is further divided by hierarchy level, and the query part by parameter.

An example of the division is shown below.

URL example:
http://www.xxx.yyy/path1/path2/path3?param1=value1&param2=value2
Division example:
host part [www]
domain part [xxx.yyy]
directory part [path1, path2, path3]
query part [param1=value1, param2=value2]
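The division in step 101 can be sketched with Python's standard `urllib.parse.urlsplit`. The function name `split_url` is hypothetical, and the rule that the first label of the host name is the host part while the remainder is the domain part is an assumption inferred from the single example above; the text does not state a general rule.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into host, domain, directory, and query substrings."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    # Assumption: first label = host part, remaining labels = domain part.
    host, _, domain = hostname.partition(".")
    if not domain:  # single-label host name: treat it all as the domain
        host, domain = "", hostname
    return {
        "host": [host] if host else [],
        "domain": [domain] if domain else [],
        # one substring per directory level
        "directory": [seg for seg in parts.path.split("/") if seg],
        # one substring per query parameter
        "query": [param for param in parts.query.split("&") if param],
    }


if __name__ == "__main__":
    print(split_url(
        "http://www.xxx.yyy/path1/path2/path3?param1=value1&param2=value2"
    ))
```

Keeping the four parts in separate lists preserves the information about which part a substring came from, which the feature vector step relies on.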
Step 102) The feature vector calculation unit 42 counts the partial character strings contained in each URL, obtains a URL feature vector from the counts, and stores it in the feature vector storage unit 5. Partial character strings that appear in different parts are treated as different partial character strings.

The number of vector components is N = p + q + r + s, where p is the number of distinct partial character strings occurring in the host part across all target URLs, and q, r, and s are the corresponding numbers for the domain part, the directory (path) part, and the query part.

FIG. 7 shows part of the URL feature vectors calculated when the 13 example URLs shown in FIG. 6 are given.

In the URL example of FIG. 6, the numbers of distinct partial character strings are
host part: 4, domain part: 5, directory part: 11, query part: 5,
so each URL is represented by a feature vector of 25 dimensions.
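The feature vector construction of step 102 can be sketched as below. The sketch takes URLs that have already been divided into parts (dictionaries with `host`, `domain`, `directory`, and `query` lists); the names `PARTS`, `build_vocabulary`, and `feature_vector` and the sorted ordering of dimensions are illustrative assumptions. Each dimension is keyed by a (part, substring) pair, so the same string occurring in different parts counts as different substrings.

```python
from collections import Counter

PARTS = ("host", "domain", "directory", "query")


def build_vocabulary(split_urls):
    """List every distinct (part, substring) pair; its length is N = p+q+r+s."""
    vocabulary = []
    for part in PARTS:
        distinct = sorted({s for url in split_urls for s in url[part]})
        vocabulary.extend((part, s) for s in distinct)
    return vocabulary


def feature_vector(split_url, vocabulary):
    """Count how often each vocabulary entry occurs in one divided URL."""
    counts = Counter((part, s) for part in PARTS for s in split_url[part])
    return [counts[key] for key in vocabulary]


if __name__ == "__main__":
    urls = [
        {"host": ["www"], "domain": ["xxx.yyy"],
         "directory": ["a", "b"], "query": ["id=1"]},
        {"host": ["www2"], "domain": ["xxx.yyy"],
         "directory": ["a", "www"], "query": []},
    ]
    vocab = build_vocabulary(urls)
    for url in urls:
        print(feature_vector(url, vocab))
```

In this toy input, `www` appears both as a host substring and as a directory substring, and the two occupy separate dimensions.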

The numbers in FIGS. 7 to 9 refer to the URLs with the same row numbers in FIG. 6.

Step 103) Next, the similarity calculation unit 43 reads the URL feature vectors from the feature vector storage unit 5, calculates the similarity between the URL feature vectors, and stores it in the similarity storage unit 6.

Several methods are conceivable for calculating the similarity sim(a, b) between two URL feature vectors in the similarity calculation unit 43. For example, the inner product of the two URL feature vectors can be used. FIG. 8 shows the similarity for each pair of URLs in the URL list of FIG. 6. The similarity storage unit 6 stores the values shown in FIG. 8 (referred to there as distances between URL feature vectors).
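The inner-product similarity mentioned above can be sketched as follows. The function names `inner_product` and `similarity_matrix` are hypothetical, and the input vectors are toy values, not the ones in FIG. 7.

```python
def inner_product(a, b):
    """sim(a, b) as the inner product of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(vectors):
    """Pairwise similarities; entry [i][j] is sim(vectors[i], vectors[j])."""
    n = len(vectors)
    return [
        [inner_product(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    vectors = [
        [1, 0, 1, 1, 1, 0, 1],
        [0, 1, 1, 1, 0, 1, 0],
    ]
    for row in similarity_matrix(vectors):
        print(row)
```

With count vectors, the inner product grows with the number of shared substrings, so URLs that share more parts score higher. A normalized measure such as cosine similarity would be one possible alternative, which the text leaves open.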

Step 104) The clustering unit 44 reads the similarities (distances between URL feature vectors) from the similarity storage unit 6 and gathers similar URL feature vectors into the same cluster. Several clustering methods are conceivable; for example, the shortest distance method (single linkage) can be used (reference: Kazuaki Kishida, "Document clustering techniques: a literature review", Library and Information Science, no. 49, pp. 33-75 (2003)). Vectors with higher similarity are regarded as lying closer together.

The clustering procedure using the shortest distance method is as follows.

1) Every URL feature vector is treated as its own initial cluster, and processing starts.

2) The distance between a merged cluster and each other cluster is recalculated, taking the distance between the closest pair of members of the two clusters as the inter-cluster similarity.

Steps 1) and 2) are repeated until no clusters remain whose distance is at or below a preset threshold.
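The procedure above can be sketched as a threshold-stopped single-linkage clustering. The listed steps do not spell out the merge itself, so the sketch assumes the usual rule of merging the closest pair of clusters on each pass. It also works directly on similarities (higher means closer), so "distance at or below the threshold" becomes "similarity at or above the threshold". The name `single_linkage` and the threshold values are hypothetical.

```python
def single_linkage(sim, threshold):
    """Cluster item indices from a pairwise similarity matrix.

    Clusters are merged while the most similar pair of clusters has a
    similarity of at least `threshold`.
    """
    clusters = [[i] for i in range(len(sim))]  # each item starts alone

    def link(a, b):
        # Single linkage: similarity of the closest pair of members.
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, (x, y) = max(
            (link(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:  # no remaining pair is close enough
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]  # y > x, so index x stays valid
    return clusters


if __name__ == "__main__":
    sim = [
        [3, 2, 0, 0],
        [2, 3, 0, 0],
        [0, 0, 3, 1],
        [0, 0, 1, 3],
    ]
    print(single_linkage(sim, threshold=2))
```

Recomputing every linkage on each pass keeps the sketch short but scales poorly; for large URL sets an incremental update of inter-cluster similarities would be preferable.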

FIG. 9 shows a dendrogram of the clustering result for the URL list of FIG. 6. URLs whose feature vectors fall in the same cluster are extracted as a group of Web pages with the same content and stored in the URL classification storage unit 7.

In FIG. 9, if the threshold is set at the position indicated by the line, the following six clusters are obtained:
URL numbers 1 and 2;
URL numbers 9, 10, and 11;
URL numbers 3, 4, and 5;
URL numbers 6, 7, and 8;
URL number 12;
URL number 13.
These match the expected URL groups.

Through the processing described above, Web pages that have different URLs but are the same Web page and/or the same type of Web page can be extracted as a group from the large number of URLs contained in an access log, using only URL information (without involving the content behind the URLs).

The operations of the components of the apparatus shown in FIG. 3 can be implemented as a program, which can be installed and executed on a computer used as the Web page group extraction apparatus or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to technology for searching Web pages.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the Web page group extraction apparatus in one embodiment of the present invention.
FIG. 4 is a flowchart showing the detailed processing of the URL classification unit in one embodiment of the present invention.
FIG. 5 is an example of a URL character string in one embodiment of the present invention.
FIG. 6 is a specific example of a URL list in one embodiment of the present invention.
FIG. 7 is a specific example of URL feature vectors in one embodiment of the present invention.
FIG. 8 is a specific example of distances between URL feature vectors in one embodiment of the present invention.
FIG. 9 is an example of a clustering result in one embodiment of the present invention.

Explanation of symbols

1 Access log
2 URL extraction means, access log input unit
3 URL storage means, URL storage unit
4 URL classification unit
5 Feature vector storage means
6 Similarity storage means
7 URL classification storage means
41 Character string dividing means, character string dividing unit
42 Feature vector calculation means, feature vector calculation unit
43 Similarity calculation means, similarity calculation unit
44 Clustering means, clustering unit

Claims

A Web page group extraction method for extracting, from an access log, Web pages that have different URLs but are the same Web page or the same type of Web page as a group, the method comprising:
a URL extraction step in which access log input means extracts URLs from an input access log and stores them in URL storage means;
a character string dividing step in which character string dividing means reads a URL from the URL storage means, regards the URL as a character string, and divides it into partial character strings for each part;
a feature vector calculation step in which feature vector calculation means generates a feature vector based on the partial character strings that appear and stores it in feature vector storage means;
a similarity calculation step in which similarity calculation means reads the feature vectors from the feature vector storage means, obtains the similarity between the feature vectors, and stores it in similarity storage means; and
a clustering step in which clustering means reads the similarities between the feature vectors from the similarity storage means, performs clustering, extracts the URLs contained in each generated cluster as a Web page group, and outputs them to URL classification storage means.

The Web page group extraction method according to claim 1, wherein
in the character string dividing step, the URL is divided into partial character strings for each of a host part, a domain part, a directory part, and a query part, and
in the feature vector calculation step, the feature vector is obtained from the appearance frequency of the partial character strings.

A Web page group extraction device that extracts, from an access log, Web pages that have different URLs but are the same Web page or the same type of Web page as a group, the device comprising:
URL extraction means for extracting URLs from an input access log and storing them in URL storage means;
character string dividing means for reading a URL from the URL storage means, regarding the URL as a character string, and dividing it into partial character strings for each part;
feature vector calculation means for generating a feature vector based on the partial character strings that appear and storing it in feature vector storage means;
similarity calculation means for reading the feature vectors from the feature vector storage means, obtaining the similarity between the feature vectors, and storing it in similarity storage means; and
clustering means for reading the similarities between the feature vectors from the similarity storage means, performing clustering, extracting the URLs contained in each generated cluster as a Web page group, and outputting them to URL classification storage means.

The Web page group extraction device according to claim 3, wherein
the character string dividing means includes means for dividing the URL into partial character strings for each of a host part, a domain part, a directory part, and a query part, and
the feature vector calculation means includes means for obtaining the feature vector from the appearance frequency of the partial character strings.

A Web page group extraction program for causing a computer to function as each of the means constituting the Web page group extraction device according to claim 3 or 4.