JP2015053087A - Grouping device and element extraction device

Info

Publication number: JP2015053087A
Application number: JP2014254357A
Authority: JP
Inventors: Shigenori Tanaka; Kenji Nakamura; Satoshi Abiko; Yuhei Yamamoto; Kohei Kawano; Yuki Fukushima; Yoshinori Tsukada
Original assignee: Kansai Informatics Institute Co Ltd
Current assignee: Kansai Informatics Institute Co Ltd
Priority date: 2014-12-16
Filing date: 2014-12-16
Publication date: 2015-03-19
Anticipated expiration: 2031-02-16
Also published as: JP5830159B2

Abstract

PROBLEM TO BE SOLVED: To identify the personal-area pages of a specific user on a network and to extract the elements contained in a page.

SOLUTION: When the similarity between a first address key and a second address key is determined to be equal to or greater than a threshold, the URL is associated with a specific address group (personal area identification). In addition, points are arranged on a rendered Web page, and the hierarchical structures of the elements containing those points are integrated, among other steps, to extract the corresponding content data (element extraction).

Description

The present invention relates to a technique for analyzing Web pages on the Internet, and in particular to grouping Web pages and dividing each Web page into parts.

Various methods for analyzing Web pages have been proposed. For example, Patent Document 1 discloses a method of analyzing the URIs contained in an access log and grouping Web pages. Patent Document 2 discloses a technique for automatically extracting main content.

Patent Document 1: JP 2010-123000 A
Patent Document 2: JP 2010-117941 A

However, the technique of Patent Document 1 groups Web pages based on an access log, and what it groups are Web pages that share a similar domain or a similar URI structure. It therefore does not identify a personal area. Moreover, that technique splits each URI into a domain part, a directory part, and a query part and calculates the similarity between feature vectors, so the similarity calculation algorithm is complicated.

The technique of Patent Document 2 divides HTML data into segments according to predetermined division rules, and the rules must be determined in advance for each piece of target HTML data. This makes general-purpose article extraction difficult, and the amount of processing required to extract articles is enormous.

An object of the present invention is (i) to identify a personal area so that the person in question can be analyzed accurately with respect to that area. Another object is (ii) to extract only the important article portions easily, through division processing that focuses solely on the structure of the website.

(1) The grouping program of the present invention is
a grouping program for grouping Web page addresses,
the program causing a computer to function as:
Web page acquisition means for acquiring a Web page from a specific address;
first address key generation means for extracting links from the Web page and dividing the address of each extracted link at delimiters to generate a first address key;
second address key generation means for extracting links from the Web page acquired from the link from which the first address key was generated, and dividing the address of each extracted link at delimiters to generate a second address key;
similarity calculation means for collating the keys that match between the first address key and the second address key, counting the number of matching key pairs that appear in the same order, and calculating a similarity based on the result; and
grouping means for associating, as a specific address group, the addresses of links whose similarity is determined to be equal to or greater than a threshold.

This makes it possible to group and identify the personal area of a specific user and to analyze the group of Web pages in that personal area.

(2) In the grouping program of the present invention,
the similarity calculation means:
checks, from the front, whether a key that matches one of the keys constituting the first address key exists among the keys constituting the second address key;
when a corresponding key exists among the keys constituting the second address key, counts the number of matching key pairs toward the rear, starting from the key that follows the keys detected as matching in the first and second address keys; and
calculates, as the similarity, the ratio of the number of key pairs associated with the second address key to the total number of keys constituting the first address key.

This makes it possible to group and identify the personal area of a specific user based on the similarity calculated by collating the first address key against the second address key, and to analyze the group of Web pages in that personal area.
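The similarity calculation in (2) can be sketched as follows. This is one possible reading of the claim, not the implementation of the invention: the function name `ordered_match_similarity` is hypothetical, and the greedy front-to-back search is an assumption about how "matching key pairs in the same order" are counted.

```python
def ordered_match_similarity(first_keys, second_keys):
    """Share of first_keys that can be matched, in order, inside second_keys."""
    if not first_keys:
        return 0.0
    matched = 0
    start = 0  # position in second_keys from which the next search begins
    for key in first_keys:
        try:
            pos = second_keys.index(key, start)
        except ValueError:
            continue  # no counterpart: move on to the next key of the first address key
        matched += 1
        start = pos + 1  # later matches must lie behind this one, preserving order
    # Ratio of matched pairs to the total number of keys in the first address key.
    return matched / len(first_keys)


first = ["bbs", "cccboy", "com", "Guestbook", "BBS", "16378304"]
print(ordered_match_similarity(first, first + ["page2"]))
print(ordered_match_similarity(first, ["ad", "example", "com"]))
```

Because the denominator is the length of the first address key, the measure is asymmetric: extra keys at the end of the second address key do not lower the score.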

(3) The grouping program of the present invention further comprises
grouping search means for associating, as a specific address group, the addresses of links whose similarity is equal to or greater than a threshold, from the specific address down to a predetermined number of link levels,
wherein the grouping search means extracts links from the Web page acquired from an address associated as a specific address group by the grouping means, divides the address of each extracted link at delimiters to generate a third address key, counts the number of matching address keys between the first address key and the third address key, and calculates a similarity based on the result.

This makes it possible to group and identify the personal area of a specific user down to a predetermined number of link levels.

(4) In the grouping program of the present invention,
the first address key generation means or the second address key generation means deletes, at least before generating the first address key or the second address key, any address determined to be the specific address with only the user-identifying identifier replaced.

This makes it easy to delete the URLs of other users.

(5) In the grouping program of the present invention,
the first address key generation means or the second address key generation means deletes, at least before generating the first address key or the second address key, any address determined to match an address registered as a deletion target.

This makes it easy to delete URLs registered in advance, such as those of advertisement sites.

(6) The grouping program of the present invention further comprises
address storage means for automatically accumulating the specific address at predetermined intervals by accessing a server.

This makes it possible to automatically accumulate the specific users whose personal areas are to be grouped and identified.

(9) The element extraction program of the present invention is
an element extraction program for extracting predetermined elements from a Web page,
the program causing a computer to function as:
Web page rendering means for rendering a Web page in a display area;
coordinate acquisition means for acquiring coordinates that specify the display range of each element;
element selection means for arranging a plurality of points on the display area along the direction in which the elements are arranged, and selecting the elements whose display range contains an arranged point;
element arrangement means for arranging the hierarchical structures of the selected elements in order;
hierarchy key generation means for dividing each hierarchical structure of the elements tag by tag to generate hierarchy keys;
similarity calculation means for counting the number of matching hierarchy keys between adjacent elements and calculating a similarity based on the result; and
content data acquisition means for identifying, based on the similarity, the hierarchical structure of two or more adjacent elements and acquiring the content data corresponding to that hierarchical structure from the Web page.

This makes it possible to divide a Web page into units such as the articles it contains and to analyze each unit. For example, individual articles can be fed to a search system, only the part of a Web page judged to be harmful can be hidden, or only the posts about a particular product can be extracted to collect evaluations of that product.
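The point-based selection and hierarchy-key comparison in (9) might look like the sketch below. Several details are assumptions made only for illustration and are not stated in this section: the `Element` record, choosing the most deeply nested element when several contain a point, sampling along a vertical line at a fixed `step`, and scoring two tag paths by position-wise matches divided by the longer path length.

```python
from dataclasses import dataclass


@dataclass
class Element:
    path: str      # hierarchical structure as a tag path, e.g. "html/body/div/p"
    left: float
    top: float
    width: float
    height: float

    def contains(self, x, y):
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def select_elements(elements, x, y_start, y_end, step):
    """Place points on a vertical line and pick one element per point."""
    selected = []
    y = y_start
    while y < y_end:
        hits = [e for e in elements if e.contains(x, y)]
        if hits:
            # Assumption: the most deeply nested element is the one of interest.
            selected.append(max(hits, key=lambda e: e.path.count("/")))
        y += step
    return selected


def hierarchy_key(element):
    """Divide the hierarchical structure tag by tag."""
    return element.path.split("/")


def adjacent_similarity(a, b):
    """Assumed measure: position-wise matching tags over the longer path length."""
    ka, kb = hierarchy_key(a), hierarchy_key(b)
    same = sum(1 for p, q in zip(ka, kb) if p == q)
    return same / max(len(ka), len(kb))
```

With two bulletin-board posts rendered as sibling `p` elements and a footer `span`, adjacent posts would score 1.0 against each other and lower against the footer, which is the signal the later steps rely on.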

(10) The element extraction program of the present invention
further causes the computer to function as
content data acquisition means for generating a similarity pyramid by averaging the similarities of the hierarchical structures of adjacent elements in the upward direction, detecting whether each similarity is equal to or greater than a threshold, identifying, among the hierarchical structures of the elements contained in the base beneath a detected similarity, a hierarchical structure that matches a predetermined rule, and acquiring the content data corresponding to that hierarchical structure from the Web page.

This makes it easy to identify the hierarchical structure of the articles from which content data is to be extracted.
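A minimal sketch of the similarity pyramid in (10), under stated assumptions: each upper level is taken to be the pairwise average of neighboring values on the level below, and the detection step is simplified to returning the widest run of elements whose apex value meets the threshold. The function names are hypothetical, and the "predetermined rule" applied afterwards is not modeled here.

```python
def build_similarity_pyramid(base):
    """base[i] is the similarity between element i and element i + 1."""
    levels = [list(base)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each value one level up averages two neighboring values below it.
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(len(prev) - 1)])
    return levels


def widest_span_at_or_above(levels, threshold):
    """Return (first, last) element indices under the highest qualifying apex."""
    for level in range(len(levels) - 1, -1, -1):
        for i, value in enumerate(levels[level]):
            if value >= threshold:
                # A value at this level rests on base similarities i..i+level,
                # which relate elements i..i+level+1.
                return (i, i + level + 1)
    return None


pyramid = build_similarity_pyramid([1.0, 1.0, 0.5])
print(pyramid)
print(widest_span_at_or_above(pyramid, 0.9))
```

For four elements with adjacent similarities 1.0, 1.0 and 0.5, the top of the pyramid falls below a 0.9 threshold, so the first three elements are reported as the repeating run.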

(11) In the element extraction program of the present invention,
the content data acquisition means:
determines whether adjacent elements among the elements contained in the base of the similarity pyramid are in an inclusion relationship; and
when the adjacent elements are not in an inclusion relationship, acquires the corresponding content data for each hierarchical structure.

This makes it possible to identify accurately the hierarchical structure of the articles from which content data is to be extracted.

(12) In the element extraction program of the present invention,
the element extraction means:
determines whether the hierarchical structures of adjacent elements among the elements contained in the base of the similarity pyramid are in an inclusion relationship;
when the adjacent elements are in an inclusion relationship and the text difference is equal to or less than a threshold, deletes the included lower hierarchical structure and acquires the content data corresponding to the including upper hierarchical structure; and
when the adjacent elements are in an inclusion relationship and the text difference exceeds the threshold, deletes the including upper hierarchical structure and acquires the content data corresponding to the included lower hierarchical structure.

This makes it possible to identify more accurately the hierarchical structure of the articles from which content data is to be extracted.
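The decision rule in (11) and (12) can be illustrated as below. How the text difference is measured is not stated in this section, so the sketch assumes it is the difference in character count between the two elements' text. Treating "inclusion" as one tag path being a prefix of the other is likewise an assumption, and `resolve_pair` and `text_of` are hypothetical names.

```python
def is_ancestor(upper_path, lower_path):
    """True when lower_path lies strictly beneath upper_path."""
    return lower_path.startswith(upper_path.rstrip("/") + "/")


def resolve_pair(a, b, text_of, max_diff):
    """Decide which hierarchical structures of two adjacent elements to keep.

    a, b     : tag paths of the adjacent elements
    text_of  : mapping from tag path to the text it contains
    max_diff : threshold on the (assumed) character-count difference
    """
    if is_ancestor(a, b):
        upper, lower = a, b
    elif is_ancestor(b, a):
        upper, lower = b, a
    else:
        return [a, b]  # no inclusion relationship: keep both, as in (11)
    diff = abs(len(text_of[upper]) - len(text_of[lower]))
    # Small difference: the upper structure adds little, keep it and drop the lower.
    # Large difference: the upper structure holds other content, keep the lower.
    return [upper] if diff <= max_diff else [lower]
```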

(13) In the element extraction program of the present invention,
the element selection means arranges a plurality of points at equal intervals in a predetermined direction on the display area and selects the elements whose display range contains an arranged point.

This makes it possible to obtain the hierarchy keys needed to identify the hierarchical structure from which content data is to be extracted.

(14) In the element extraction program of the present invention,
the element selection means arranges a plurality of points on a straight line perpendicular to the predetermined direction on the display area and selects the element whose display range contains the largest number of points arranged on the same straight line.

(15) In the element extraction program of the present invention,
the element extraction means stores, in association with the element, an attribute obtained by matching against the URL of an A tag contained in the element or against natural language.

This makes it possible to distinguish and store the attributes of the content data extracted from a Web page.

In the present invention, "program" is a concept that includes not only programs directly executable by a CPU but also programs in source form, compressed programs, encrypted programs, and the like.

"Net site" is a concept that includes homepage services such as Zenryaku Profile, Mobage Town, and GREE (all trademarks), as well as other sites on which each individual can have pages.

A block diagram of the grouping apparatus 100 of the present invention.
A diagram showing the procedure for collating the first address key with the second address key.
A diagram showing the hardware configuration of the grouping apparatus 100.
A flowchart showing the processing performed by the grouping program 32 (FIG. 3).
A diagram showing example data of the URLs extracted from the acquired top page of user A.
A diagram showing the state after links matching URLs registered in the advertisement URL list 36 have been deleted.
A diagram showing the state after links matching the user filter rule have been deleted.
A diagram showing the state in which the first address keys have been generated.
A diagram showing example data of the URLs of the links extracted from the Web page of link α3 shown in FIG. 8.
A diagram showing the state after links matching URLs registered in the advertisement URL list 36 have been deleted.
A diagram showing the state after links matching the user filter rule have been deleted.
A diagram showing the state in which the second address keys have been generated.
A diagram showing the key collation method.
A diagram showing the similarity calculation method and the calculated similarities.
A diagram showing example data stored in association as a specific address group.
A block diagram of the element extraction apparatus 200 of the present invention.
A diagram showing the hardware configuration of the element extraction apparatus 200.
A flowchart showing the processing performed by the element extraction program 38 (FIG. 16).
The HTML data (source code) of a bulletin board before conversion into image data (FIG. 18A) and a display example in which the data has been converted into image data and shown on screen (FIG. 18B).
A diagram showing how the coordinates of an element are calculated.
A diagram showing points arranged on a Web page (FIG. 20A) and the hierarchical structures of the elements containing those points arranged in order (FIG. 20B).
A diagram showing the state in which hierarchy keys have been generated from each hierarchical structure.
A diagram showing the method of calculating the similarity of hierarchical structures between elements and the calculated similarities.
A diagram showing the state in which the similarity pyramid has been generated.
A diagram showing the state in which pyramids whose apex is a similarity equal to or greater than the threshold have been detected.
A diagram showing the details of the process of identifying the hierarchical structure (step S220 in FIG. 17).
A diagram showing the method of integrating the hierarchical structures of elements in an inclusion relationship (when the text difference is equal to or less than the threshold).
A diagram showing the method of integrating the hierarchical structures of elements in an inclusion relationship (when the text difference exceeds the threshold).
A diagram showing example data of the finally extracted elements and their attributes.
A diagram showing another embodiment.
A diagram showing an application example of the present invention.

1. Significance of the present invention
In recent years, various problems have been pointed out with net sites on the mobile Internet. The challenge is to automate net patrol systems, continuously monitor children's behavior on net sites, and give appropriate guidance to children who show problems.

Automating net patrol, however, requires identifying each user's personal area among the enormous number of Web pages on the net. Furthermore, when a single Web page contains posts by several people, analyzing the person in question more accurately requires extracting content from the Web page article by article rather than page by page.

The present invention contributes to automating a net patrol support system. The grouping apparatus 100 (FIG. 1) of the present invention is an effective means of identifying users' personal areas on the net among an enormous number of Web pages. To identify a personal area on the net, for example, pages of the same user are searched for along links from the viewpoint of whether the URLs (addresses) linked from each user's top page are similar, so that each user's personal area can be identified easily.

The element extraction program (FIG. 15) of the present invention is an effective means of extracting posted content when multiple items appear repeatedly on a single Web page, as on a bulletin board. That is, even when various data such as posts by different users are mixed on one page, only the portions presumed to be important can be extracted easily.

1-1. Structure of the grouping apparatus 100
First, the grouping apparatus 100 of the present invention is described with reference to FIG. 1 and other figures. FIG. 1 is a block diagram of the grouping apparatus 100 of the present invention.

As shown in FIG. 1, the grouping apparatus 100 comprises Web page acquisition means 2, first address key generation means 4, second address key generation means 6, similarity calculation means 8, address grouping means 10, and grouping search means 12.

The Web page acquisition means 2 (FIG. 1) acquires a Web page from a specific URL. For example, it acquires the top page of user A, who is registered on a net site.

The first address key generation means 4 (FIG. 1) extracts only the links from the Web page (top page) acquired by the Web page acquisition means 2 and divides the URL of each extracted link at delimiters to generate a first address key.

The second address key generation means 6 (FIG. 1) then extracts links from the Web page acquired from the link from which the first address key was generated, and divides the URL of each extracted link at delimiters to generate a second address key. Address keys are generated in this way down to a predetermined link level.

The similarity calculation means 8 (FIG. 1) collates the keys that match between the first address key generated by the first address key generation means 4 and the second address key generated by the second address key generation means 6, counts the number of matching key pairs that appear in the same order, and calculates a similarity based on the result.

The procedure for collating the first address key with the second address key is described with reference to FIG. 2.

As shown in FIG. 2, collation is performed between the first address key, generated from one URL "link URL01" extracted from the top page of user A (http:top・), and the second address key, generated from each link (for example, "link URL11") extracted from the Web page of that URL "link URL01" (http://link1・). The method of calculating the similarity from this collation is described later.

The grouping means 10 (FIG. 1) associates, as a specific address group, the URLs of links (for example, "link URL01" and "link URL11") for which the similarity between the first and second address keys is determined to be equal to or greater than the threshold.

The grouping search means 12 (FIG. 1) further associates, as a specific address group, the URLs of links whose similarity is equal to or greater than the threshold, from a specific URL (for example, the top page) down to a predetermined number of link levels.

To that end, as shown in FIG. 2, a Web page (the level two links away from the top page, http://link2・) is acquired from a URL associated as a specific address group by the grouping means 10 (for example, "link URL11"), links are extracted from that Web page, and the URL of each extracted link (for example, "link URL21") is divided at delimiters to generate a third address key. The number of matching address keys between the first address key generated from "link URL01" and the third address key generated from "link URL21" is then counted, and a similarity is calculated based on the result. The level n links away is handled in the same way: the number of matching address keys between the first address key generated from "link URL01" and the n-th address key generated from a link extracted from the Web page n links away is counted, a similarity is calculated based on the result, and the URLs of links whose similarity is equal to or greater than the threshold are associated as a specific address group.
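The level-by-level search described above can be sketched as a bounded breadth-first traversal. Everything named here is an assumption for illustration: `fetch_links` stands in for whatever retrieves and filters the links of a page, the default threshold of 0.8 and depth of 3 are arbitrary, and `make_key` and `similarity` are compact stand-ins for the key generation and collation steps.

```python
import re


def make_key(url):
    """Split a URL into keys at non-alphanumeric characters, ignoring the protocol."""
    return [k for k in re.split(r"[^0-9A-Za-z]+", re.sub(r"^\w+://", "", url)) if k]


def similarity(first, other):
    """Greedy in-order match count divided by the length of the first key."""
    matched, start = 0, 0
    for key in first:
        if key in other[start:]:
            start = other.index(key, start) + 1
            matched += 1
    return matched / len(first) if first else 0.0


def group_personal_area(first_url, fetch_links, threshold=0.8, max_depth=3):
    """Follow links level by level, always comparing against the first address key."""
    first_key = make_key(first_url)
    group, frontier, seen = [first_url], [first_url], {first_url}
    for _ in range(max_depth):
        next_frontier = []
        for page_url in frontier:
            for link in fetch_links(page_url):
                if link in seen:
                    continue
                seen.add(link)
                if similarity(first_key, make_key(link)) >= threshold:
                    group.append(link)          # same address group
                    next_frontier.append(link)  # and searched one level deeper
        frontier = next_frontier
    return group
```

Only links that pass the threshold are followed further, so pages outside the personal area are never fetched beyond the level at which they first appear.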

Grouping the URLs related to a specific user in this way allows the personal area of that user to be analyzed accurately.

1-2. Hardware configuration of the grouping apparatus 100
FIG. 3 shows the hardware configuration of the grouping apparatus 100. The grouping apparatus 100 is a computer comprising the CPU 20, RAM 22, display 24, hard disk 26, keyboard/mouse 28, and recording medium drive 30 shown in FIG. 3.

The hard disk 26 in FIG. 3 stores a grouping program 32 for identifying and grouping the personal area of a specific user, that is, the group of Web pages belonging to that user. The grouping program 32 causes a computer comprising the CPU 20, RAM 22, display 24, hard disk 26, keyboard/mouse 28, and recording medium drive 30 to function as the Web page acquisition means 2, first address key generation means 4, second address key generation means 6, similarity calculation means 8, address grouping means 10, and grouping search means 12 shown in FIG. 1.

The hard disk 26 in FIG. 3 also stores a user list 34 of the net site. For example, the user list 34 accumulates URLs corresponding to users' top pages, collected in advance from the net site at predetermined intervals.

The hard disk 26 in FIG. 3 also stores an advertisement URL list 36 used for deleting advertisement URLs. For example, the advertisement URL list 36 accumulates advertisement URLs collected in advance from the net.

1-3. Processing by the grouping program 32
FIG. 4 is a flowchart showing the processing performed by the grouping program 32 (FIG. 3).

When the grouping program 32 is started, the CPU 20 refers to the user list 34 and acquires a Web page from the URL of a specific user (step S102). For example, the top page of user A is acquired.

The CPU 20 then extracts links from the acquired Web page (step S104). Specifically, it searches the HTML data for hyperlinks (so-called A tags) and extracts link URLs such as those shown in FIG. 5.
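Step S104 can be approximated with Python's standard `html.parser` module. This is a sketch, not the patent's implementation: the class and function names are hypothetical, and resolving relative links against a base URL is an added assumption, since the text only says that A tags are searched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every A tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

A tags without an `href` attribute, such as named anchors, are skipped because they do not lead to another page.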

Next, advertisement filtering is performed to delete unnecessary links (step S106). The CPU 20 refers to the advertisement URL list 36 and deletes, from the links extracted in step S104, those whose URL matches the Web page of an advertisement site. This removes the Web pages of advertisement sites that are clearly unrelated to user A. For example, as shown in FIG. 6, all links α1 that match URLs registered in the advertisement URL list 36 are deleted. Even if a link not registered in the advertisement URL list passes through this advertisement filtering, it is normally deleted in the similarity-based deletion process described later (step S122 in FIG. 4).
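A minimal sketch of the advertisement filter, assuming that "matches" means an exact string match against the registered URLs. The function name `ad_filter` and the example advertisement URL are hypothetical.

```python
def ad_filter(links, ad_urls):
    """Drop every link whose URL is registered in the advertisement URL list."""
    blocked = set(ad_urls)  # set lookup keeps the check cheap for long lists
    return [url for url in links if url not in blocked]


links = [
    "http://bbs.cccboy.com/Guestbook/BBS/16378304/",
    "http://ad.example.com/banner",
]
print(ad_filter(links, ["http://ad.example.com/banner"]))
```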

さらに、不要なリンクを削除するために、ユーザーフィルタ処理が行われる（ステップＳ１０８）。ユーザーフィルタ処理とは、特定のＵＲＬについて、ユーザーを特定する識別子だけを置き換えたＵＲＬのリンクを削除するというものである。例えば、ＵＲＬ「http://pr.cccboy.com/16378304」のうちユーザーＩＤを示す「16378304」の部分だけが異なる場合は、別のユーザーのＵＲＬ（トップページ）であるため、当該ＵＲＬのリンクを削除することとした。 Further, user filter processing is performed to delete unnecessary links (step S108). User filter processing deletes any link whose URL differs from a specific URL only in the identifier that specifies the user. For example, a URL that differs from “http://pr.cccboy.com/16378304” only in the user ID portion “16378304” is the URL (top page) of another user, so the link with that URL is deleted.

具体的な処理としては、「http://pr.cccboy.com/*」のように、トップページのＵＲＬのうち、ユーザーＩＤの部分のみをワイルドカード（正規表現）として検索し、その結果、検出された図７に示すリンクα２（ＵＲＬ「http://pr.cccboy.com/12345678」）を削除した。 Specifically, the links are searched with the user ID portion of the top page URL replaced by a wildcard (regular expression), as in “http://pr.cccboy.com/*”, and the link α2 detected as a result (URL “http://pr.cccboy.com/12345678”), shown in FIG. 7, is deleted.
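The two filters of steps S106 and S108 can be sketched as below. The function names, the advertisement URL `http://ad.example.com/banner`, and the assumption that the user ID is the last path segment of the top page URL are illustrative only. Keeping a link that is identical to the top page URL is also an assumption, since the text only addresses URLs whose user ID differs.

```python
import re


def ad_filter(links, ad_urls):
    """Step S106: drop links whose URL is registered in the advertisement URL list."""
    registered = set(ad_urls)
    return [url for url in links if url not in registered]


def user_filter(links, top_page_url):
    """Step S108: drop links that differ from the top page URL only in the user ID.

    Assumption: the user ID is the last path segment of the top page URL.
    """
    prefix, _, _user_id = top_page_url.rstrip("/").rpartition("/")
    other_user = re.compile(re.escape(prefix) + r"/[^/]+/?")
    return [
        url for url in links
        if url == top_page_url or not other_user.fullmatch(url)
    ]


links = [
    "http://ad.example.com/banner",   # hypothetical advertisement URL
    "http://pr.cccboy.com/12345678",  # another user's top page (link α2)
    "http://bbs.cccboy.com/Guestbook/BBS/16378304/",
]
remaining = ad_filter(links, ["http://ad.example.com/banner"])
remaining = user_filter(remaining, "http://pr.cccboy.com/16378304")
print(remaining)
```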

上記ステップＳ１０６およびステップＳ１０８において不要なリンクを削除した後、ＣＰＵ２０は、残りのリンクについて、ＵＲＬを区切り文字で分割して第１のアドレスキーを生成する（ステップＳ１１０）。 After deleting unnecessary links in Step S106 and Step S108, the CPU 20 generates a first address key by dividing the URL with a delimiter for the remaining links (Step S110).

具体的には、各リンクのＵＲＬを、プロトコル名を除いて、区切り文字（英数字以外のスラッシュ「/」、ピリオド「.」など）でテキストを分割することで、各キーが生成される。例えば、図８に示すリンクα３のＵＲＬ「http://bbs.cccboy.com/Guestbook/BBS/16378304/」から、第１のアドレスキーとして、６つのキー「bbs」、「cccboy」、「com」、「Guestbook」、「BBS」「16378304」が生成される。ＣＰＵ２０は、これらのキーを配列順序と併せて、後述する照合処理のためにＲＡＭ２２に記憶する。さらに、第１のアドレスキーを生成する元になった図８に示す各リンクのＵＲＬを、ユーザーＡのトップページに属するアドレス群として関連付けて記憶する（ステップＳ１１１）。 Specifically, each key is generated by splitting the URL of each link, excluding the protocol name, at delimiters (non-alphanumeric characters such as the slash “/” and the period “.”). For example, from the URL “http://bbs.cccboy.com/Guestbook/BBS/16378304/” of the link α3 shown in FIG. 8, six keys, “bbs”, “cccboy”, “com”, “Guestbook”, “BBS”, and “16378304”, are generated as the first address key. The CPU 20 stores these keys together with their order in the RAM 22 for the collation processing described later. Further, the URL of each link shown in FIG. 8 from which a first address key was generated is stored in association with the address group belonging to the top page of user A (step S111).
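The key generation of step S110 can be sketched as a small tokenizer. The function name `make_address_key` and the use of regular expressions are assumptions made for illustration; the embodiment only specifies that the protocol name is dropped and the rest is split at non-alphanumeric delimiters.

```python
import re


def make_address_key(url):
    """Split a URL into an ordered list of keys (an address key)."""
    # Drop the protocol name, e.g. "http://".
    without_scheme = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    # Split at every run of non-alphanumeric characters and drop empty pieces.
    return [key for key in re.split(r"[^0-9A-Za-z]+", without_scheme) if key]


print(make_address_key("http://bbs.cccboy.com/Guestbook/BBS/16378304/"))
# ['bbs', 'cccboy', 'com', 'Guestbook', 'BBS', '16378304']
```

Returning a list rather than a set preserves the order of the keys, which the later collation relies on.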

次に、ＣＰＵ２０は、第１のアドレスキーを生成した各リンクのＵＲＬからＷｅｂページを取得し（ステップＳ１１２）、当該Ｗｅｂページからリンクを抽出する（ステップＳ１１４）。図８に示すリンクα３のＷｅｂページから抽出されたリンクを、図９に示す。 Next, the CPU 20 acquires a web page from the URL of each link that generated the first address key (step S112), and extracts a link from the web page (step S114). A link extracted from the Web page of the link α3 shown in FIG. 8 is shown in FIG.

さらに、ステップＳ１０６およびステップＳ１０８と同様、不要ＵＲＬの削除処理が行われる（ステップＳ１１６）。例えば、図１０に示すように、ＣＰＵ２０は、ステップＳ１１４で抽出したリンクから、広告サイトのＷｅｂページとＵＲＬが一致するリンクβ１、β２を削除する（ステップＳ１０６と同じ処理）。さらに、ＣＰＵ２０は、ユーザーフィルタ処理を行って、図１１に示すように、特定のＵＲＬにおいてユーザーを特定する識別子だけを置き換えたＵＲＬのリンクβ３〜β５を削除する（ステップＳ１０８と同じ処理）。 Furthermore, unnecessary URL deletion processing is performed in the same manner as in steps S106 and S108 (step S116). For example, as shown in FIG. 10, the CPU 20 deletes the links β1 and β2 whose URLs match the Web page of the advertisement site from the link extracted in step S114 (the same processing as step S106). Further, the CPU 20 performs user filter processing, and deletes URL links β3 to β5 obtained by replacing only the identifier for specifying the user in the specific URL as shown in FIG. 11 (the same processing as step S108).

上記ステップＳ１１６において不要なリンクを削除した後、ＣＰＵ２０は、ステップＳ１１０と同様、残りの各リンクについて、ＵＲＬを所定のテキスト単位に分割し、第２のアドレスキーを生成する（ステップＳ１１８）。 After deleting unnecessary links in step S116, the CPU 20 divides the URL into predetermined text units for the remaining links and generates a second address key, as in step S110 (step S118).

前述のように、各リンクのＵＲＬを、プロトコル名を除いて、区切り文字（英数字以外のスラッシュ「/」、ピリオド「.」など）でテキスト単位に分割することで、各キーが生成される。例えば、図１２に示すリンクβ６〜β９それぞれのＵＲＬから、各リンクについての第２のアドレスキーが生成される（図１２を参照）。 As described above, each key is generated by splitting the URL of each link, excluding the protocol name, into text units at delimiters (non-alphanumeric characters such as the slash “/” and the period “.”). For example, a second address key is generated for each of the links β6 to β9 from its URL (see FIG. 12).

なお、図１２に示す例では、「=」の後に英数字列が連続する場合の当該英数字列（リンクβ７〜β９における「0」など）はキーとして生成しないように設定している。一般に、「＝」の後には、コメントなどの記述を特定するＩＤが入り、かかるＩＤは同一人のコメントであったとしても、各コメント毎に異なる。したがって、これらをキーとして同一人の判定に用いると、邪魔になると考えられるからである。このため、結果として、リンクβ７〜β９の第２のアドレスキーは同じものとなっている。 In the example shown in FIG. 12, when the alphanumeric string continues after “=”, the alphanumeric string (such as “0” in links β7 to β9) is set not to be generated as a key. In general, after “=”, an ID for specifying a description such as a comment is entered. Even if the ID is a comment of the same person, the ID is different for each comment. Therefore, if these are used as a key for determination of the same person, it is considered to be an obstacle. Therefore, as a result, the second address keys of the links β7 to β9 are the same.
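The rule that an alphanumeric string following “=” is not turned into a key can be sketched as below. The function name and the query parameter `page` are hypothetical; the actual URLs of the links β7 to β9 appear only in FIG. 12 and are not reproduced here.

```python
import re


def make_second_address_key(url):
    """Split a URL into keys, ignoring the alphanumeric run that follows '='."""
    without_scheme = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    # Remove the ID that directly follows "=", keeping the "=" as a delimiter.
    without_ids = re.sub(r"=[0-9A-Za-z]+", "=", without_scheme)
    return [key for key in re.split(r"[^0-9A-Za-z]+", without_ids) if key]


# Hypothetical URLs that differ only in the value after "=".
key_a = make_second_address_key("http://bbs.cccboy.com/Guestbook/BBS/16378304/?page=0")
key_b = make_second_address_key("http://bbs.cccboy.com/Guestbook/BBS/16378304/?page=1")
print(key_a)
print(key_a == key_b)
```

Because the per-comment ID is dropped, URLs that differ only in that ID yield the same address key, which matches the outcome described for the links β7 to β9.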

さらに、ＣＰＵ２０は、ステップＳ１１０で生成した第１のアドレスキーと、ステップＳ１１８で算出した第２のアドレスキーとの間で、それぞれを構成するキーを照合することにより、同じ順序で、かつ、一致するキーの組み合わせ数を計数し、その結果に基づいて類似度を算出する（ステップＳ１２０）。この実施形態では、途中に異なるキーが存在したとしても、出現順序が同じであれば一致するキーであるとしている。 Further, the CPU 20 collates the keys constituting the first address key generated in step S110 with the keys constituting the second address key generated in step S118, counts the number of key pairs that match and appear in the same order, and calculates the similarity based on the result (step S120). In this embodiment, keys are treated as matching as long as they appear in the same order, even if different keys lie between them.

キーの照合方法について、図１３ａを用いて説明する。図１３ａのパターン１に示すように、第１のアドレスキーが｛a,b,c｝の３つであり、第２のアドレスキーが｛a,d,b,e,c｝の５つであるとき、まず、第１のアドレスキーを構成するキーの１つ｛a｝と一致するキーが第２のアドレスキーを構成するキーの中に存在するか否かを前方から照合する。これにより、まず、図１３ａに示す｛a｝の組み合わせが検出される。 The key collation method will be described with reference to FIG. 13a. As shown in pattern 1 of FIG. 13a, when the first address key consists of the three keys {a, b, c} and the second address key consists of the five keys {a, d, b, e, c}, the second address key is first scanned from the front for a key that matches the first key {a} of the first address key. As a result, the pair {a} shown in FIG. 13a is detected first.

さらに、第１のアドレスキーを構成する次のキー｛b｝が第２のアドレスキーに存在するか、第２のアドレスキーのキー｛d｝から照合することにより、後方に向けて一致するキーの組み合わせ数を計数する。これにより、図１３ａに示す｛b｝｛c｝の組み合わせが検出され、一致するキーの出現順序が同じである組み合わせ数は３となる。 Next, the second address key is scanned from its key {d} onward to check whether the next key {b} of the first address key is present, and the number of matching key pairs is counted while moving toward the rear in this way. As a result, the pairs {b} and {c} shown in FIG. 13a are detected, and the number of matching pairs that appear in the same order is 3.

なお、第１のアドレスキーを｛a,b,c,d｝の４つとしたパターン２の場合は、キー｛c｝同士の組み合わせが検出された後は、第１のアドレスキーの４番目のキー｛d｝は、第２のアドレスキーの５番目の｛c｝から後方に向けて検出するため、図１３ａ中、点線の矢印で示す第２のアドレスキーの２番目のキー｛d｝との組み合わせはカウントされず、パターン１と同様に、組み合わせ数は３となる。 In the case of pattern 2 with four first address keys {a, b, c, d}, the fourth address of the first address key is detected after the combination of the keys {c} is detected. Since the key {d} is detected backward from the fifth {c} of the second address key, the second key {d} of the second address key indicated by a dotted arrow in FIG. The combinations are not counted, and the number of combinations is 3 as in the case of the pattern 1.
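The front-to-back collation described for patterns 1 and 2 can be sketched as a single forward scan. The function name `count_ordered_matches` is an assumption; the logic follows the description above, in which each key of the first address key is searched for only behind the position of the previous match.

```python
def count_ordered_matches(first_keys, second_keys):
    """Count keys of first_keys that occur in second_keys in the same order."""
    count = 0
    start = 0  # search position in second_keys; moves only toward the rear
    for key in first_keys:
        try:
            position = second_keys.index(key, start)
        except ValueError:
            continue  # no match behind the previous match; try the next key
        count += 1
        start = position + 1
    return count


second = ["a", "d", "b", "e", "c"]
print(count_ordered_matches(["a", "b", "c"], second))       # pattern 1 -> 3
print(count_ordered_matches(["a", "b", "c", "d"], second))  # pattern 2 -> 3
```

In pattern 2 the key {d} is searched for only after the fifth key {c}, so the earlier {d} is not counted. Note that such a greedy scan can return fewer matches than a full longest-common-subsequence computation for some inputs; for the two patterns above both give 3.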

類似度は、上記組み合わせ数などに基づいて、次式から算出することができる。なお、次式における類似度Ext(wn_i, wn_ij)は、第１のアドレスキーwn_iと、第２のアドレスキーwn_ijの間の類似度を表す。 The similarity can be calculated from the following formula based on the number of matching pairs and the like. In the formula, the similarity Ext(wn_i, wn_ij) represents the similarity between the first address key wn_i and the second address key wn_ij.

Ext(wn_i, wn_ij) = count(LCS(Element(wn_i), Element(wn_ij))) / count(Element(wn_i))

ここで、式の分母count(Element(wn_i))は、第１のアドレスキーを構成するキーElement(wn_i)の総数を表す。また、式の分子count(LCS(Element(wn_i), Element(wn_ij)))は、第１のアドレスキーを構成するキーElement(wn_i)と、第２のアドレスキーを構成するキーElement(wn_ij)との間で、同じ順序で、かつ、一致するキーの組み合わせ数を表す。 Here, the denominator count(Element(wn_i)) represents the total number of keys Element(wn_i) constituting the first address key. The numerator count(LCS(Element(wn_i), Element(wn_ij))) represents the number of key pairs that match and appear in the same order between the keys Element(wn_i) constituting the first address key and the keys Element(wn_ij) constituting the second address key.

上記の式から、例えば、図１３ｂに示す第１のアドレスキーwn₁と、リンクβ６の第２のアドレスキーwn₁₁との間の類似度Ext(wn₁,wn₁₁)は、Ext(wn₁,wn₁₁)= 0 / 6 = 0となる。また、図１３ｂに示す第１のアドレスキーwn₁と、リンクβ７の第２のアドレスキーwn₁₂との間の類似度Ext(wn₁,wn₁₂)は、Ext(wn₁,wn₁₂)= 5 / 6 = 0.83333....となる（Ext(wn₁,wn₁₃)、Ext(wn₁,wn₁₄)も同じ）。 From the above formula, for example, the similarity Ext(wn₁, wn₁₁) between the first address key wn₁ shown in FIG. 13b and the second address key wn₁₁ of the link β6 is Ext(wn₁, wn₁₁) = 0 / 6 = 0. The similarity Ext(wn₁, wn₁₂) between the first address key wn₁ shown in FIG. 13b and the second address key wn₁₂ of the link β7 is Ext(wn₁, wn₁₂) = 5 / 6 = 0.83333.... (the same holds for Ext(wn₁, wn₁₃) and Ext(wn₁, wn₁₄)).

ＣＰＵ２０は、類似度がしきい値以下であると判断したリンクを除去する（ステップＳ１２２）。例えば、図１３ｂにおいて類似度が「0」と算出されたリンクβ６（図１２）が削除される。一方、ＣＰＵ２０は、類似度がしきい値（例えば、「0.6」）以上であると判断したリンクについては、そのＵＲＬを、特定のアドレス群として関連付けて記憶する（ステップＳ１２４）。例えば、図１３ｂにおいて類似度が「0.83333....」と算出されたリンクβ７〜β９（図１２）のＵＲＬは、ユーザーＡのトップページに属するアドレス群として関連付けられ、グループ化される。 The CPU 20 removes the link for which the similarity is determined to be less than or equal to the threshold (step S122). For example, the link β6 (FIG. 12) whose similarity is calculated as “0” in FIG. 13B is deleted. On the other hand, the CPU 20 associates and stores the URL as a specific address group for the link whose similarity is determined to be greater than or equal to a threshold (for example, “0.6”) (step S124). For example, the URLs of the links β7 to β9 (FIG. 12) whose similarity is calculated as “0.83333...” In FIG. 13B are associated and grouped as an address group belonging to the top page of the user A.
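The similarity calculation and the threshold decision of steps S120 to S124 can be sketched as follows. Here LCS is read as the standard longest common subsequence computed by dynamic programming, which is an interpretation of the formula rather than a confirmed detail of the embodiment. The function names and the two candidate address keys (`link_x`, `link_y`) are hypothetical, since the keys of FIG. 12 and FIG. 13b are not reproduced in the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two key lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(first_keys, second_keys):
    """Ext = count(LCS(first, second)) / count(first)."""
    if not first_keys:
        return 0.0
    return lcs_length(first_keys, second_keys) / len(first_keys)


def group_links(first_keys, candidates, threshold=0.6):
    """Keep the links whose second address key reaches the threshold."""
    return [name for name, keys in candidates.items()
            if similarity(first_keys, keys) >= threshold]


wn1 = ["bbs", "cccboy", "com", "Guestbook", "BBS", "16378304"]
# Hypothetical second address keys, chosen to give 5/6 and 0/6.
candidates = {
    "link_x": ["bbs", "cccboy", "com", "Guestbook", "BBS", "id"],
    "link_y": ["news", "example", "org"],
}
print(group_links(wn1, candidates))  # ['link_x']
```

With the example threshold of 0.6, the candidate scoring 5/6 is kept in the address group and the candidate scoring 0 is removed.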

起点となったＵＲＬから所定のリンク数（例えば、１０リンク）に達するまで、上記処理を繰り返す（ステップＳ１２６）。なお、同じ記事であっても次のページとして取り扱われていれば、リンク数はカウントされることになる。 The above processing is repeated until a predetermined number of links (for example, 10 links) from the starting URL is reached (step S126). Note that even when the same article continues on a separate next page, that page counts toward the number of links.

以上の処理により、特定のアドレス群（例えば、特定のサイトユーザーについてのＷｅｂページ）として関連付けて記憶されたデータ例を、図１４に示す。図１４に示すグループ化したＵＲＬの例では、ユーザーＡのトップページのＵＲＬに、ステップＳ１１１において記憶された第１のアドレスキーを生成する元となったリンクのＵＲＬ（図８）、およびステップＳ１２２において類似度の判定により削除されなかったＵＲＬ（図１４）が、ユーザーＡの個人領域として関連付けて記憶されている。 FIG. 14 shows an example of data stored in association as a specific address group (for example, the Web pages of a specific site user) by the above processing. In the example of grouped URLs shown in FIG. 14, the URLs of the links from which the first address keys were generated and which were stored in step S111 (FIG. 8), together with the URLs that were not deleted by the similarity determination in step S122 (FIG. 14), are stored in association with the URL of the top page of user A as the personal area of user A.

個人領域が全てトップページのＵＲＬに文字列を追加したものであるときは、トップページのＵＲＬと前方一致の関係にあるか否かを検索することで、個人領域のＵＲＬをグループ化することも可能である。しかし、個人領域の全てがトップページのＵＲＬに文字列を追加したものではない場合（例えば、トップページのＵＲＬの間に異なる文字列が挿入されているようなとき）には、本実施形態による上記処理が特に有効である。 When every URL in the personal area consists of the top page URL with a character string appended, the URLs of the personal area can also be grouped simply by checking whether each URL begins with the top page URL (prefix match). However, when not every URL in the personal area is formed in this way (for example, when a different character string is inserted in the middle of the top page URL), the above processing according to the present embodiment is particularly effective.

１−４．他の実施形態 1-4. Other Embodiments
なお、上記実施形態では、リンク階層数を１または２までとしたが、これに限定されるものではなく、３以上（例えば、１０リンク）のリンク階層先まで探索してもよい。 In the above embodiment, the number of link levels is limited to 1 or 2; however, the present invention is not limited to this, and the search may extend to link destinations 3 or more levels deep (for example, 10 links).

なお、上記実施形態では、トップページに含まれるリンクのＵＲＬから第１のアドレスキーを生成したが（図２など）、トップページ自体のＵＲＬから第１のアドレスキーを生成してもよい。この場合、トップページに含まれるリンクのＵＲＬから第２のアドレスキーを生成すればよい。 In the above embodiment, the first address key is generated from the URL of the link included in the top page (FIG. 2 and the like), but the first address key may be generated from the URL of the top page itself. In this case, the second address key may be generated from the URL of the link included in the top page.

なお、上記実施形態では、類似値算出の式において、第１のアドレスキーの総数を分母としたが、これに限定されるものではなく、第２のアドレスキーの総数を分母としたり、これらのいずれか大きい方を分母としてもよい。 In the above embodiment, the total number of the first address keys is used as the denominator in the similarity value calculation formula. However, the present invention is not limited to this, and the total number of the second address keys may be used as the denominator. The larger one may be used as the denominator.

なお、上記実施形態では、一致するキーの数を前方から後方まで照合することとしたが、これに限られるものではなく、後方から前方に向けて照合してもよい。 In the above embodiment, the number of matching keys is collated from the front to the rear. However, the present invention is not limited to this, and the collation may be performed from the rear to the front.

なお、上記実施形態では、一致するキーの数を前方から後方まで照合することとしたが、これに限られるものではなく、照合を途中で停止してもよい。例えば、一致しないキーが出現した場合に、以降の照合を行わないようにしてもよい。 In the above embodiment, the number of matching keys is collated from the front to the rear. However, the present invention is not limited to this, and the collation may be stopped halfway. For example, when a key that does not match appears, the subsequent verification may not be performed.

なお、上記実施形態では、類似度を算出するためにキーの一致数だけを考慮したが、特定のキーに重みを持たせて類似度を算出してもよい。例えば、ドメイン名から生成されるキーに、他のキーよりも重み付けを行うようにしてもよい。 In the above embodiment, only the number of matching keys is considered in order to calculate the similarity, but the similarity may be calculated by giving a weight to a specific key. For example, a key generated from a domain name may be weighted more than other keys.
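The variations on the denominator and on key weighting mentioned in this section can be sketched together as below. The function names, the `denominator` parameter values, the example keys, and the way weights are applied to both numerator and denominator are assumptions; the text only states that such variations are possible.

```python
def ordered_matches(first_keys, second_keys):
    """Keys of first_keys found in second_keys by a front-to-back scan."""
    matched, start = [], 0
    for key in first_keys:
        if key in second_keys[start:]:
            start = second_keys.index(key, start) + 1
            matched.append(key)
    return matched


def weighted_similarity(first_keys, second_keys, weights=None, denominator="first"):
    """Similarity with optional per-key weights and a selectable denominator."""
    weights = weights or {}

    def total(keys):
        return sum(weights.get(key, 1.0) for key in keys)

    if denominator == "first":
        base = first_keys
    elif denominator == "second":
        base = second_keys
    else:  # "larger": whichever address key has more keys
        base = max(first_keys, second_keys, key=len)
    den = total(base)
    return total(ordered_matches(first_keys, second_keys)) / den if den else 0.0


first = ["pr", "cccboy", "com", "a"]          # hypothetical keys
second = ["bbs", "cccboy", "com", "b", "c"]   # hypothetical keys
print(weighted_similarity(first, second))                        # 0.5
print(weighted_similarity(first, second, denominator="second"))  # 0.4
print(weighted_similarity(first, second, weights={"cccboy": 3.0, "com": 3.0}))  # 0.75
```

Weighting the domain-derived keys raises the similarity from 0.5 to 0.75 in this example, which illustrates how the domain name can be given more influence than other keys.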

なお、上記実施形態では、ＵＲＬを対象としたが、他のアドレス（例えば、ＵＲＮ）を対象としてもよい。ＵＲＬとＵＲＮはいずれもＵＲＩの概念に含まれるものである。なお、ＵＲＬの削除処理を設けなくてもよい。 In the above embodiment, the URL is targeted, but other addresses (for example, URN) may be targeted. Both URL and URN are included in the concept of URI. It is not necessary to provide URL deletion processing.

上記のようにして、特定ユーザーの個人領域を特定することができるが、同じＷｅｂページ内に複数の個人による書き込みが存在する場合がある。そのような場合には、他のユーザーが書き込んだ部分と区別できればより高度な解析を行うことができる。Ｗｅｂページを特定の単位で分割するための処理について以下に説明する。 As described above, the personal area of the specific user can be specified, but there may be cases where writing by a plurality of individuals exists in the same Web page. In such a case, more advanced analysis can be performed if it can be distinguished from the part written by other users. Processing for dividing a Web page in a specific unit will be described below.

２−１．エレメント抽出装置２００の構造（図１５） 2-1. Structure of Element Extraction Device 200 (FIG. 15)
本発明のエレメント抽出装置２００は、Ｗｅｂページから所定のエレメントを抽出するために、図１５に示すＷｅｂページ展開手段５２、座標取得手段５４、エレメント選択手段５６、エレメント配列手段５８、階層キー生成手段６０、類似度算出手段６２、内容データ取得手段６４を備える。これらの手段を用いることで、Ｗｅｂページの中に膨大な数のエレメントが含まれる場合でも、重要なエレメントだけを抽出するための記事抽出ルールが、Ｗｅｂページ毎に自動的に決定されるため，汎用性の高い記事抽出が可能となる。 In order to extract predetermined elements from a Web page, the element extraction device 200 of the present invention includes the Web page expansion means 52, coordinate acquisition means 54, element selection means 56, element arrangement means 58, hierarchy key generation means 60, similarity calculation means 62, and content data acquisition means 64 shown in FIG. 15. With these means, even when a Web page contains an enormous number of elements, an article extraction rule for extracting only the important elements is determined automatically for each Web page, which enables highly versatile article extraction.

Ｗｅｂページ展開手段５２（図１５）は、Ｗｅｂページ（ＨＴＭＬ文書）を表示領域に展開し、座標取得手段５４（図１５）は、表示されたＷｅｂページ（ＨＴＭＬ文書）に含まれる各エレメントの表示範囲を特定する座標を取得する。ここで、「エレメント」とは、表示位置があるＨＴＭＬ文書の要素を意味する。 Web page expansion means 52 (FIG. 15) expands the Web page (HTML document) in the display area, and coordinate acquisition means 54 (FIG. 15) displays each element included in the displayed Web page (HTML document). Get the coordinates that specify the range. Here, “element” means an element of an HTML document having a display position.

エレメント選択手段５６（図１５）は、表示領域上に、エレメントの配置方向に複数の点を配置し、配置した点を表示範囲に含むエレメントを選択する。エレメント配列手段５８（図１５）は、選択した前記エレメントの階層構造を順に配列する。階層キー生成手段６０（図１５）は、前記エレメントの各階層構造をタグ単位で分割して階層キーを生成する。 The element selection means 56 (FIG. 15) arranges a plurality of points in the element arrangement direction on the display area, and selects an element including the arranged points in the display range. The element arrangement means 58 (FIG. 15) arranges the hierarchical structure of the selected elements in order. The hierarchy key generating means 60 (FIG. 15) generates a hierarchy key by dividing each hierarchical structure of the element in units of tags.

類似度算出手段６２（図１５）は、隣接するエレメントの間で、一致する階層キーの数を計数し、その結果に基づいて類似度を算出する。内容データ取得手段６４（図１５）は、前記類似度に基づいて、２以上の隣接するエレメントの階層構造を特定し、当該階層構造に対応する内容データをＷｅｂページから取得する。 The similarity calculation means 62 (FIG. 15) counts the number of matching hierarchical keys between adjacent elements, and calculates the similarity based on the result. The content data acquisition unit 64 (FIG. 15) specifies the hierarchical structure of two or more adjacent elements based on the similarity, and acquires content data corresponding to the hierarchical structure from the Web page.

２−２．エレメント抽出装置２００のハードウェア構成 2-2. Hardware Configuration of Element Extraction Device 200
図１６に、エレメント抽出装置２００のハードウェア構成を示す。エレメント抽出装置２００は、図１６に示すＣＰＵ２０、ＲＡＭ２２、ディスプレイ２４、ハードディスク２６、キーボード／マウス２８、記録媒体ドライブ３０を備えたコンピュータで構成される。 FIG. 16 shows the hardware configuration of the element extraction device 200. The element extraction device 200 is a computer including the CPU 20, RAM 22, display 24, hard disk 26, keyboard/mouse 28, and recording medium drive 30 shown in FIG. 16.

図１６のハードディスク２６には、エレメント抽出プログラム３８が記録されている。エレメント抽出プログラム３８は、ＣＰＵ２０、ＲＡＭ２２、ディスプレイ２４、ハードディスク２６、キーボード／マウス２８、記録媒体ドライブ３０を備えたコンピュータを、図１５に示すＷｅｂページ展開手段５２、座標取得手段５４、エレメント選択手段５６、エレメント配列手段５８、階層キー生成手段６０、類似度算出手段６２、内容データ取得手段６４として機能させるプログラムである。 An element extraction program 38 is recorded on the hard disk 26 of FIG. 16. The element extraction program 38 causes a computer including the CPU 20, RAM 22, display 24, hard disk 26, keyboard/mouse 28, and recording medium drive 30 to function as the Web page expansion means 52, coordinate acquisition means 54, element selection means 56, element arrangement means 58, hierarchy key generation means 60, similarity calculation means 62, and content data acquisition means 64 shown in FIG. 15.

また、図１６のハードディスク２６には、ブラウザ４０が記憶されている。ブラウザ４０は、コンピュータ内部において、Ｗｅｂページ（ＨＴＭＬ文書）を仮想的に表示領域（例えば、縦２０４０px×横１２００px）に展開する。 In addition, a browser 40 is stored in the hard disk 26 of FIG. The browser 40 virtually expands a Web page (HTML document) in a display area (for example, vertical 2040 px × horizontal 1200 px) inside the computer.

また、図１６のハードディスク２６には、エレメント座標ＤＢ４２が記憶されている。エレメント座標ＤＢ４２には、ブラウザ４０の表示領域におけるエレメントの表示範囲を示す座標（例えば、表示領域上にX、Y座標を設定した場合における各エレメントの左上および右下の座標）が蓄積される。 Further, the element coordinate DB 42 is stored in the hard disk 26 of FIG. In the element coordinate DB 42, coordinates indicating the display range of the element in the display area of the browser 40 (for example, the upper left and lower right coordinates of each element when the X and Y coordinates are set on the display area) are accumulated.

２−３．エレメント抽出処理のフロー 2-3. Flow of Element Extraction Processing
図１７は、エレメント抽出プログラム３８（図１６）による処理を示すフローチャートである。なお、以下の例では、掲示板のページに複数人の書き込みがあった場合に、ユーザーによって書き込まれた内容の単位で、テキストを抽出する場合について説明する。 FIG. 17 is a flowchart showing the processing performed by the element extraction program 38 (FIG. 16). The following example describes a case in which several people have posted on a bulletin board page and the text is extracted in units of the content written by each user.

エレメント抽出プログラム３８が起動されると、ブラウザ４０（図１６）は、Ｗｅｂから抽出対象となるＨＴＭＬ文書を取得して、エレメントを表示領域上に配置（すなわち、イメージデータに変換）する（ステップＳ２０２）。図１８は、イメージデータに変換される前の掲示板のＨＴＭＬデータ（ソースコード）（図１８Ａ）およびこれをイメージデータに変換して画面上に表示した表示例（図１８Ｂ）である。 When the element extraction program 38 is activated, the browser 40 (FIG. 16) acquires an HTML document to be extracted from the Web, and arranges the elements on the display area (that is, converts them into image data) (step S202). ). FIG. 18 shows HTML data (source code) (FIG. 18A) of the bulletin board before being converted into image data, and a display example (FIG. 18B) in which this is converted into image data and displayed on the screen.

さらに、ＣＰＵ２０は、イメージデータ上における各エレメントの表示領域を示す座標（例えば、左上および右下）を取得し、エレメント座標ＤＢ４２（図１６）に記憶する（ステップＳ２０４）。図１９に、エレメントの座標を算出する方法を示す。例えば、図１９において「山田花子」を表示するエレメントEl4の表示範囲を示す左上座標（X4,Y4）は、X4 = x1 + x2 + x3 + x4、Y4 = y1 + y2 + y3 + y4で算出される。右下座標は、各エレメントの高さおよび幅を左上座標に加算すれば得られる。なお、図１９に示すような画像を実際に画面上に表示しない場合であっても、コンピュータ内部において（すなわち、図１６のディスプレイ２４に表示せずに）上記処理を行うことは可能である。 Further, the CPU 20 acquires coordinates (for example, upper left and lower right) indicating the display range of each element on the image data and stores them in the element coordinate DB 42 (FIG. 16) (step S204). FIG. 19 shows a method for calculating the coordinates of an element. For example, the upper left coordinates (X4, Y4) of the display range of the element El4 that displays “Hanako Yamada” in FIG. 19 are calculated as X4 = x1 + x2 + x3 + x4 and Y4 = y1 + y2 + y3 + y4. The lower right coordinates are obtained by adding the width and height of the element to the upper left coordinates. Even when an image such as that shown in FIG. 19 is not actually displayed on the screen, the above processing can be performed inside the computer (that is, without displaying anything on the display 24 of FIG. 16).
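The coordinate calculation of step S204 can be sketched as a sum of parent-relative offsets. The function name `element_box` and the numeric offsets are hypothetical; the values of x1 to x4 and y1 to y4 appear only in FIG. 19.

```python
def element_box(offsets, width, height):
    """Return (left, top, right, bottom) of an element in page coordinates.

    offsets: (x, y) offsets of each ancestor and of the element itself,
    each measured relative to its parent, outermost first.
    """
    left = sum(x for x, _ in offsets)
    top = sum(y for _, y in offsets)
    return (left, top, left + width, top + height)


# Hypothetical offsets (x1, y1) ... (x4, y4) for the element El4.
offsets = [(0, 0), (10, 20), (5, 5), (2, 3)]
print(element_box(offsets, width=100, height=16))  # (17, 28, 117, 44)
```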

つぎに、エレメントの選択処理に移行する。まず、ＣＰＵ２０は、エレメントの長さ方向（幅方向）に対して垂直方向（縦方向）の仮想直線を設定する（ステップＳ２０６）。図２０に示す例では、１本の直線Ｌ１を中央に設定している。これは、特にプロフなどにおいては、重要な要素は中央に配置されている可能性が高いためである。 Next, the process proceeds to element selection. First, the CPU 20 sets a virtual straight line in the direction perpendicular (vertical) to the length direction (width direction) of the elements (step S206). In the example shown in FIG. 20, a single straight line L1 is set at the center. This is because important elements are likely to be placed in the center, particularly on profile pages and the like.

さらに、ＣＰＵ２０は、設定した上記直線上に、所定の間隔で点をプロットする（ステップＳ２０８）。図２０に示す例では、等間隔（例えば、１０px〜２０px程度）で点をプロットしている。ＣＰＵ２０は、エレメント座標ＤＢ４２（図１６）を参照して、プロットされた点を表示範囲内に含むエレメントを抽出し、順に配列する（ステップＳ２１０）。 Further, the CPU 20 plots points at predetermined intervals on the straight line thus set (step S208). In the example shown in FIG. 20, the points are plotted at equal intervals (for example, about 10 px to 20 px). The CPU 20 refers to the element coordinate DB 42 (FIG. 16), extracts the elements whose display ranges contain the plotted points, and arranges them in order (step S210).
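Steps S206 to S210 can be sketched as below. The function names, the box coordinates, and in particular the choice of the innermost (smallest) element when several elements contain the same point are assumptions; the text only states that elements containing a plotted point are extracted and arranged in order.

```python
def plot_points(width, height, step=20):
    """Points at equal intervals on a vertical line through the page center."""
    x = width // 2
    return [(x, y) for y in range(0, height, step)]


def select_elements(points, boxes):
    """boxes: (path, left, top, right, bottom). Returns one path per hit point."""
    selected = []
    for px, py in points:
        hits = [b for b in boxes if b[1] <= px <= b[3] and b[2] <= py <= b[4]]
        if hits:
            # Assumption: take the innermost element, i.e. the smallest box.
            innermost = min(hits, key=lambda b: (b[3] - b[1]) * (b[4] - b[2]))
            selected.append(innermost[0])
    return selected


# Hypothetical entries of the element coordinate DB.
boxes = [
    ("/body", 0, 0, 1200, 2040),
    ("/body/div", 0, 0, 1200, 30),
    ("/body/form/table/tr/td/div", 100, 40, 1100, 80),
]
points = plot_points(width=1200, height=100, step=20)
print(select_elements(points, boxes))
```

The resulting list keeps the top-to-bottom order of the points, so adjacent entries correspond to vertically adjacent sample positions on the page.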

ＣＰＵ２０は、抽出したエレメントの階層構造をタグ単位に分割して、階層キーを抽出する（ステップＳ２１２）。例えば、図２０の上から３つ目の点Ｐ３を例として、階層キーの抽出を説明すると次のとおりである。３つ目の点Ｐ３は、図１８のＨＴＭＬ文書では、１２行目の「友達なろ〜よ」に対応している。この「友達なろ〜よ」を挟んで記述されている一対のタグ（<body>と</body>のように、制御の開始と終了を示すタグをいう）を探し出し、これを階層キーとしている。 The CPU 20 divides the hierarchical structure of each extracted element into tag units and extracts hierarchy keys (step S212). Taking the third point P3 from the top of FIG. 20 as an example, the hierarchy keys are extracted as follows. In the HTML document of FIG. 18, the third point P3 corresponds to “友達なろ〜よ” (“Let's be friends”) on line 12. The pairs of tags that enclose this text (tags that mark the start and end of control, such as <body> and </body>) are located and used as the hierarchy keys.

「友達なろ〜よ」を挟む一対のタグは、２行目の<body>と下から２行目の</body>、７行目の<form>と１５行目の</form>、８行目の<table>と１４行目の</table>、８行目の<tr>と１４行目の</tr>、８行目の<td>と１４行目の</td>、９行目の<div>と１２行目の</div>である。したがって、body,form,table,tr,td,divが階層キーとして抽出される。図２１に、各エレメントの階層構造をタグ単位で分割して生成した階層キーの例を示す。 The pairs of tags that enclose “友達なろ〜よ” are <body> on line 2 and </body> on the second line from the bottom, <form> on line 7 and </form> on line 15, <table> on line 8 and </table> on line 14, <tr> on line 8 and </tr> on line 14, <td> on line 8 and </td> on line 14, and <div> on line 9 and </div> on line 12. Therefore, body, form, table, tr, td, and div are extracted as hierarchy keys. FIG. 21 shows an example of the hierarchy keys generated by dividing the hierarchical structure of each element into tag units.
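The extraction of hierarchy keys in step S212 can be sketched by tracking the stack of open tags while parsing. The class name `HierarchyKeyRecorder`, the set of void tags, and the simplified HTML string are assumptions; the actual source code of FIG. 18A is not reproduced in the text.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HierarchyKeyRecorder(HTMLParser):
    """Records, for each text node, the list of tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.keys_by_text = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void tags have no closing partner
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close everything up to and including the matching start tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.keys_by_text[text] = list(self.open_tags)


# Simplified, hypothetical stand-in for the HTML of FIG. 18A.
html = (
    "<body><div>BBS</div><form><table><tr><td>"
    "<div>山田太郎<br>友達なろ〜よ</div>"
    "</td></tr></table></form></body>"
)
recorder = HierarchyKeyRecorder()
recorder.feed(html)
recorder.close()
target = "友達なろ〜よ"
print(recorder.keys_by_text[target])
# ['body', 'form', 'table', 'tr', 'td', 'div']
```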

階層キーを抽出した後、隣接する各エレメント間における階層キーの類似度を、図２２に示す式（図１３ｂに示す式と同じ）に基づいて算出する（ステップＳ２１４）。図２２に、類似度算出した結果の例を示す。なお、この実施形態では、式の分母count(Element(wn_i))を、隣接する階層キーの数のうち、大きい方の数としている。 After the hierarchy keys are extracted, the similarity of the hierarchy keys between each pair of adjacent elements is calculated based on the formula shown in FIG. 22 (the same as the formula shown in FIG. 13b) (step S214). FIG. 22 shows an example of the calculated similarities. In this embodiment, the denominator count(Element(wn_i)) of the formula is the larger of the numbers of hierarchy keys of the two adjacent elements.

つぎに、ＣＰＵ２０は、図２３に示すような、類似値ピラミッドをコンピュータ内部において生成する（ステップＳ２１６）。なお、図２３に示す類似値ピラミッドを生成するには、隣接する２つの類似度の平均を計算して上位に積み上げて行けばよい。 Next, the CPU 20 generates a similar value pyramid as shown in FIG. 23 (step S216). In order to generate the similarity value pyramid shown in FIG. 23, it is only necessary to calculate the average of two adjacent degrees of similarity and stack them up.
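The construction of the similarity value pyramid in step S216 can be sketched as repeated averaging of adjacent values. The function name `build_pyramid` and the input values are hypothetical; the values of FIG. 22 and FIG. 23 are not reproduced in the text.

```python
def build_pyramid(similarities):
    """Stack levels upward, each value being the mean of two adjacent values below."""
    levels = [list(similarities)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([(below[i] + below[i + 1]) / 2 for i in range(len(below) - 1)])
    return levels


# Hypothetical similarities between adjacent elements (bottom level).
for level in build_pyramid([1.0, 0.5, 1.0]):
    print(level)
# [1.0, 0.5, 1.0]
# [0.75, 0.75]
# [0.75]
```

Each level has one value fewer than the level below it, so n bottom values produce a pyramid of n levels with a single apex.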

ＣＰＵ２０は、類似値ピラミッドの中から、しきい値以上（例えば、0.7）の類似度を頂点とするピラミッドを探索する（ステップＳ２１８）。このとき、ピラミッドの底辺が重ならないように探索する。例えば、図２４に点線で示すピラミッドが検出された場合、当該ピラミッドの底辺を含まないように、N1またはN2を底辺とするピラミッドについて探索を行う。なお、探索の方向は、図２４において、上位から下位の方向または底辺の左から右の方向もしくは底辺の右から左に行うことができる。なお、類似度ピラミッドの中にしきい値以上の類似度が全く含まれていない場合には、全ての類似度を探索して処理を終了することになる。 The CPU 20 searches the similarity value pyramid for a pyramid having a similarity equal to or higher than a threshold (for example, 0.7) as a vertex (step S218). At this time, search is made so that the bases of the pyramids do not overlap. For example, when a pyramid indicated by a dotted line in FIG. 24 is detected, a search is performed for a pyramid having N1 or N2 as a base so as not to include the bottom of the pyramid. In FIG. 24, the direction of search can be performed from the top to the bottom, from the bottom left to the right, or from the bottom right to the left. If the similarity pyramid does not include any similarity above the threshold value, all similarities are searched and the process is terminated.
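The search of step S218 can be sketched as a scan from the top level downward and from left to right, accepting an apex when its value reaches the threshold and its base does not overlap a base already accepted. The function name, the representation of a base as a range of bottom-level indices, and the example values are assumptions; the text permits other search directions as well.

```python
def find_pyramids(levels, threshold=0.7):
    """Return non-overlapping base ranges (start, end) of qualifying pyramids.

    levels[0] is the bottom level; a value at level k, position i is the apex
    of a pyramid whose base covers bottom positions i .. i + k.
    """
    found = []
    used = set()
    for k in range(len(levels) - 1, -1, -1):      # top level first
        for i, value in enumerate(levels[k]):     # left to right
            base = range(i, i + k + 1)
            if value >= threshold and used.isdisjoint(base):
                found.append((i, i + k))
                used.update(base)
    return sorted(found)


# Hypothetical pyramid built from the bottom level [1.0, 1.0, 0.0, 1.0].
levels = [
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 0.5, 0.5],
    [0.75, 0.5],
    [0.625],
]
print(find_pyramids(levels))  # [(0, 2), (3, 3)]
```

If no value reaches the threshold, the function visits every value and returns an empty list, which corresponds to the case in which the search ends after examining all similarities.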

さらに、探索された各ピラミッドに対して以下の統合処理を実行し、エレメント抽出の対象となる階層構造を特定する（ステップＳ２２０）。図２５を用いて、階層構造を特定する処理の詳細について説明する。 Further, the following integration process is executed on each searched pyramid to identify the hierarchical structure that is the target of element extraction (step S220). Details of the process of specifying the hierarchical structure will be described with reference to FIG.

ＣＰＵ２０は、まず、探索された対象ピラミッドの底辺について、包含関係を判定する（ステップＳ２２２１）。包含関係にあるか否かは、エレメント座標ＤＢ４２に記憶されている各エレメントの座標を比較し、またはエレメント間におけるタグの階層構造の関係を参照して判定することができる。なお、図２４に点線で示すピラミッドの底辺に含まれる異なる階層構造の「/body/form/table/tr/td」と、「/body/form/table/tr/td/div」とは包含関係にある。 The CPU 20 first determines whether a containment relationship exists among the elements at the base of each pyramid found (step S2221). Whether a containment relationship exists can be determined by comparing the coordinates of the elements stored in the element coordinate DB 42 or by referring to the relationship between the tag hierarchies of the elements. Note that the different hierarchical structures “/body/form/table/tr/td” and “/body/form/table/tr/td/div” included in the base of the pyramid indicated by the dotted line in FIG. 24 are in a containment relationship.

エレメントが包含関係にない場合（ステップＳ２２２１のＮｏ）は、各エレメントを抽出対象として特定する（ステップＳ２２２２）。例えば、図２１に示す第１列のエレメント「/body/div」と第２列のエレメント「/body/form/table/tr/td」は包含関係にない。 When the elements are not in a containment relationship (No in step S2221), each element is specified as an extraction target (step S2222). For example, the element “/body/div” in the first column and the element “/body/form/table/tr/td” in the second column shown in FIG. 21 are not in a containment relationship.

エレメントが包含関係にあると判定した場合（ステップＳ２２２１のＹｅｓ）は、各エレメントに含まれるテキストの文字数の差がしきい値（例えば、当該差が、上位階層のエレメントに含まれるテキスト文字数の0.8）以下のとき（ステップＳ２２２３のＹｅｓ）には、包含するエレメントを抽出対象として特定する（ステップＳ２２２４）。すなわち、階層構造がより上位の方（図２６に示すdata2に対応する階層構造）を抽出対象として特定する。これにより、下位の階層構造（図２６に示すdata1に対応）は、上位の階層構造（図２６に示すdata2に対応）に統合されることになる。 When the elements are determined to be in a containment relationship (Yes in step S2221) and the difference between the numbers of text characters contained in the elements is equal to or less than a threshold (for example, 0.8 times the number of text characters contained in the upper-level element) (Yes in step S2223), the containing element is specified as the extraction target (step S2224). That is, the hierarchical structure at the higher level (the hierarchical structure corresponding to data2 shown in FIG. 26) is specified as the extraction target. As a result, the lower hierarchical structure (corresponding to data1 shown in FIG. 26) is integrated into the upper hierarchical structure (corresponding to data2 shown in FIG. 26).

一方、各エレメント内のテキスト差分がしきい値を超えるとき（ステップＳ２２２３のＮｏ）には、包含されるエレメントを抽出対象として特定する（ステップＳ２２２５）。すなわち、階層構造がより下位の方を抽出対象として特定する。 On the other hand, when the text difference between the elements exceeds the threshold (No in step S2223), the contained element is specified as the extraction target (step S2225). That is, the hierarchical structure at the lower level is specified as the extraction target.
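The integration rule of steps S2221 to S2225 can be sketched as below. Containment is judged here only from the tag hierarchy (a path prefix test), which is one of the two options mentioned in the text; the function names, the `(path, text)` representation, and the sample texts are assumptions.

```python
def contains(upper_path, lower_path):
    """True when lower_path lies strictly below upper_path in the tag hierarchy."""
    return lower_path.startswith(upper_path.rstrip("/") + "/")


def select_targets(elem_a, elem_b, ratio=0.8):
    """elem_a and elem_b are (path, text) pairs; returns the paths to extract."""
    if contains(elem_a[0], elem_b[0]):
        upper, lower = elem_a, elem_b
    elif contains(elem_b[0], elem_a[0]):
        upper, lower = elem_b, elem_a
    else:
        return [elem_a[0], elem_b[0]]   # step S2222: no containment, keep both
    diff = len(upper[1]) - len(lower[1])
    if diff <= ratio * len(upper[1]):
        return [upper[0]]               # step S2224: integrate into the upper level
    return [lower[0]]                   # step S2225: keep the lower level


td = ("/body/form/table/tr/td", "x" * 10)        # hypothetical text lengths
div = ("/body/form/table/tr/td/div", "x" * 9)
print(select_targets(td, div))                   # ['/body/form/table/tr/td']
page = ("/body", "x" * 100)
print(select_targets(page, ("/body/form/table/tr/td/div", "x" * 10)))
# ['/body/form/table/tr/td/div']
```

When the upper element holds little text beyond that of the lower element, the two are treated as one unit at the upper level; when it holds far more, it is taken to be a container of several main elements and the lower level is kept.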

例えば、図２７に示す階層構造「/body/div」(data3に対応)と「/body/form/table/tr/td/div」(data1に対応)が抽出された場合は、テキスト差分がしきい値を超えるので、階層構造がより下位の方（図２６のdata1に対応）を抽出対象として特定する。この場合、上位の階層は、主要エレメント自体ではなく、主要エレメントを複数集めた上位エレメントである可能性が高いからである。これにより、上位の階層構造（図２７に示すdata3に対応）は、下位の階層構造（図２６のdata1に対応）に統合されることになる。 For example, when the hierarchical structures “/body/div” (corresponding to data3) and “/body/form/table/tr/td/div” (corresponding to data1) shown in FIG. 27 are extracted, the text difference exceeds the threshold, so the hierarchical structure at the lower level (corresponding to data1 in FIG. 26) is specified as the extraction target. This is because, in this case, the upper level is likely to be not a main element itself but a parent element that gathers several main elements. As a result, the upper hierarchical structure (corresponding to data3 shown in FIG. 27) is integrated into the lower hierarchical structure (corresponding to data1 in FIG. 26).

以上により、抽出する対象の区分となるタグが特定される。その上で、ＣＰＵ２０は、抽出対象の階層構造がＷｅｂページに複数含まれるか否かを判定し、該当する階層構造の内容データを当該ＷｅｂページのＨＴＭＬ文書から抽出する（ステップＳ２２６）。なお、ステップＳ２１８において複数のピラミッドが探索された場合には、各ピラミッド内で統合された階層構造が同じであれば、同じものだけを抽出対象とすればよく、階層構造が異なる場合には、それぞれについてデータを抽出すればよい。 As described above, the tag to be extracted is identified. Then, the CPU 20 determines whether or not a plurality of hierarchical structures to be extracted are included in the Web page, and extracts content data of the corresponding hierarchical structure from the HTML document of the Web page (step S226). In addition, when a plurality of pyramids are searched in step S218, if the hierarchical structure integrated in each pyramid is the same, only the same thing needs to be extracted, and when the hierarchical structures are different, What is necessary is just to extract data about each.

ＨＴＭＬ文書から対応する階層構造の内容データを抽出した上で、さらに、ＣＰＵ２０は、抽出した内容データの属性をＡタグなどで判別し（ステップＳ２２８）、抽出したデータと共にその属性を記憶する（ステップＳ２３０）。 After extracting the content data of the corresponding hierarchical structure from the HTML document, the CPU 20 further discriminates the attribute of the extracted content data by A tag or the like (step S228), and stores the attribute together with the extracted data (step S228). S230).

図２８は、最終的に抽出されたデータ例およびその属性を示す図である。図２８に示す例では、図２６に示すdata1から抽出された、タグ以外のテキストデータ「山田太郎」「プロフみたよ！」「友達なろ〜よ」を内容データとして記憶している。なお、タグを含めた図２６に示すdata1の全てを、内容データとして記憶してもよい。 FIG. 28 shows an example of the finally extracted data and its attributes. In the example shown in FIG. 28, the non-tag text data “山田太郎” (Taro Yamada), “プロフみたよ！” (“I saw your profile!”), and “友達なろ〜よ” (“Let's be friends”) extracted from data1 shown in FIG. 26 are stored as content data. Alternatively, all of data1 shown in FIG. 26, including the tags, may be stored as content data.

In the example shown in FIG. 28, the URL "http://pr.cccboy.com/0123456", obtained from the A-tag portion data4 contained in data1 shown in FIG. 26, is stored in association with the data as its attribute. Consequently, sorting by the URL "http://pr.cccboy.com/0123456" makes it possible to obtain only the data associated with Taro Yamada's top-page URL.

The attribute is not limited to the URL of an A tag as described above; an attribute obtained by natural-language matching may also be stored in association with the element. Matching by natural language processing is, for example, a technique in which a dictionary for determining an attribute (such as a dictionary of women's vocabulary) is prepared in advance and whether to assign the attribute is decided according to the degree of matching (reference: "Research on Extracting Organizational Relationships Using Web Link Structure Analysis and Natural Language Processing", Transactions of Information Processing Society of Japan, June 2006 issue). Examples of element attributes include the name of the element's creator, the creation date, the creator's gender, age, and affiliated organization, as well as the topic described in the element (for example, one concerning a specific product or service).
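Dictionary-based attribute assignment of this kind might look like the sketch below. The degree of matching is assumed here to be the fraction of dictionary terms found in the element text, and the function names, the threshold value, and the sample dictionaries are hypothetical; the cited reference may define the matching degree differently.

```python
def match_degree(text: str, terms: set[str]) -> float:
    """Fraction of dictionary terms that occur in the text (assumed metric)."""
    if not terms:
        return 0.0
    hits = sum(1 for term in terms if term in text)
    return hits / len(terms)


def assign_attributes(text: str, dictionaries: dict[str, set[str]],
                      threshold: float = 0.3) -> list[str]:
    """Return the attributes whose dictionary matches the text well enough."""
    return [
        attribute
        for attribute, terms in dictionaries.items()
        if match_degree(text, terms) >= threshold
    ]


# Hypothetical dictionaries, prepared in advance for each attribute.
dictionaries = {
    "female": {"atashi", "kawaii", "nail"},
    "sports": {"soccer", "baseball"},
}
attributes = assign_attributes("atashi got a kawaii nail set", dictionaries)
```

The resulting attribute list can then be stored alongside the element in the same way as the A-tag URL.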

As described above, an area with a high likelihood of being an article portion is identified from the Web page based on the display position information of each tag (element) and on the characteristics that content data such as articles (i) appears repeatedly within the Web page, (ii) occupies most of the Web page, and (iii) is located in the central portion. An extraction rule for the article portion is then generated by examining the similarity between the hierarchical structures (XPATH) of preceding and following extracted elements, and the article portion can be extracted automatically based on that rule.
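As a rough illustration of comparing the hierarchical structures of neighboring elements, the sketch below splits each XPATH into tag keys and counts the keys that match. The position-by-position comparison from the front and the normalization by the longer key list are assumptions made for this example; the function names are hypothetical.

```python
def hierarchy_keys(xpath: str) -> list[str]:
    """Split a hierarchical structure into tag keys."""
    return [key for key in xpath.split("/") if key]


def hierarchy_similarity(xpath_a: str, xpath_b: str) -> float:
    """Ratio of position-wise matching keys to the longer key count.

    Assumption: keys are compared in order from the front, and the count
    is normalized by the longer of the two key lists.
    """
    keys_a, keys_b = hierarchy_keys(xpath_a), hierarchy_keys(xpath_b)
    longest = max(len(keys_a), len(keys_b))
    if longest == 0:
        return 0.0
    matches = sum(1 for a, b in zip(keys_a, keys_b) if a == b)
    return matches / longest
```

Under these assumptions, two adjacent posts that share the same path score 1.0, while "/body/div" compared with "/body/form/table/tr/td/div" scores 1/6, since only the leading "body" key lines up.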

2-4. Other Embodiments
In the above embodiment, the case of extracting posted data from a bulletin board page was described as an example. However, the invention is not limited to this, and may be used to extract other elements (such as news articles).

In the above embodiment, only one virtual line on which points are placed is set, at the center (FIG. 20). However, the invention is not limited to this, and points may be placed on a plurality of virtual lines. For example, in the point arrangement shown in FIG. 29, five straight lines are used: a first straight line is set midway between the maximum and minimum x-coordinates of the elements contained in the displayed Web page, and two further straight lines are set on each side of the first line, offset from it at equal intervals of a predetermined width. In this case, in step S210 shown in FIG. 17, the element that contains the largest number of the five points plotted at the same y-coordinate may be extracted.
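The five-line arrangement can be sketched as follows. The `Box` type and function names are hypothetical, and one point needs interpretation: because an outer element such as the page body contains every point, this sketch first takes the innermost (smallest-area) element under each point and then picks the element chosen by the most points at that y-coordinate. That tie-breaking rule is an assumption, not something the embodiment states.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Box(NamedTuple):
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    @property
    def area(self) -> float:
        return (self.right - self.left) * (self.bottom - self.top)


def line_positions(boxes: list[Box], offset: float, per_side: int = 2) -> list[float]:
    """x-coordinates of the virtual lines: a center line plus offsets on each side."""
    center = (min(b.left for b in boxes) + max(b.right for b in boxes)) / 2
    return [center + k * offset for k in range(-per_side, per_side + 1)]


def select_element(boxes: list[Box], xs: list[float], y: float) -> Optional[str]:
    """Element that is innermost under the most points plotted at height y."""
    votes = Counter()
    for x in xs:
        hits = [b for b in boxes if b.contains(x, y)]
        if hits:
            votes[min(hits, key=lambda b: b.area).name] += 1
    return votes.most_common(1)[0][0] if votes else None
```

For a page 1000 pixels wide with an offset of 180, the lines fall at x = 140, 320, 500, 680, and 860; a post spanning x = 100 to 800 would collect four of the five points at its height and be selected over a narrow sidebar.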

In the above embodiment, the points are placed at equal intervals along the virtual line (FIG. 20). However, the invention is not limited to this, and the points may be placed at differing intervals.

In the above embodiment, the similarity pyramid is generated in advance in step S216. However, the invention is not limited to this, and the hierarchical structures may be searched without generating the similarity pyramid beforehand.

In the above embodiment, the similarity pyramid is generated in step S216, but the hierarchical structures may instead be searched without generating a similarity pyramid. For example, when hierarchical structures with the same similarity are not adjacent but are separated by one intervening structure, the intervening hierarchical structure may be integrated into the hierarchical structures on both sides.

In the above embodiment, the case where the element to be extracted is text was described, but the element to be extracted may be an image, a graph, or a video.

In the above embodiment, the HTML document is rendered internally using a virtual browser, but the processing may instead be performed while the document is actually displayed on a screen.

3. Application Examples for Data Analysis
(i) A separation plane is generated by sampling, using a Support Vector Machine (SVM), from the words extracted from data that has been specified as a personal area or divided. The harmfulness of a target user can then be calculated by computing the distance from that separation plane for the words used on the target user's pages.

"Harmfulness" is an index that quantifies how adverse an influence the target person has on other users. The description below is divided into three stages: (a) advance preparation, (b) harmful/harmless determination, and (c) calculation of harmfulness.

(a) Advance preparation
First, persons considered to be typical harmful users and typical harmless users are selected from among the users present on the network. Frequently occurring words in the profiles and diaries of the plurality of harmful users are then extracted. Likewise, frequently occurring words in the profiles and diaries of the plurality of harmless users are extracted. These harmful and harmless users are then plotted in a multidimensional space (shown in FIG. 30A) whose axes are the occurrence counts of all the frequent words, with harmful users and harmless users plotted so as to be distinguishable. In this multidimensional space, a separation plane is determined using the SVM technique.

(b) Harmful/harmless determination
Next, the words contained in the target person's profile and diary are counted, and the target person is plotted in the multidimensional space. Whether the user is harmful or harmless can be determined by which side of the separation plane the plotted target person (indicated by △) falls on. In the embodiment, not every target person plotted on the harmful side is judged harmful; a target person is judged harmful only when the distance from the separation plane is equal to or greater than a threshold. When the distance from the separation plane is within the threshold, the target person is treated as harmless.

(c) Calculation of harmfulness
Finally, for a target person judged to be harmful, the harmfulness is calculated from the following formula.

Harmfulness = "distance from the separation plane" × "number of delinquency-dictionary word occurrences"
Here, the delinquency dictionary is a dictionary in which words considered to be related to delinquency are registered in advance. The number of delinquency words in the target person's profile and diary is counted.
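Stages (b) and (c) can be sketched in a few lines, assuming the separation plane from stage (a) is already available as a weight per word and a bias. The function names, the convention that the positive side of the plane is the harmful side, and the choice to return `None` for users treated as harmless are all assumptions of this example; training the SVM itself is outside the sketch.

```python
import math
from typing import Optional


def signed_distance(weights: dict[str, float], bias: float,
                    counts: dict[str, int]) -> float:
    """Signed distance of a word-count vector from the plane w.x + b = 0."""
    dot = sum(weights.get(word, 0.0) * n for word, n in counts.items())
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return (dot + bias) / norm


def harmfulness(weights: dict[str, float], bias: float, counts: dict[str, int],
                delinquency_dictionary: set[str], threshold: float) -> Optional[float]:
    """Distance from the plane times the delinquency-word count.

    Returns None when the user is treated as harmless, i.e. when the
    distance on the (assumed positive) harmful side is below the threshold.
    """
    distance = signed_distance(weights, bias, counts)
    if distance < threshold:
        return None
    delinquent = sum(n for word, n in counts.items()
                     if word in delinquency_dictionary)
    return distance * delinquent
```

For instance, with hypothetical weights {"word_a": 3.0, "word_b": 4.0}, a bias of -5.0, and counts of 5 for each of those words, the distance is 6.0; if only "word_a" is in the delinquency dictionary, the harmfulness is 6.0 × 5 = 30.0.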

(ii) Furthermore, when the target user's personal relationships are updated, or when a new user is discovered, individuals can be tracked in real time by updating the risk level for the data specified or divided as a personal area. The "risk level" is an index that quantifies the degree to which the target person is exposed to danger from other users.

The risk level of each user can be calculated based on, for example, the harmfulness described above. Specifically, the risk level of a target person is obtained by multiplying the harmfulness of each linked user by the familiarity with that user, and summing these values over all users linked directly or indirectly. The familiarity between users can be calculated from link relationships (FIG. 30B) or from relationships found by natural language processing.

The risk level of target person A in FIG. 30B is obtained, for example, as the sum (harmfulness of B × familiarity between A and B) + (harmfulness of D × familiarity between A and D). The relationship with target person C, whose familiarity is low, is not taken into account.
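The summation for target person A can be written out as a short sketch. The function name, the `min_familiarity` cutoff used to ignore weakly connected users such as C, and all numeric values are hypothetical; the sketch also takes the familiarity values as given rather than deriving them from link relationships.

```python
def risk_level(harmfulness_by_user: dict[str, float],
               familiarity_with_target: dict[str, float],
               min_familiarity: float = 0.0) -> float:
    """Sum of (harmfulness x familiarity) over sufficiently close linked users."""
    return sum(
        harmfulness_by_user.get(user, 0.0) * familiarity
        for user, familiarity in familiarity_with_target.items()
        if familiarity >= min_familiarity
    )


# Hypothetical values for users linked to target person A.
harm = {"B": 10.0, "C": 50.0, "D": 4.0}
familiarity_to_a = {"B": 0.8, "C": 0.1, "D": 0.5}
risk_a = risk_level(harm, familiarity_to_a, min_familiarity=0.2)
```

Here C is skipped because its familiarity falls below the cutoff, so the result is 10.0 × 0.8 + 4.0 × 0.5 = 10.0.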

(iii) In addition, the present invention can also be used in fields such as searching by date or by user in a search system, filtering (filtering that targets only articles judged to be harmful), and efficient collection of word-of-mouth articles.

Claims

A grouping program for grouping Web page addresses, the program causing a computer to function as:
Web page acquisition means for acquiring a Web page from a specific address;
First address key generating means for extracting links from the Web page and dividing the address of each extracted link by a delimiter character to generate a first address key;
Second address key generating means for extracting links from a Web page acquired from a link for which the first address key was generated, and dividing the address of each extracted link by a delimiter character to generate a second address key;
Similarity calculation means for associating matching keys between the first address key and the second address key, counting the number of combinations in which the matching keys appear in the same order, and calculating a similarity based on the result; and
Grouping means for associating, as a specific address group, the addresses of links whose similarity is determined to be equal to or higher than a threshold.

The grouping program of claim 1,
The similarity calculation means:
Determines, starting from the front, whether a key matching one of the keys constituting the first address key exists among the keys constituting the second address key;
If a corresponding key exists among the keys constituting the second address key, counts the number of matching key combinations, proceeding toward the rear from the key following the key detected as matching in each of the first and second address keys; and
Calculates, as the similarity, the ratio of the number of key combinations associated with the second address key to the total number of keys constituting the first address key,
A grouping program characterized by that.

The grouping program according to claim 1 or 2, further comprising:
Grouping search means for associating, as a specific address group, addresses of links whose similarity is equal to or greater than a threshold value from the specific address to a predetermined number of link layers,
A link is extracted from the Web page acquired from the address associated as a specific address group by the grouping means, and a third address key is generated by dividing the address of each extracted link by a delimiter character, Grouping search means for counting the number of matching address keys between one address key and the third address key, and calculating the similarity based on the result,
A grouping program characterized by comprising:

In the grouping program in any one of Claims 1-3,
The first address key generating means or the second address key generating means deletes, at least before generating the first address key or the second address key, any address determined to be the specific address with only the identifier that identifies a user replaced,
A grouping program characterized by that.

In the grouping program in any one of Claims 1-4,
The first address key generation means or the second address key generation means deletes, at least before generating the first address key or the second address key, any address determined to match an address registered as a deletion target,
A grouping program characterized by that.

In the grouping program in any one of Claims 1-5, Furthermore,
Address storage means for automatically storing the specific address every predetermined time by accessing a server;
A grouping program characterized by comprising:

A grouping device for grouping web page addresses,
Web page acquisition means for acquiring a Web page from a specific address;
First address key generating means for extracting a link from the web page and dividing the address of each extracted link by a delimiter to generate a first address key;
Second address key generating means for extracting a link from a Web page corresponding to the link that generated the first address key, and generating a second address key by dividing the address of each extracted link by a delimiter character; ,
Matching the matching keys between the first address key and the second address key, counting the number of combinations in which the matching keys appear in the same order, and calculating the similarity based on the result Similarity calculation means,
Grouping means for associating addresses of links whose similarity is equal to or greater than a threshold as a specific address group;
A grouping device characterized by comprising:

A grouping method for grouping addresses of Web pages by a computer, wherein the computer
Get a web page from a specific address,
Extracting a link from the web page, dividing the address of each extracted link with a delimiter to generate a first address key;
Extracting a link from the web page corresponding to the link that generated the first address key, dividing the address of each extracted link with a delimiter to generate a second address key;
Matching the matching keys between the first address key and the second address key, counting the number of combinations in which the matching keys appear in the same order, and calculating the similarity based on the result And
Associating addresses of links whose similarity is equal to or greater than a threshold as a specific address group;
A grouping method characterized by

An element extraction program for extracting a predetermined element from a Web page, the program causing a computer to function as:
Web page expansion means for expanding the Web page in a display area;
Coordinate acquisition means for acquiring coordinates that specify the display range of each element;
Element selection means for arranging a plurality of points on the display area in the direction in which the elements are arranged, and selecting elements whose display ranges include the arranged points;
Element arrangement means for arranging the hierarchical structures of the selected elements in order;
Hierarchy key generation means for generating hierarchy keys by dividing the hierarchical structure of each element in units of tags;
Similarity calculation means for counting the number of matching hierarchy keys between adjacent elements and calculating a similarity based on the result; and
Content data acquisition means for specifying the hierarchical structure of two or more adjacent elements based on the similarity and acquiring content data corresponding to the hierarchical structure from the Web page.

In the element extraction program of Claim 9,
the program further causing the computer to function as:
Content data acquisition means for calculating averages of the similarities of the hierarchical structures of adjacent elements in the upward direction to generate a similarity pyramid, detecting whether or not each similarity is equal to or greater than a threshold value, specifying a hierarchical structure that matches a predetermined rule among the hierarchical structures of the elements included in the base of the detected similarity, and acquiring content data corresponding to the hierarchical structure from a Web page,
An element extraction program characterized by that.

In the element extraction program of Claim 9 or Claim 10,
The content data acquisition means includes
Of the elements included in the bottom of the similarity pyramid, determine whether adjacent elements are inclusive,
If adjacent elements are not in an inclusive relationship, get the corresponding content data for each hierarchical structure,
An element extraction program characterized by that.

In the element extraction program in any one of Claims 9-11,
The element extraction means includes
Of the elements included in the bottom of the similarity pyramid, determine whether the hierarchical structure of adjacent elements is inclusive,
When adjacent elements are in an inclusive relationship, if the text difference is less than or equal to the threshold value, the included lower hierarchical structure is deleted, and content data corresponding to the upper hierarchical structure included is acquired,
When adjacent elements are in an inclusion relationship, if the text difference exceeds a threshold value, the upper hierarchical structure to be included is deleted, and content data corresponding to the lower hierarchical structure to be included is acquired.
An element extraction program characterized by that.

In the element extraction program in any one of Claims 9-12,
The element selection means arranges a plurality of points at equal intervals in a predetermined direction on the display area, and selects an element including the arranged points in a display range.
An element extraction program characterized by that.

In the element extraction program in any one of Claims 9-13,
The element selecting means arranges a plurality of points on a straight line perpendicular to the predetermined direction on the display area, and selects an element that includes the most points arranged on the same straight line in the display range.
An element extraction program characterized by that.

In the element extraction program in any one of Claims 9-14,
The element extraction means stores the attribute obtained by matching the URL of the A tag included in the element or the natural language in association with the element,
An element extraction program characterized by that.

An element extraction device for extracting a predetermined element from a web page,
Web page expansion means for expanding the Web page in the display area;
Coordinate acquisition means for acquiring coordinates for specifying the display range of each element;
An element selection means for arranging a plurality of points in the arrangement direction of the elements on the display area and selecting an element including the arranged points in the display range;
Element arrangement means for arranging the hierarchical structure of the selected elements in order;
Hierarchy key generation means for generating a hierarchy key by dividing each hierarchical structure of the element in units of tags,
Similarity calculation means for counting the number of matching hierarchical keys between adjacent elements and calculating the similarity based on the result.
Content data acquisition means for specifying a hierarchical structure of two or more adjacent elements based on the similarity and acquiring content data corresponding to the hierarchical structure from a Web page;
An element extraction device characterized by comprising:

An element extraction method for extracting a predetermined element from a Web page by a computer, wherein the computer
Expand the web page in the display area,
Get the coordinates that specify the display range of each element,
A plurality of points are arranged in the element arrangement direction on the display area, and an element including the arranged points in the display range is selected.
Arrange the hierarchical structure of the selected elements in order,
A hierarchical key is generated by dividing each hierarchical structure of the element in units of tags,
Count the number of matching hierarchical keys between adjacent elements, calculate the similarity based on the result,
Identifying a hierarchical structure of two or more adjacent elements based on the similarity, and acquiring content data corresponding to the hierarchical structure from a Web page;
Element extraction method characterized by