JP2009258923A

JP2009258923A - Information space search apparatus and program

Info

Publication number: JP2009258923A
Application number: JP2008106160A
Authority: JP
Inventors: Minako Izawa; 味奈子井沢; Shuichi Nakawatase; 秀一中渡瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-04-15
Filing date: 2008-04-15
Publication date: 2009-11-05

Abstract

[Problem] To specify collection conditions in advance and collect the Web pages that meet those conditions from an information space in a short time.
[Solution] The invention has collection page discrimination rule creating means for creating a collection page discrimination rule, which is the criterion for deciding whether to collect a reached page, and end condition setting means for setting an end condition for collection. It collects Web pages based on a starting URL list, the end condition, and the collection page discrimination rule, and repeats the process of storing them in Web page storage means until the end condition is reached.
[Selected drawing] FIG. 1

Description

The present invention relates to an information space search apparatus and program, and more particularly to an information space search apparatus and program for efficiently collecting Web pages according to a given purpose in an information space having a network structure.

Known methods for searching a network-structured information space include, besides collecting every designated location, exhaustively searching the space by repeatedly accessing, in sequence, the information resources linked from a predesignated starting point (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent No. 3282089

However, information spaces such as the WWW contain an enormous volume of resources. When information on the WWW is collected with the method of Patent Document 1, every linked page is collected, so many pages other than those carrying the required information are collected as well.

As a result, collection takes a great deal of time, and updates to the stored information for the pages that do carry the required information are delayed.

The present invention has been made in view of the above, and its object is to provide an information space search apparatus and program capable of collecting Web pages from an information space in a short time.

FIG. 1 is a principle configuration diagram of the present invention.

The present invention (claim 1) is an information space search apparatus that collects Web pages from an information space having a network structure, comprising:
・ starting point list storage means 101 for storing a starting point (starting URL list) representing where collection begins;
・ rule storage means 102 for storing a collection page discrimination rule, which is the criterion for deciding whether to collect a reached page;
・ Web page storage means 103 for storing the collected Web pages;
・ collection page discrimination rule creating means 104 for creating the collection page discrimination rule and storing it in the rule storage means 102;
・ end condition setting means 105 for setting an end condition for collection; and
・ Web page collecting means 106 for collecting Web pages based on the starting URL list in the starting point list storage means 101, the end condition, and the collection page discrimination rule in the rule storage means 102, and repeating the process of storing them in the Web page storage means 103 until the end condition is reached.

Further, in the present invention (claim 2), the Web page collecting means 106 includes:
・ means for acquiring the starting URL list from the starting point list storage means 101, acquiring the Web page at a URL in the starting URL list, and, when that Web page satisfies the collection page discrimination rule in the rule storage means 102, acquiring the Web page and the URLs it links to; and
・ means for acquiring the Web page at the first of the link destination URLs and collecting that Web page when it satisfies the collection page discrimination rule.

Further, in the present invention (claim 3), the collection page discrimination rule creating means 104 sets, as the collection page discrimination rule, any or all of the following:
・ for URLs: similarly written URLs, same domain;
・ for words used in the Web document: the presence or absence of specific keywords, bias in the words used;
・ for tags: the content of alt attributes, the number of img tags.

Further, in the present invention (claim 4), the end condition setting means 105 sets, as the end condition, any of the following:
・ a condition based on the starting point;
・ a condition based on time;
・ a condition based on data volume.

The present invention (claim 5) is an information space search program for causing a computer to function as each of the means constituting the information space search apparatus according to any one of claims 1 to 4.

As described above, the present invention collects only specific pages based on the created rule, so collection completes in less time than with the conventional method. The crawl cycle can therefore be shortened, which makes it easier to keep the latest information for those specific Web pages.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 2 shows the configuration of an information space search apparatus according to an embodiment of the present invention.

The information space search apparatus shown in the figure consists of: a starting point list storage unit 101 that stores a starting point (starting URL list) representing where collection begins; a rule storage unit 102 that stores the rule used as the criterion for deciding whether to collect a reached page; a Web page storage unit 103 that stores the collected Web pages; a collection page discrimination rule creation unit 104 that creates the collection page discrimination rule; an end condition setting unit 105 that sets the end condition for collection; a Web page collection unit 106 that collects Web pages based on the starting point list, the end condition, and the discrimination rule; and an index assignment unit 107 that assigns to each collected page an index serving as a key for search.

The operation of the above configuration is described below.

FIG. 3 is a flowchart of the outline operation in an embodiment of the present invention.

Step 100) First, the starting URL list that serves as the starting point of collection is set in the starting point list storage unit 101. This setting is specified externally. The starting point may be a single URL or, as shown in FIG. 4, a list of several. Concrete ways of creating it include using the URL list of search results returned for a query entered into a search engine, or using URLs published in books.
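As a minimal sketch of step 100, the following Python function builds a starting URL list from externally supplied candidates (for example, search-result URLs). The function name `build_start_list`, the default `http` scheme for bare hosts, and the decision to drop query strings when de-duplicating are all assumptions made for illustration; the source does not specify any normalization.

```python
from urllib.parse import urlsplit


def build_start_list(raw_urls):
    """Normalize and de-duplicate candidate starting URLs, preserving order."""
    seen = set()
    start_list = []
    for raw in raw_urls:
        url = raw.strip()
        if not url:
            continue  # ignore blank entries
        if "://" not in url:
            url = "http://" + url  # assumption: bare hosts default to http
        parts = urlsplit(url)
        # Assumption: scheme, host and path identify a starting point;
        # query strings and fragments are dropped.
        key = (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/")
        if key not in seen:
            seen.add(key)
            start_list.append(f"{key[0]}://{key[1]}{key[2]}")
    return start_list
```

Calling it with a single entry yields a one-element list, which matches the statement that the starting point may be single or multiple.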

Step 200) The end condition setting unit 105 sets the end condition for collecting information. Possible end conditions include:
・ a condition based on the starting point;
・ a condition based on time;
・ a condition based on data volume.
Concretely, a starting-point-based condition might be "collect the pages in the starting URL list and the linked pages up to n hops from each of them", a time-based condition might be "collect for 30 minutes from the start", and a data-volume-based condition might be "collect until the collected data reaches 1 GB".
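The three kinds of end condition can be sketched as a small Python data class. The class name `EndCondition`, its field names, and the choice to treat unset fields as "no limit" are assumptions for illustration, not part of the source.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndCondition:
    max_hops: Optional[int] = None       # starting-point based: n hops from each starting URL
    max_seconds: Optional[float] = None  # time based, e.g. 30 * 60
    max_bytes: Optional[int] = None      # data-volume based, e.g. 10**9 for 1 GB

    def hop_allowed(self, hops: int) -> bool:
        """True if a page `hops` links away from a starting URL may be visited."""
        return self.max_hops is None or hops <= self.max_hops

    def reached(self, started_at: float, collected_bytes: int,
                now: Optional[float] = None) -> bool:
        """True once the time limit or the data-volume limit has been hit."""
        now = time.monotonic() if now is None else now
        if self.max_seconds is not None and now - started_at >= self.max_seconds:
            return True
        if self.max_bytes is not None and collected_bytes >= self.max_bytes:
            return True
        return False
```

A collector would record `time.monotonic()` when it starts, call `reached(...)` before each fetch, and call `hop_allowed(...)` before following a link.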

Step 300) The collection page discrimination rule creation unit 104 creates a rule for determining whether a reached page is a collection target. Items to which a rule can be applied include:
・ the URL;
・ the words used in the Web document;
・ the tags.

Concrete examples of rules: for the URL, similarly written URLs or the same domain; for the words used in the Web document, the presence or absence of a specific keyword or bias in the words used; for tags, the alt text of A tags or the number of img tags.
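Some of these rule types can be sketched in Python with the standard library alone. The helper names (`same_domain`, `has_keyword`, `tag_stats`) are hypothetical. The keyword check is a plain substring test on the raw HTML, and alt text is gathered from any tag that carries an `alt` attribute; both are simplifying assumptions. The "similarly written URL" and "word bias" rules are not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _TagStats(HTMLParser):
    """Counts img tags and gathers alt text while parsing."""

    def __init__(self):
        super().__init__()
        self.img_count = 0
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.img_count += 1
        alt = dict(attrs).get("alt")
        if alt:
            self.alt_texts.append(alt)


def same_domain(url, start_url):
    """URL rule: the page is on the same host as the starting URL."""
    return urlsplit(url).hostname == urlsplit(start_url).hostname


def has_keyword(html, keyword):
    """Word rule: the keyword appears somewhere in the document."""
    return keyword in html


def tag_stats(html):
    """Tag rule input: (number of img tags, list of alt texts)."""
    parser = _TagStats()
    parser.feed(html)
    parser.close()
    return parser.img_count, parser.alt_texts
```

A concrete rule is then a predicate built from these pieces, for example "same domain as the starting URL and contains the keyword".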

Step 400) The Web page collection unit 106 collects Web pages based on the starting point (starting point list), the end condition, and the collection page discrimination rule described above. The collected pages are passed to the index assignment unit 107.

Step 500) The index assignment unit 107 assigns an index, which serves as a key for search, to the data passed from the Web page collection unit 106, and stores the result in the Web page storage unit 103.
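The source does not say what form the index takes. As one possible illustration, the sketch below builds a simple inverted index (word to URLs) over collected pages; the function name `build_index`, the lowercase `\w+` tokenization, and the assumption that pages arrive as plain text are all choices made for this example only.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a word -> sorted list of URLs mapping.

    `pages` maps each collected URL to its plain text.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    # Convert to plain dict with sorted lists so lookups of unknown
    # words do not silently create entries.
    return {word: sorted(urls) for word, urls in index.items()}
```

Looking up a word in the returned dictionary gives the collected pages that contain it, which is the "key for search" role described in step 500.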

Next, the specific operation of the Web page collection unit 106 in step 400 is described.

FIG. 5 is a sequence chart of the detailed operation of the Web page collection unit in an embodiment of the present invention. FIG. 6 shows an example of a starting point list, end condition, and collection discrimination rule in an embodiment of the present invention.

First, the Web page collection unit 106 reads the starting URL list (FIG. 6(a)) stored in the starting point list storage unit 101 (step 401), acquires the page at the first entry (123.com/) (step 402), and, referring to the rule storage unit 102, checks whether the page satisfies the collection page discrimination rule (a page containing the word "○○○") (step 403). If it does (step 403, Yes), the unit collects the page (step 404) and acquires all link destination URLs in the page (step 405). If it does not, the page itself may or may not be collected, depending on the purpose.

For each acquired link destination page (step 406), the unit likewise refers to the rule storage unit 102 and checks whether the page satisfies the collection page discrimination rule (step 407). If it does (step 407, Yes), the unit collects the page and proceeds to step 411 (step 408).

On the other hand, if the page does not satisfy the collection discrimination rule in step 403, then in accordance with the collection page discrimination rule of FIG. 6(c) (link destinations of pages that do not contain the word "○○○" are not collected), the unit determines whether the starting URL list has a next URL (step 409). If it does (step 409, Yes), the unit moves to the next URL in the starting URL list and proceeds to step 403 (step 410). If there is no next URL (step 409, No), the collection process ends.

If the page does not satisfy the collection discrimination rule in step 407 above, or after the page has been collected in step 408, the unit determines whether there are other linked pages (step 411). If there are, it moves to the next link destination URL (step 413). If there are none, it proceeds to step 409 and determines whether the starting URL list has a next URL.

The same processing is applied to every link destination page. When all link destination pages have been checked, the unit returns to the starting point (starting point list) and applies the collection discrimination rule to the next URL.
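The flow of steps 401 to 413 can be sketched as a short Python function. This is an illustration under stated assumptions, not the patented implementation: `collect`, `fetch`, `matches`, and `extract_links` are hypothetical names, the three callables are supplied by the caller, only one hop from each starting URL is followed (as in the walkthrough above), a starting page that fails the rule is not collected (the text leaves that choice open), and end-condition checks are omitted for brevity.

```python
def collect(start_urls, fetch, matches, extract_links):
    """Collect matching pages one hop from each starting URL.

    fetch(url) returns a page object or None, matches(page) returns bool,
    and extract_links(page) returns a list of URLs.
    """
    collected = {}
    for start in start_urls:                    # steps 409-410: next starting URL
        page = fetch(start)                     # step 402
        if page is None or not matches(page):   # step 403
            continue                            # links of a non-matching page are skipped
        collected[start] = page                 # step 404
        for link in extract_links(page):        # steps 405-406, 411-413
            if link in collected:
                continue                        # already stored, no need to refetch
            linked = fetch(link)
            if linked is not None and matches(linked):  # step 407
                collected[link] = linked        # step 408
    return collected
```

Because fetching and rule evaluation are passed in as functions, the same loop can be exercised against an in-memory dictionary of pages before it is pointed at a real HTTP client.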

The operations of the components of the information space search apparatus described above can be built as a program, which can be installed and executed on a computer used as the information space search apparatus, or distributed over a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for collecting Web pages.

FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a configuration diagram of the information space search apparatus in an embodiment of the present invention.
FIG. 3 is a flowchart of the outline operation in an embodiment of the present invention.
FIG. 4 is an example of a starting point list in an embodiment of the present invention.
FIG. 5 is a flowchart of the detailed operation of the Web page collection unit in an embodiment of the present invention.
FIG. 6 is an example of a starting point list, end condition, and collection discrimination rule in an embodiment of the present invention.

Explanation of symbols

101: starting point list storage means, starting point list storage unit
102: rule storage means, rule storage unit
103: Web page storage means, Web page storage unit
104: collection page discrimination rule creating means, collection page discrimination rule creation unit
105: end condition setting means, end condition setting unit
106: Web page collecting means, Web page collection unit
107: index assignment unit

Claims

An information space search apparatus that collects Web pages from an information space having a network structure, comprising:
starting point list storage means for storing a starting point (starting URL list) representing where collection begins;
rule storage means for storing a collection page discrimination rule, which is the criterion for deciding whether to collect a reached page;
Web page storage means for storing the collected Web pages;
collection page discrimination rule creating means for creating the collection page discrimination rule and storing it in the rule storage means;
end condition setting means for setting an end condition for collection; and
Web page collecting means for collecting Web pages based on the starting URL list in the starting point list storage means, the end condition, and the collection page discrimination rule in the rule storage means, and repeating the process of storing them in the Web page storage means until the end condition is reached.

The information space search apparatus according to claim 1, wherein the Web page collecting means includes:
means for acquiring the starting URL list from the starting point list storage means, acquiring the Web page at a URL in the starting URL list, and, when that Web page satisfies the collection page discrimination rule in the rule storage means, acquiring the Web page and the URLs it links to; and
means for acquiring the Web page at the first of the link destination URLs and collecting that Web page when it satisfies the collection page discrimination rule.

The information space search apparatus according to claim 1 or 2, wherein the collection page discrimination rule creating means sets, as the collection page discrimination rule, any or all of:
・ for URLs: similarly written URLs, same domain;
・ for words used in the Web document: the presence or absence of specific keywords, bias in the words used;
・ for tags: the content of alt attributes, the number of img tags.

The information space search apparatus according to claim 1, wherein the end condition setting means sets, as the end condition, any of:
・ a condition based on the starting point;
・ a condition based on time;
・ a condition based on data volume.

An information space search program for causing a computer to function as each of the means constituting the information space search apparatus according to any one of claims 1 to 4.