JP2002082923A

JP2002082923A - Information collating method, information collating device, and recording medium with information collection program recorded thereon

Info

Publication number: JP2002082923A
Application number: JP2000273607A
Authority: JP
Inventors: Nobuharu Noto; 信晴能登; Hiroshi Takeno; 浩竹野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-09-08
Filing date: 2000-09-08
Publication date: 2002-03-22

Abstract

PROBLEM TO BE SOLVED: To make it easy to increase the number of information collection processing units that can run simultaneously, and to allow collection performance (the number of items of information collected per unit time) to be improved as needed. SOLUTION: Computers CA, CB, and CC are connected through a network 7 to a plurality of WWW servers. When the information at a URL received from a URL list reading part 1A has not yet been collected, collection processing parts 2A1 and 2A2 in the computer CA acquire the WWW page from the WWW server indicated by the URL and store it on a file system. A URL extracting part 3A extracts URLs from the file in which the WWW page is stored, determines by calculation from the WWW server name indicated by each URL the number of the collection processing part in charge of its collection, and outputs the extracted URL on a disk 6A to the directory corresponding to whichever of the computers CA, CB, and CC the collection processing part in charge belongs to, thereby notifying that computer of the URL.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to an information collection method for collecting information in parallel from a plurality of information providing servers that are located on a network and provide information to which an information address, consisting of an information providing server name and an identifier on that server, is assigned.

[0002]

[Prior art] Conventionally, when information is collected in parallel from a plurality of information providing servers, the approach described in the paper by Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Proceedings of the Seventh International World Wide Web Conference (1998)), has been used. A single information address management module (called the URL server in that paper, where information addresses are called URLs) runs at any one time. This module distributes information addresses so that collection from the same server is not carried out by more than one processing unit, and has the processing units perform the collection (FIG. 2).

[0003]

[Problem to be solved by the invention] With this method, however, every information address discovered during collection must be processed by the information address management module. As the number of processing units grows, the management module can no longer keep up with the rate at which information addresses are discovered, and this degrades performance.

[0004] Suppose, then, that each processing unit is given a way to determine which processing unit is responsible for collecting each information address it discovers. Consider the number of connections needed between processing units to pass information addresses along. With n processing units, 2n connections suffice if discovered information addresses are passed via an information address management module, but n² connections are needed if the processing units notify one another directly, so the number of connections grows rapidly as processing units are added. As long as processing units can be added on a single device, notification between them is relatively simple. Once more processing units are needed than one device can host, however, multiple devices must be used, and notification across devices is generally harder. In addition, when the processing unit to be notified has stopped, the sender needs a means of holding the information addresses awaiting notification and must check periodically whether the receiving unit has recovered, which complicates the processing.
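The connection counts in paragraph [0004] can be tabulated with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the formulas simply restate the 2n and n² figures given in the text.

```python
def connections_via_manager(n: int) -> int:
    # The text counts 2n connections when every unit talks only to the manager.
    return 2 * n


def connections_direct(n: int) -> int:
    # The text counts n * n connections when units notify one another directly.
    return n * n


for n in (2, 6, 60):
    print(n, connections_via_manager(n), connections_direct(n))
```

With six units the difference is modest (12 versus 36), but at sixty units direct notification already needs 3600 connections against 120.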

[0005] An object of the present invention is to provide an information collection method, an information collection device, and a recording medium on which an information collection program is recorded that eliminate the need for an information address management module, which hinders such performance improvement, simplify notification between processing units, make it easy to increase the number of processing units that can run simultaneously, and allow collection performance (the number of items of information collected per unit time) to be improved as needed.

[0006]

[Means for solving the problem] In the information collection method of the present invention, a plurality of collection processing units collect information in parallel, with only one collection processing unit able to connect to any one information providing server at a time. Each collection processing unit performs a calculation based on the information providing server name contained in an information address found in the collected information, thereby determines the collection processing unit that should collect the information at that address, and notifies that collection processing unit of the information address.

[0007] A collection processing unit takes information addresses as input. Each collection processing unit has a collected information address database with which it can determine whether an information address has already been collected. An incoming information address is first checked against this database to confirm that it has not yet been collected. If it has already been collected, the information address is discarded.

[0008] An information address consists of an information providing server name and an identifier on that server. Based on the information address, the collection processing unit connects to the information providing server and specifies the identifier on the server to obtain the information the address points to. After obtaining the information, it registers the information address as collected in the collected information address database. Because the information obtained from the server may itself contain information addresses, the unit analyzes it and outputs any information addresses found. A plurality of such collection processing units are run simultaneously to collect information.

[0009] All collection processing units share a calculation method that takes an information providing server name as input and determines the collection processing unit responsible for collecting information from that server. Consequently, whichever collection processing unit performs the calculation, the same input server name always yields the same responsible collection processing unit.
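One way such a shared calculation could look is sketched below in Python. The text only requires that every unit computes the same answer for the same server name; the choice of SHA-256, the lower-casing of the name, and the function name `assign_unit` are assumptions made for illustration.

```python
import hashlib


def assign_unit(server_name: str, num_units: int) -> int:
    """Map a server name to a collection processing unit number in [0, num_units)."""
    # Host names are case-insensitive, so normalise before hashing (an assumption).
    digest = hashlib.sha256(server_name.lower().encode("utf-8")).digest()
    # A cryptographic digest is stable across processes and machines, unlike
    # Python's built-in hash() for strings, which is randomised per process.
    return int.from_bytes(digest[:8], "big") % num_units
```

Because the function depends only on its inputs, any unit on any machine that calls `assign_unit("www.example.com", 6)` obtains the same unit number, which is the property the paragraph relies on.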

[0010] Each collection processing unit applies this calculation method to the information providing server name contained in an information address found in the collected information, determines the collection processing unit that should collect the information at that address, and notifies that unit.

[0011] This notification uses specific directories on a file system. One directory is prepared for each pair of source and destination collection processing units. In this directory the source creates a file whose name is a number and writes to it the information addresses to be notified. After a fixed period of time or a fixed number of information addresses, the source closes the file, creates a new file whose number is one greater than that of the closed file, and continues writing information addresses there. In this way the source writes information addresses to sequentially numbered files. The source never modifies a file once it has been closed.

[0012] The source may still be writing information addresses to the highest-numbered file in the directory, so the receiving collection processing unit must not read that file. All other files may be read, because the source has finished writing them and they are guaranteed not to change.

[0013] This notification method may also be applied between groups, each formed by combining a plurality of collection processing units.

[0014] When notification between collection processing units goes through the file system in this way, the source is unaffected even if the receiving collection processing unit has stopped for some reason, and can go on writing information addresses to the file system. When the receiving unit resumes operation, it can start reading the information addresses already written to the file system. The advantage is that the stopping of an individual collection processing unit does not affect the others.

[0015] This method can also use a file system that is accessible across devices, such as NFS (Network File System), so the notification method between collection processing units need not change even when devices are added in order to increase the number of collection processing units.

[0016] Furthermore, when a processing unit, or a group formed from a plurality of processing units, notifies an information address found in the information it has collected, it registers that information address in an already-notified information address list that the unit or group itself holds. Before notifying an information address, it confirms that the address is not yet registered in the already-notified information address list.

[0017] To keep the same information address from being notified more than once to the collection processing unit responsible for it, one could prepare a database shared by all collection processing units and query it before each notification. Queries to such a database would clearly become a bottleneck, however, so this is not a practical solution. Using the already-notified information address list guarantees that a given information address is output at most once by any one collection processing unit. The responsible collection processing unit therefore receives the same information address at most as many times as there are collection processing units, or groups of collection processing units, which limits the number of duplicate notifications.
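The bound on duplicate notifications can be illustrated with a small Python simulation. The `Notifier` class, the in-memory `inbox` list, and the example URL are hypothetical stand-ins; the only behaviour taken from the text is that each sender keeps its own already-notified list and sends a given address at most once.

```python
from collections import Counter


class Notifier:
    """Sender side of one collection processing unit, with its own already-notified list."""

    def __init__(self):
        self.already_notified = set()

    def notify(self, url, outbox):
        # Send only if this unit has never sent the URL before.
        if url in self.already_notified:
            return False
        self.already_notified.add(url)
        outbox.append(url)
        return True


num_units = 6
inbox = []  # what the unit in charge of the URL receives
notifiers = [Notifier() for _ in range(num_units)]
for _ in range(10):  # every unit discovers the same URL ten times
    for notifier in notifiers:
        notifier.notify("http://www.example.com/index.html", inbox)

received = Counter(inbox)
```

Although the URL is discovered sixty times in total, the receiver sees it six times, once per sending unit, and its own collected-address database filters out the remaining duplicates.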

[0018]

[Embodiments of the invention] Next, an embodiment of the present invention will be described with reference to the drawings.

[0019] FIG. 1 is a system configuration diagram showing an embodiment of the present invention.

[0020] A plurality of WWW servers (not shown), which are the information providing servers, are located on a network 7. Each WWW page, the information obtainable from a WWW server, is assigned an information address called a URL, which consists of the name of the WWW server and an identifier on that server.
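As an illustration of this two-part structure, the following Python sketch splits a URL into a server name and an identifier using the standard library. The helper name `split_url` and the decision to treat path plus query string as the identifier are assumptions; the text does not specify how the identifier is delimited.

```python
from urllib.parse import urlsplit


def split_url(url: str):
    """Return (server_name, identifier) for a URL."""
    parts = urlsplit(url)
    server_name = parts.hostname or ""  # lower-cased host, without the port
    identifier = parts.path or "/"
    if parts.query:
        identifier += "?" + parts.query
    return server_name, identifier
```

For example, `split_url("http://WWW.Example.com/docs/index.html")` yields the server name `www.example.com` and the identifier `/docs/index.html`.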

[0021] Information is collected from the WWW servers using one or more computers. Three computers CA, CB, and CC are used here, but any number may be used. The computers CA, CB, and CC used for collection are connected to one another by the network 7, and are also connected to the WWW servers by the network 7.

[0022] On the computers CA, CB, and CC, URL list reading units 1A, 1B, 1C, collection processing units 2A₁, 2A₂, 2B₁, 2B₂, 2C₁, 2C₂, and URL extraction units 3A, 3B, 3C are running. The description here uses the example of one URL list reading unit, two collection processing units, and one URL extraction unit operating as a set, but a set may contain any number of each kind of module (unit) as long as there is at least one of each. Any number of such sets may run on one computer; the description here assumes one set per computer.

[0023] The collection processing units 2A₁, 2A₂, ..., 2C₂ are assigned serial numbers (0, 1, 2, 3, 4, 5).

[0024] The URL list reading units 1A, 1B, 1C and the URL extraction units 3A, 3B, 3C hold a formula (the assignment formula) that takes a WWW server name as input and uniquely determines the number of the collection processing unit responsible for collecting from that WWW server. This formula can be implemented with, for example, a hash function. Consequently, the URL list reading unit or URL extraction unit on any computer always arrives at the same collection processing unit number when given the same WWW server name.

[0025] Each URL list reading unit 1A, 1B, 1C is first given a file (the initial collection URL file) containing one or more URLs of WWW pages provided by WWW servers for which the collection processing units connected to that URL list reading unit are responsible. The unit reads this file and, for each URL, extracts the WWW server name from the URL, applies the assignment formula to it, and outputs the URL to the collection processing unit with the resulting number.

[0026] The collection processing units 2A₁, 2A₂, 2B₁, 2B₂, 2C₁, 2C₂ have collected URL databases 4A₁, 4A₂, 4B₁, 4B₂, 4C₁, 4C₂, respectively. Given a URL, each collected URL database 4A₁ to 4C₂ answers whether that URL is already registered as collected.

[0027] On receiving a URL from the URL list reading units 1A to 1C, the collection processing units 2A₁ to 2C₂ pass the URL to the collected URL databases 4A₁ to 4C₂ and, if the information the URL points to has not yet been collected, collect the WWW page the URL points to. Otherwise they discard the URL and look up the next URL obtained from the URL list reading units 1A to 1C in the collected URL databases 4A₁ to 4C₂. For a URL to be collected, the collection processing unit connects to the WWW server indicated by the URL and retrieves the WWW page the URL points to. The collection processing units 2A₁ to 2C₂ save the retrieved WWW page on a file system (not shown) and register the URL as collected in the collected URL databases 4A₁ to 4C₂. They then notify the URL extraction units 3A to 3C of the name of the file in which the collected WWW page was saved. The URL extraction units 3A to 3C analyze the file with the notified name and extract any URLs it contains. The WWW server name indicated by each extracted URL is applied to the assignment formula to obtain the number of the collection processing unit responsible for collection. This number determines which computer's URL list reading unit should be notified.
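The per-URL behaviour of a collection processing unit described in paragraph [0027] might be sketched in Python as follows. Everything beyond the sequence of steps is an assumption for illustration: the class name `CollectionUnit`, the in-memory set standing in for the collected URL database, the injected `fetch` callable (so that no real network access is implied), and the `page<N>.html` file naming.

```python
import os


class CollectionUnit:
    def __init__(self, fetch, store_dir):
        self.fetch = fetch          # hypothetical callable: url -> page bytes
        self.store_dir = store_dir  # where retrieved pages are saved
        self.collected = set()      # stands in for the collected URL database
        self.counter = 0

    def handle(self, url):
        """Return the saved file name, or None if the URL was already collected."""
        if url in self.collected:
            return None             # already collected: discard the URL
        page = self.fetch(url)      # retrieve the page the URL points to
        path = os.path.join(self.store_dir, f"page{self.counter}.html")
        self.counter += 1
        with open(path, "wb") as f:
            f.write(page)
        self.collected.add(url)     # register as collected after the page is saved
        return path                 # this name goes to the URL extraction unit
```

Passing `fetch` in from outside keeps the sketch testable with a fake fetcher; a real unit would perform the HTTP request at that point.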

[0028] Under a specific directory on each of the computers CA, CB, and CC there are directories a, b, and c. Assume that the directories /CA/a, /CA/b, and /CA/c are on the disk 6A of the computer CA, the directories /CB/a, /CB/b, and /CB/c are on the disk 6B of the computer CB, and the directories /CC/a, /CC/b, and /CC/c are on the disk 6C of the computer CC.

[0029] URLs extracted by the URL extraction unit 3A of the computer CA are output to the disk 6A. This allows the extracted URLs to be output as long as the computer CA is running, even if the computers CB and CC have stopped. The URL extraction unit 3A outputs URLs to be collected by the collection processing units 2A₁ and 2A₂ of the computer CA to directory a on its own disk 6A, URLs to be collected by the collection processing units 2B₁ and 2B₂ of the computer CB to directory b on its own disk 6A, and URLs to be collected by the collection processing units 2C₁ and 2C₂ of the computer CC to directory c on its own disk 6A. Output file names start from a specific number. When a fixed number of URLs has been written to one file, or a fixed time has passed since output to it began, the file is closed, a file named with the next higher number is created, and URL output continues there. The description here assumes that the first file is named 0.
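A minimal Python sketch of this output scheme follows. It rotates files by URL count only; the time-based rotation mentioned in the text is left out for brevity. The class and function names, the default of 1000 URLs per file, and the one-URL-per-line file format are assumptions, and `directory_for_unit` assumes the example layout of two collection processing units per computer with directories a, b, and c.

```python
import os


def directory_for_unit(base, unit_number, units_per_computer=2):
    # Units 0 and 1 belong to the first computer (directory a), 2 and 3 to b, 4 and 5 to c.
    return os.path.join(base, "abc"[unit_number // units_per_computer])


class NumberedFileWriter:
    """Writes URLs to files named 0, 1, 2, ... and never reopens a closed file."""

    def __init__(self, directory, max_urls=1000):
        self.directory = directory
        self.max_urls = max_urls
        self.number = 0
        self.count = 0
        os.makedirs(directory, exist_ok=True)
        self.file = self._open()

    def _open(self):
        return open(os.path.join(self.directory, str(self.number)), "w", encoding="utf-8")

    def write(self, url):
        self.file.write(url + "\n")
        self.file.flush()
        self.count += 1
        if self.count >= self.max_urls:
            # Close the full file, then create the next number; its appearance
            # tells readers that the previous file is complete.
            self.file.close()
            self.number += 1
            self.count = 0
            self.file = self._open()

    def close(self):
        self.file.close()
```

Creating the next file immediately after closing the previous one means a reader only has to check for the existence of file n to know that file n-1 will no longer change.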

[0030] Before outputting a URL, each URL extraction unit 3A, 3B, 3C queries its own output URL database 5A, 5B, 5C to see whether the URL has already been output. If it has not, the unit registers the URL in the output URL database 5A, 5B, 5C and then outputs the URL as described above.

[0031] Each of the computers CA, CB, and CC exports these directories to the other computers using the Network File System (NFS). Each computer CA, CB, CC can therefore access directories and files on another computer's disk as if they were on its own disk.

[0032] After reading the initial collection URL file, the URL list reading units 1A, 1B, 1C of the computers CA, CB, CC try to read the URL files output by the URL extraction units 3A, 3B, 3C. For example, the URL list reading unit 1A of the computer CA reads files from /CA/a on the computer CA, /CB/a on the computer CB, and /CC/a on the computer CC. For each directory, it carries out the following procedure:

1. Set n = 1.

[0033] 2. Check whether a file numbered n exists in the directory. If not, wait until it does. If it does, read the file numbered (n−1).

[0034] 3. Increase n by 1.

[0035] 4. Repeat steps 1 to 3.

When reading such a file, the unit performs the same processing as when it read the initial collection URL file. This establishes the rule that the URL list reading units 1A, 1B, 1C never read from a file that the URL extraction units 3A, 3B, 3C are still writing.
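The reading procedure for one directory can be sketched in Python as below. Two points are interpretations rather than statements from the text: the counter n is kept across passes instead of being reset to 1, and "wait until the file exists" is modelled by returning from `poll` so the caller can call it again later. The class name and the one-URL-per-line file format are likewise assumptions.

```python
import os


class NumberedFileReader:
    """Reads files 0, 1, 2, ... from one directory, always skipping the newest."""

    def __init__(self, directory):
        self.directory = directory
        self.n = 1  # step 1

    def poll(self):
        """Return URLs from every file that is now safe to read."""
        urls = []
        # Step 2: file n exists, so the writer has closed file n-1.
        while os.path.exists(os.path.join(self.directory, str(self.n))):
            path = os.path.join(self.directory, str(self.n - 1))
            with open(path, encoding="utf-8") as f:
                urls.extend(line.strip() for line in f if line.strip())
            self.n += 1  # step 3
        return urls
```

A URL list reading unit would hold one such reader per source directory (for unit 1A, one each for /CA/a, /CB/a, and /CC/a) and call `poll` on each in turn.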

[0036] The processing of the URL list reading units 1A, 1B, 1C, the collection processing units 2A₁, 2A₂, 2B₁, 2B₂, 2C₁, 2C₂, and the URL extraction units 3A, 3B, 3C in the computers CA, CB, CC is in practice stored as an information collection program on a recording medium such as a floppy disk, CD-ROM, or magneto-optical disk, and executed.

[0037]

[Effects of the invention] As described above, according to the present invention, the number of information collection processing units that can run simultaneously can be increased easily, and collection performance (the number of items of information collected per unit time) can be improved as needed.

[Brief description of the drawings]

FIG. 1 is a system configuration diagram showing an embodiment of the present invention.

FIG. 2 is a diagram for explaining the prior art.

[Explanation of symbols]

CA, CB, CC: computers
1A, 1B, 1C: URL list reading units
2A₁, 2A₂, 2B₁, 2B₂, 2C₁, 2C₂: collection processing units
3A, 3B, 3C: URL extraction units
4A₁, 4A₂, 4B₁, 4B₂, 4C₁, 4C₂: collected URL databases
5A, 5B, 5C: output URL databases
6A, 6B, 6C: disks
7: network

Claims

1. An information collection method for collecting information from a plurality of information providing servers that are located on a network and provide information to which an information address, consisting of an information providing server name and an identifier on that server, is assigned, wherein a plurality of collection processing units collect information in parallel, only one collection processing unit being able to connect to any one information providing server at a time, and each collection processing unit performs a calculation based on the information providing server name contained in an information address found in the collected information, determines the collection processing unit that should collect the information at that address, and notifies that collection processing unit of the information address.

2. The method according to claim 1, wherein, when a collection processing unit that has found an information address notifies another collection processing unit of it, the information address is output to sequentially numbered files on a file system, and the collection processing unit receiving the notification reads, in numerical order, the files other than the one with the highest number at that time, so that notification is carried out between collection processing units or between groups of collection processing units.

3. The described method, wherein, when outputting to the files, each collection processing unit holds a list of the information addresses it has output and outputs the same information address only once.

4. An information collection device for collecting information from a plurality of information providing servers that are located on a network and provide information to which an information address, consisting of an information providing server name and an identifier on that server, is assigned, the device comprising: one or more collection processing units, each of which has a collected information address database in which information addresses whose information has already been collected are registered and, on receiving an information address, refers to the collected information address database and, if the information at the information address has not been collected, connects to the information providing server indicated by the information address to collect the information, saves the collected information in a file, registers the information address as collected in the collected information address database, and then reports the name of the file in which the collected information was saved; an information address list reading unit that extracts the information providing server name from each information address in an initial collection information address file, which stores information addresses of information providing servers for which the collection processing units are responsible, determines by calculation from the information providing server name the collection processing unit responsible for collecting information from that server, and outputs the information address to that collection processing unit; and an information address extraction unit that analyzes the file whose name was reported by the collection processing unit, extracts any information address it contains, determines by calculation from the information providing server name indicated by the extracted information address the collection processing unit, of its own or another information collection device, responsible for collecting the information, and outputs the extracted information address to a file, on the file system of its own information collection device, corresponding to the information collection device to which the determined collection processing unit belongs, wherein the information address list reading unit, after reading the initial collection information address file, reads the files corresponding to its own information collection device from the file systems to which the information address extraction units of its own and the other information collection devices have output information addresses.

5. The device according to claim 4, wherein the information address extraction unit outputs information addresses to sequentially numbered files on a file system, and the information address list reading unit reads, in numerical order, the files other than the one with the highest number at that time.

6. The device according to claim 4, wherein, when outputting an information address to the file system, the information address extraction unit checks whether the extracted information address is registered in an output information address database and, if it is not, registers the information address in the output information address database and then outputs it to the file system.

7. A recording medium on which is recorded an information collection program for collecting information from a plurality of information providing servers that are located on a network and provide information to which an information address, consisting of an information providing server name and an identifier on that server, is assigned, the program comprising: one or more collection processes, each of which, on receiving an information address, refers to a collected information address database in which information addresses whose information has already been collected are registered and, if the information at the information address has not been collected, connects to the information providing server indicated by the information address to collect the information, saves the collected information in a file, registers the information address as collected in the collected information address database, and then reports the name of the file in which the collected information was saved; an information address list reading process that extracts the information providing server name from each information address in an initial collection information address file, which stores information addresses of information providing servers for which the collection processes are responsible, determines by calculation from the information providing server name the collection process responsible for collecting information from that server, and outputs the information address to that collection process; and an information address extraction process that analyzes the file whose name was reported by the collection process, extracts any information address it contains, determines by calculation from the information providing server name indicated by the extracted information address the collection process, of its own or another information collection program, responsible for collecting the information, and outputs the extracted information address to a file, on the file system of its own information collection program, corresponding to the information collection program to which the determined collection process belongs, wherein the information address list reading process, after reading the initial collection information address file, reads the files corresponding to its own information collection program from the file systems to which the information address extraction processes of its own and the other information collection programs have output information addresses.

8. The recording medium according to claim 7, wherein the information address extraction process outputs information addresses to sequentially numbered files on a file system, and the information address list reading process reads, in numerical order, the files other than the one with the highest number at that time.

9. The recording medium according to claim 7, wherein, when outputting an information address to the file system, the information address extraction process checks whether the extracted information address is registered in an output information address database and, if it is not, registers the information address in the output information address database and then outputs it to the file system.