JP2005135198A

JP2005135198A - Document collection apparatus and document collection method

Info

Publication number: JP2005135198A
Application number: JP2003371229A
Authority: JP
Inventors: Masayuki Sugizaki; 正之杉崎; Toshiaki Makino; 俊朗牧野; Hiroshi Takeno; 浩竹野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-10-30
Filing date: 2003-10-30
Publication date: 2005-05-26

Abstract

PROBLEM TO BE SOLVED: To provide a document collection apparatus and a document collection method that, when automatically acquiring documents held on other computers through a computer network, avoid wastefully re-collecting documents whose contents have not changed and estimate with high accuracy the date and time at which a document's contents were updated.
SOLUTION: A date and time information extraction unit extracts a plurality of dates and times from a given document, the date and time at which that document should next be collected is calculated from them, and the document is collected at the calculated next collection date and time.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an apparatus and a method for automatically collecting documents stored on other computers via a computer network, and more particularly to an apparatus and a method that collect documents at predetermined dates and times.

In recent years, it has become possible to exchange large volumes of electronic documents over computer networks such as the Internet and to publish information to an unspecified large audience.

Documents on the network also use forms of expression that take advantage of the medium. In particular, HTML documents on the WWW (World Wide Web) not only carry information themselves but also provide a "hyperlink" function, which lets a document refer to documents written by other people and stored on other computers.

Hyperlinks are used, for example, to cite another document as a trusted source, to supplement the information the author has written, or to point to a document with the same content.

HTML documents are distributed across many computers, and a notation called the "URL (Uniform Resource Locator)" is defined so that each document can be referenced and accessed easily (see, for example, Non-Patent Document 1). A URL is generally written in the form "http://computer address/directory name/…/file name".

Search services exist that collect documents distributed across various computers and make them searchable (e.g., http://www.google.com/, http://www.goo.ne.jp/). Implementing such a search service requires an apparatus that periodically collects documents held on other computers.

From the standpoint of using computer resources effectively, collecting exactly the same document over and over is wasteful; on the other hand, the freshness of the information cannot be maintained unless documents whose contents have changed are collected. The next collection date and time therefore needs to be determined as accurately as possible.

Accordingly, a method is known that estimates the date and time at which a document's contents were updated from the results of collecting the document several times, and collects the document at the estimated date and time (see, for example, Patent Document 1).
Patent Document 1: JP 2003-099353 A
Non-Patent Document 1: T. Berners-Lee, L. Masinter, M. McCahill, "Uniform Resource Locators (URL)", RFC 1738, December 1994

In the above conventional example, the accuracy of estimating when a document's contents were updated can be raised by shortening the interval between collections. Shortening the interval, however, creates the problem that documents with the same contents are wastefully collected many times.

An object of the present invention is to provide a document collection apparatus and a document collection method that, when automatically acquiring documents held on other computers via a computer network, do not wastefully collect documents with the same contents many times and yet estimate with high accuracy the date and time at which a document's contents were updated.

The present invention is a document collection apparatus and a document collection method in which the date and time at which a given document is to be collected next are calculated on the basis of a plurality of dates and times that a date and time information extraction unit has extracted from that single document, and the document is collected at the calculated next collection date and time.

Because the present invention calculates the next collection date and time of a given document from a plurality of dates and times extracted from that single document, and collects the document at the calculated date and time, it reduces wasteful collection while also estimating with high accuracy the date and time at which the document's contents were updated.

The best mode for carrying out the invention is the following embodiment.

FIG. 1 is a block diagram showing the schematic configuration of a document collection apparatus 100 according to Embodiment 1 of the present invention.

The document collection apparatus 100 automatically acquires documents held on other computers via a computer network. It includes a collection target management unit 10, a document collection unit 20, a date and time information extraction unit 30, a next collection date and time calculation unit 40, and a management database 50.

The collection target management unit 10 records the documents to be collected and the dates and times at which to collect them.

The document collection unit 20 collects documents according to the collection targets and collection dates and times managed by the collection target management unit 10.

The date and time information extraction unit 30 analyzes a document collected by the document collection unit 20 and extracts the plurality of dates and times written in it.

The next collection date and time calculation unit 40 calculates the date and time at which the document is to be collected next, on the basis of the plurality of dates and times that the date and time information extraction unit 30 extracted from that one document, and sends the calculated next collection date and time to the collection target management unit 10.

Incidentally, a single document (HTML file) can contain several separate pieces of information, and some documents show, alongside each piece, the date and time at which it was written.

For example, HTML files provided by newspaper publishers list article headlines together with the time at which each article was posted. In the information-sharing services on the Internet known as "bulletin boards", many users, unspecified or specified, can write to and read from a single HTML file, and the date and time of each post is shown with it. In services known as "diaries" (services in which individuals keep records and publish them), information is published in units of a calendar date (month and day).

Such dates and times record when each item was written and, at the same time, can be regarded as marking the moments at which the file was updated. The plurality of dates and times appearing in such a file are therefore extracted and sorted from oldest to newest, and the approximate time of the next update is predicted by calculation.

To extract date and time information, rules describing how dates and times are written are set in advance, and any character string that matches one of these rules is extracted as a date and time. Examples of extraction rules include the following:
YYYY/MM/DD hh:mm:ss … (1)
YYYY-MM-DD hh:mm:ss … (2)
YYYYMMDD hh:mm … (3)
YYYY年MM月DD日hh時mm分ss秒 … (4)
平成yy年MM月DD日hh時mm分ss秒 … (5)

Here, YYYY is the Western calendar year, MM the month, DD the day, hh the hour, mm the minute, and ss the second. In rules (4) and (5), the characters 年, 月, 日, 時, 分, and 秒 are the Japanese markers for year, month, day, hour, minute, and second, and 平成yy is a year of the Heisei era. Kanji numerals may also be used. In addition, a date or time may be enclosed in brackets or similar symbols (for example, [hh:mm]), and the month may be written alphabetically (for example, "Apr" or "April" for April). The extraction rules are drawn up with these variations in mind.
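The rule-based extraction described above could be sketched with regular expressions roughly as follows. This is only an illustration under stated assumptions: the names `RULES` and `extract_datetimes` are hypothetical, only rules (1), (2), and (4) are covered, and the tolerance for optional whitespace between the date and the time is an assumption rather than something the text specifies.

```python
import re
from datetime import datetime

# Hypothetical rule table: rule number -> compiled pattern.
# Only rules (1), (2) and (4) are sketched; optional whitespace between
# the date part and the time part is an assumption.
RULES = {
    1: re.compile(r"(\d{4})/(\d{1,2})/(\d{1,2})\s*(\d{1,2}):(\d{2}):(\d{2})"),
    2: re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})\s*(\d{1,2}):(\d{2}):(\d{2})"),
    4: re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日(\d{1,2})時(\d{1,2})分(\d{1,2})秒"),
}


def extract_datetimes(text):
    """Return a list of (rule_number, datetime) for every rule match in text."""
    found = []
    for rule_no, pattern in RULES.items():
        for match in pattern.finditer(text):
            try:
                found.append((rule_no, datetime(*map(int, match.groups()))))
            except ValueError:
                # The digits fit the pattern but do not form a real date/time.
                continue
    return found
```

Keeping the rule number with each match makes it possible to tell later which notation a given date and time was written in. A bare month mention such as "4月" matches none of these patterns and is therefore not extracted.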

Next, an example of how the time of the next update is calculated in the above embodiment is described.

FIG. 2 shows an example of how an HTML file is displayed.

FIG. 3 is a flowchart showing the operation of the above embodiment.

First, the simplest example. The collection target management unit 10 holds a description of what to collect and when to collect it; collection targets are specified by URL. Following this schedule, the document collection unit 20 acquires an HTML file from another computer (S1). The date and time information extraction unit 30 then extracts from the acquired HTML the date and time information that matches the rules prepared in advance (S2).

The extracted date and time information of the same format is sorted in ascending order (S3); that is, the dates and times extracted from the file are arranged from oldest to newest.

The next collection date and time calculation unit 40 then determines the date and time of the next collection using only the date and time information written in the most frequently occurring notation (S4). Specifically, each date and time appearing in the HTML file is checked against rules (1) to (5) above (or any other rules) to find the rule with the most matches, and only the dates and times matching that rule are used to decide when to collect next.
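Steps S3 and S4 can be illustrated with a short sketch. It assumes the extraction step yields `(rule_id, datetime)` pairs; the function name `select_dominant` and that pair layout are hypothetical. The sketch filters by the most frequent rule first and sorts afterwards, which gives the same result as sorting first and filtering later.

```python
from collections import Counter
from datetime import datetime


def select_dominant(matches):
    """Keep only timestamps written in the most frequent notation, oldest first.

    matches: list of (rule_id, datetime) pairs (an assumed layout).
    Returns (rule_id, sorted_datetimes), or (None, []) when nothing was extracted.
    """
    if not matches:
        return None, []
    counts = Counter(rule_id for rule_id, _ in matches)
    # On a tie, Counter.most_common keeps the rule that was seen first.
    best_rule, _ = counts.most_common(1)[0]
    selected = sorted(dt for rule_id, dt in matches if rule_id == best_rule)
    return best_rule, selected
```

How ties between equally frequent rules should be broken is not stated in the text; the first-seen behavior here is simply what the standard library does.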

In the result sorted from oldest to newest, let t_n be the date and time at which the n-th item was posted. The next collection date and time t_{n+1} (the date and time at which the next item is expected to be posted) is predicted from the previous posting time t_{n-1} and the latest posting time t_n, using the difference between the two. That is, the next collection date and time t_{n+1} is
t_{n+1} = t_n + (t_n - t_{n-1}) … (6)

The next collection date and time t_{n+1} calculated in this way is sent to the collection target management unit 10.
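Equation (6) amounts to repeating the most recent posting interval once more. A minimal sketch, with the hypothetical function name `predict_next_by_last_interval` and Python `datetime` values standing in for the extracted dates and times:

```python
from datetime import datetime


def predict_next_by_last_interval(timestamps):
    """Apply Equation (6): t_{n+1} = t_n + (t_n - t_{n-1})."""
    ordered = sorted(timestamps)  # oldest first, as in step S3
    if len(ordered) < 2:
        raise ValueError("at least two timestamps are needed")
    return ordered[-1] + (ordered[-1] - ordered[-2])
```

For instance, if the two newest entries are at 13:25:00 and 13:38:00 on the same day, the interval is 13 minutes and the predicted next posting time is 13:51:00.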

Alternatively, the next collection date and time t_{n+1} may be calculated by the least squares method. Let T_i be the date and time at which the i-th item in the HTML file was posted, and let x_i be the expected value of that date and time under the assumption that items are posted at a fixed, regular interval. It then suffices to find the x_i that satisfy the least-squares condition
\sum_i (T_i - x_i)^2 \to \min … (7)
where x_i can be approximated by the straight line
x_i = a * i + b … (8)

The least squares method is described in Yutaka Tanaka and Kazumasa Wakimoto, "Multivariate Statistical Analysis" (多変量統計解析法), Gendai Sugakusha, 1983, Chapter 1.
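For reference, the standard closed-form least-squares solution for a straight line x_i = a * i + b fitted to n observations T_1, …, T_n is sketched below. The objective S(a, b) and the extrapolation to index n + 1 are written out here as one plausible reading of how the fitted line would be used; the text itself does not spell out these steps.

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( T_i - (a\,i + b) \bigr)^2 ,
\qquad
\frac{\partial S}{\partial a} = 0 , \quad \frac{\partial S}{\partial b} = 0

a = \frac{ n \sum_{i} i\,T_i - \sum_{i} i \, \sum_{i} T_i }
         { n \sum_{i} i^2 - \bigl( \sum_{i} i \bigr)^2 } ,
\qquad
b = \frac{ \sum_{i} T_i - a \sum_{i} i }{ n }

t_{n+1} \approx x_{n+1} = a\,(n + 1) + b
```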

The collection date and time may also be adjusted after documents have actually been collected, depending on, for example, whether the contents had changed.

Next, a concrete example of the above embodiment is described.

First, the documents to be collected and the dates and times at which to collect them are written in the collection target management unit 10 in advance. Collection targets are specified by URL, and dates and times are specified in the following format.

YYYY/MM/DD hh:mm:ss … (9)
Here, YYYY is the Western calendar year, MM the month, DD the day, hh the hour, mm the minute, and ss the second. Each collection date and time is written together with the document to be collected as one pair, for example:
2003/10/22 12:00:14 http://www.aaa.co.jp/ … (10)
2003/10/22 12:10:20 http://www.bbb.co.jp/pub/ … (11)
2003/10/22 12:13:20 http://www.aaa.co.jp/ … (12)

The document collection unit 20 collects each collection target managed by the collection target management unit 10 at its collection date and time. In the example above, when the current time reaches "2003/10/22 12:00:14", it attempts to acquire the HTML file matching "http://www.aaa.co.jp".
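The schedule format of (9) to (12) is simple enough to parse directly. The following sketch assumes one "date time URL" entry per line, separated by whitespace; the function names `parse_schedule` and `due_entries` are hypothetical, and actually fetching the documents is left out.

```python
from datetime import datetime


def parse_schedule(lines):
    """Parse 'YYYY/MM/DD hh:mm:ss URL' lines into sorted (datetime, url) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        date_part, time_part, url = line.split(maxsplit=2)
        when = datetime.strptime(f"{date_part} {time_part}", "%Y/%m/%d %H:%M:%S")
        entries.append((when, url))
    return sorted(entries)


def due_entries(entries, now):
    """Return the entries whose collection date and time has been reached."""
    return [(when, url) for when, url in entries if when <= now]
```

A collector built along these lines would call `due_entries` with the current time and attempt to acquire each returned URL.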

The date and time information extraction unit 30 analyzes the document collected by the document collection unit 20 and extracts the date and time information written in it. For example, if the document reads as shown in FIG. 2, the dates and times that follow a common format are extracted as follows (the leading number is the entry number):
1  2003/11/02 12:35:00 … (13)
2  2003/11/02 13:00:00 … (14)
3  2003/11/02 13:15:00 … (15)
4  2003/11/02 13:25:00 … (16)
5  2003/11/02 13:38:00 … (17)

Note that the character string "4月" ("April") is ignored because it does not follow the common format. That is, the "April" in "But this April…", written by the poster "Uuu" (ううう) in entry No. 3 of FIG. 2, does not indicate the time at which the poster wrote the entry, so the above embodiment does not use it when estimating when the file will next be updated.

Using these five dates and times, an approximating straight line is found by the least squares method. The difference between each time and the one before it is as follows (in minutes; the leading number is the entry number):
1  0 … (18)
2  25 … (19)
3  15 … (20)
4  10 … (21)
5  13 … (22)
The "0" for entry 1 marks the starting point: it is (13) - (13) = 0, that is, 2003/11/02 12:35:00 - 2003/11/02 12:35:00 = 0.

Finding the approximating straight line from (19), (20), (21), and (22), with these four intervals numbered i = 1 to 4, gives
t_i = -4.1 × i + 26 … (23)
The next interval (i = 5) is therefore
-4.1 × 5 + 26 = 5.5 … (24)
so the sixth update is expected 5.5 minutes after the fifth, at 2003/11/02 13:43:30.
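The worked example can be checked numerically. The sketch below fits a straight line to the successive posting intervals by ordinary least squares, numbering the intervals from 1, and extrapolates one step ahead. The function names `fit_line` and `predict_next_update` are hypothetical, and the index convention is inferred from the coefficients in (23) rather than stated in the text.

```python
from datetime import datetime, timedelta


def fit_line(ys):
    """Ordinary least-squares fit of y = a*i + b for i = 1..len(ys)."""
    n = len(ys)
    if n < 2:
        raise ValueError("at least two values are needed to fit a line")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_next_update(timestamps):
    """Fit the posting intervals (in minutes) and extrapolate the next one."""
    ordered = sorted(timestamps)
    intervals = [(later - earlier).total_seconds() / 60
                 for earlier, later in zip(ordered, ordered[1:])]
    a, b = fit_line(intervals)
    next_interval = a * (len(intervals) + 1) + b
    return a, b, ordered[-1] + timedelta(minutes=next_interval)
```

Called with the five timestamps (13) to (17), this gives a slope of about -4.1 and an intercept of about 26, and predicts the next update at 2003/11/02 13:43:30, in agreement with (23) and (24). With a negative slope the extrapolated interval eventually becomes zero or negative; the text does not say how that case should be handled, so the sketch leaves it untreated.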

The above embodiment can also be understood as an invention of a program. That is, the embodiment is an example of a document collection program for automatically acquiring documents held on other computers via a computer network, the program causing a computer to execute: a collection target management procedure of recording, in a collection target management unit, the documents to be collected and the dates and times at which to collect them; a document collection procedure in which a document collection unit collects documents according to the documents to be collected and the collection dates and times recorded in the collection target management unit; a date and time information extraction procedure in which a date and time information extraction unit analyzes a given document collected in the document collection procedure and extracts the plurality of dates and times written in the analyzed document; and a next collection date and time calculation procedure of calculating, on the basis of the plurality of dates and times extracted in the date and time information extraction procedure, the date and time at which the given document is to be collected next, and sending the calculated next collection date and time to the collection target management unit.

The program may be recorded on a recording medium such as an FD, CD, DVD, HD, or semiconductor memory.

FIG. 1 is a block diagram showing the schematic configuration of the document collection apparatus 100 according to Embodiment 1 of the present invention.
FIG. 2 shows an example of how an HTML file is displayed.
FIG. 3 is a flowchart showing the operation of the above embodiment.

Explanation of symbols

100: Document collection apparatus
10: Collection target management unit
20: Document collection unit
30: Date and time information extraction unit
40: Next collection date and time calculation unit
50: Management database
60: Document database

Claims

A document collection apparatus that automatically acquires documents held on other computers via a computer network, the apparatus comprising:
a collection target management unit that records documents to be collected and the dates and times at which to collect them;
a document collection unit that collects documents according to the documents to be collected and the collection dates and times recorded by the collection target management unit;
a date and time information extraction unit that analyzes a given document collected by the document collection unit and extracts a plurality of dates and times written in the analyzed document; and
a next collection date and time calculation unit that calculates, on the basis of the plurality of dates and times extracted by the date and time information extraction unit, the date and time at which the given document is to be collected next, and sends the calculated next collection date and time to the collection target management unit.

A document collection method for automatically acquiring documents held on other computers via a computer network, the method comprising:
a collection target management stage of recording, in a collection target management unit, documents to be collected and the dates and times at which to collect them;
a document collection stage in which a document collection unit collects documents according to the documents to be collected and the collection dates and times recorded in the collection target management unit;
a date and time information extraction stage in which a date and time information extraction unit analyzes a given document collected in the document collection stage and extracts a plurality of dates and times written in the analyzed document; and
a next collection date and time calculation stage of calculating, on the basis of the plurality of dates and times extracted in the date and time information extraction stage, the date and time at which the given document is to be collected next, and sending the calculated next collection date and time to the collection target management unit.

A document collection program for automatically acquiring documents held on other computers via a computer network, the program causing a computer to execute:
a collection target management procedure of recording, in a collection target management unit, documents to be collected and the dates and times at which to collect them;
a document collection procedure in which a document collection unit collects documents according to the documents to be collected and the collection dates and times recorded in the collection target management unit;
a date and time information extraction procedure in which a date and time information extraction unit analyzes a given document collected in the document collection procedure and extracts a plurality of dates and times written in the analyzed document; and
a next collection date and time calculation procedure of calculating, on the basis of the plurality of dates and times extracted in the date and time information extraction procedure, the date and time at which the given document is to be collected next, and sending the calculated next collection date and time to the collection target management unit.