JP2002259407A

JP2002259407A - Document collection device for particular application, its method and program for execution with computer

Info

Publication number: JP2002259407A
Application number: JP2001379280A
Authority: JP
Inventors: Hiroshi Tsuda; 宏津田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-12-27
Filing date: 2001-12-12
Publication date: 2002-09-13
Anticipated expiration: 2021-12-12
Also published as: JP4094844B2

Abstract

PROBLEM TO BE SOLVED: To enable documents for a particular application to be collected quickly through a network. SOLUTION: In a document collection device that collects documents for a community through a network, an initial document group that serves as the starting point of collection is provided as the collected document group 20 before collection starts. A reference relation extracting means 3 extracts reference relations from the collected document group 20. Based on those reference relations, a next candidate determining means 5 determines uncollected documents that satisfy a fixed condition as next collection candidates 21, that is, candidates for the documents to be collected next. A document collection means 2 collects the documents determined as next collection candidates and adds them to the collected document group 20. If the number of documents in the collected document group 20 is still below a prescribed number, the reference relation extracting means 3 extracts reference relations from the newly collected documents and the above processing is repeated. Document collection is thus repeated until the number of documents in the collected document group 20 reaches the prescribed number.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to document collection, and in particular to a document collection device and a document collection method that efficiently collect documents suited to a specific application.

[0002]

[Description of the Related Art] A search engine for documents on a network such as an intranet or the WWW is implemented by a document collection device (robot) that collects documents from the network and a search engine that creates a keyword index for the collected documents.

[0003] A document collection device starts collecting documents from a given set of seed URLs (Uniform Resource Locators), that is, the set of URLs that serves as the starting point of collection. It then collects, as next collection candidates, the uncollected documents that are referenced by anchors (reference relations) from already collected documents, and it operates by repeating this processing a fixed number of times. In this way, a document collection robot periodically collects documents from tens of millions to hundreds of millions of URLs. Here, a URL is a notation that specifies where information on a network is located and how to obtain it.

[0004] The number of documents on networks is growing rapidly. In January 2000, Inktomi and others published survey results indicating that the number of unique documents on the Internet had reached one billion. In July 2000, Cyveillance of the United States published survey results estimating the size of the Internet at about 2.1 billion documents and predicting that it would double again in 2001.

[0005] Collecting documents from one billion URLs would take about three years even at a rate of one million URLs per day (about 10 URLs per second = 40 Kbytes), and by the time collection finished, the information in the documents collected at the beginning would already be obsolete. There has therefore been a demand for an intelligent document collection device that efficiently collects only the information that is highly important for the application at hand.
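The estimate in this paragraph can be checked with simple arithmetic. The following is a minimal sketch; the function name `crawl_duration` and its parameters are illustrative and do not appear in the source. It suggests that the figures "three years" and "10 URLs per second" in the text are rounded values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_duration(total_urls, urls_per_day):
    """Return (days, years, urls_per_second) for a crawl at a constant daily rate."""
    days = total_urls / urls_per_day
    years = days / 365
    urls_per_second = urls_per_day / SECONDS_PER_DAY
    return days, years, urls_per_second


days, years, rate = crawl_duration(total_urls=1_000_000_000, urls_per_day=1_000_000)
print(f"{days:.0f} days, about {years:.1f} years, about {rate:.1f} URLs per second")
```

At one million URLs per day the crawl takes 1,000 days, which is a little under three years, and the sustained rate is slightly above ten URLs per second.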

[0006] Document collection devices that preferentially collect documents for a specific application include the following.
・Devices that preferentially collect new information, as in the invention disclosed in JP-A-9-311802.
・Devices that collect documents whose contents are considered similar. These devices adopt the following approaches.

[0007] a) Limiting the collection range by the number of levels. As in the invention disclosed in JP-A-9-218876, documents in a reference relation are considered close in content, but the semantic connection is lost when they are too many levels apart. Documents are therefore collected with the collection range limited by the number of levels.

[0008] b) Collecting only documents with similar semantic content. As in the invention disclosed in JP-A-10-105572, semantic closeness is calculated by matching the contents of documents, and among the documents in a reference relation only those that are semantically close are collected.

[0009] c) Collecting only documents for which the character string indicating the reference destination is appropriate. As in the inventions disclosed in JP-A-10-260979 and JP-A-2000-9011, whether to collect a referenced document next is determined from the reference expression, that is, the expression that represents the reference destination (in HTML, the content of the anchor tag).
・Devices that preferentially collect documents that are generally more popular.

[0010] A document with a large number of incoming references (the number of other documents that reference it) is considered to be popular. The idea is that popular documents can be collected preferentially by collecting documents in descending order of the number of references they receive from documents in the collected document group.
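The popularity-based ordering described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_by_reference_count`, the dictionary representation of the reference relations, and the rule of counting each referring document once are not taken from the source.

```python
from collections import Counter


def rank_by_reference_count(links, collected):
    """Order uncollected URLs by how many collected documents reference them.

    links: dict mapping a collected URL to the list of URLs it references.
    collected: set of URLs that have already been collected.
    """
    counts = Counter()
    for source, targets in links.items():
        if source not in collected:
            continue
        for target in set(targets):  # count each referring document once
            if target not in collected:
                counts[target] += 1
    # Descending count, then URL, so that the order is deterministic.
    return sorted(counts, key=lambda url: (-counts[url], url))


links = {"a": ["x", "y"], "b": ["x"], "c": ["x", "z", "a"]}
print(rank_by_reference_count(links, collected={"a", "b", "c"}))
```

Here `x` is referenced by three collected documents and is therefore placed first in the collection order.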

[0011]

[Problems to Be Solved by the Invention] However, the framework of the conventional techniques described above is by itself insufficient for collecting the documents required of a portal site for a community such as a company. For example, a portal site within a company, that is, a corporate portal, must meet the following requirements.
・Automatically collect the enormous volume of documents generated in real time inside and outside the company.
・Automatically perform semantic analysis and categorization.
・Collect documents and feed the classified results to an appropriate place on the screen (tailored to each person).

[0012] Of these requirements, document collection calls for selecting and collecting documents from the viewpoint of relevance to the business, rather than indiscriminately collecting the enormous number of documents inside and outside the company. Relevance to the business is somewhat different from having a specific semantic content or having importance, because in the intranet community of a company of a certain size the contents of documents are semantically diverse. Moreover, among documents outside the company (for example, on the Internet), information about hobbies is also highly popular, and such information is not necessarily useful for a corporate portal.

[0013] However, the frameworks used in conventional document collection, such as preferential acquisition of the latest information, of information in a specific field, or of highly popular information, have the problem that they also collect documents that are generally important but not necessarily useful to the community in question, such as the hobby information mentioned above.

[0014] Further, when documents are collected by the conventional method of "collecting only documents with similar semantic content" described above, each approach has the following problems.
・Simply limiting the number of levels in advance is easy to process, but there is no guarantee that documents that are truly close in meaning are collected preferentially, or that important documents are not missed.
・Methods that compare the contents of documents to determine whether their semantic content is close generally use natural language processing to analyze the body text of each document, extract keywords, and compare the similarity of the extracted keywords. This processing takes time: at best, only about 100 documents can be processed per second. It is therefore difficult to process documents said to number in the billions one by one within a realistic time. Even when that time is spent, the accuracy is only about 70 to 80%. Furthermore, because this processing depends heavily on the language, a determination tool must be provided for each language.
・Even when the decision to collect is based on the reference expression, the character strings used in reference expressions include many stock phrases such as "Home page", "Back to top", and "Click here", and they do not necessarily represent the semantic content of the reference destination.

[0015] In view of the above problems, the problem to be solved by the present invention is to make it possible to collect documents suited to an application quickly and accurately, independently of language.

[0016]

[Means for Solving the Problems] The present invention presupposes a device or method that collects documents from a network. A device according to each aspect of the present invention is a document collection device that collects documents from a network and comprises: next candidate determination means that determines, based on the reference relations of the collected document group, next collection candidates, which are candidates for the documents to be collected next; and document collection means that collects the next collection candidates from the network and adds them to the collected document group. The determination of next collection candidates by the next candidate determination means and the collection of documents by the document collection means are repeated until the number of documents in the collected document group reaches a certain number or more.
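The repeat-until-enough loop described in this paragraph can be sketched as follows. This is a simplified illustration, not the claimed implementation: the names `collect`, `all_referenced`, `network`, and `choose_next` are hypothetical, and an in-memory dictionary stands in for fetching documents from a network and extracting their anchors.

```python
def collect(network, seeds, target_size, choose_next):
    """Grow the collected document group until it holds target_size documents.

    network: dict URL -> list of referenced URLs (stands in for fetch + parse).
    choose_next(links, collected): returns the URLs to try next, playing the
    role of the next candidate determination means.
    """
    collected = {url for url in seeds if url in network}
    links = {url: network[url] for url in collected}
    while len(collected) < target_size:
        candidates = [
            url for url in choose_next(links, collected)
            if url in network and url not in collected
        ]
        if not candidates:
            break  # nothing reachable is left, so stop early
        for url in candidates:  # role of the document collection means
            collected.add(url)
            links[url] = network[url]
    return collected


def all_referenced(links, collected):
    """Simplest possible policy: every uncollected URL that is referenced."""
    referenced = {t for targets in links.values() for t in targets}
    return sorted(referenced - collected)


network = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
print(sorted(collect(network, ["s"], target_size=4, choose_next=all_referenced)))
```

Because a whole batch of candidates is added in each round, the final group may exceed `target_size`, which matches the "a certain number or more" stopping condition. Any more selective policy, such as the rankings described in the following paragraphs, can be passed in place of `all_referenced`.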

[0017] The above device may be configured as a document collection device for a community, which collects documents that are highly useful to a community on a network. To that end, in the above configuration, after the document collection means has collected documents evenly from within the community on the network, the next candidate determination means may determine next collection candidates from documents inside and outside the community based on the reference relations of the collected document group.

[0018] Collecting documents evenly from within the community before collecting documents from inside and outside it yields information about the documents in the diverse fields that the community needs. Using the reference relations of this document group, which covers diverse fields, to collect documents from inside and outside the community makes it possible to accurately collect documents that are highly useful to the community. In addition, because the content of the document body is not analyzed, such documents can be collected quickly and independently of language.

[0019] The above configuration may further comprise ranking means that calculates an importance based on the reference relations of the collected document group and on information indicating the location of each document on the network, for example its URL, and the next candidate determination means may determine next collection candidates based on the reference relations and the importance.

[0020] In the above document collection device for a community, the ranking means may rank documents separately inside and outside the community based on the importance, and the next candidate determination means may take the highly ranked documents inside the community and outside the community, respectively, as the next collection candidates. This prevents the next collection candidates from being concentrated inside or outside the community, so that documents are not collected from only one of the two.
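A minimal sketch of selecting highly ranked candidates separately inside and outside the community is given below. The importance scores are taken as given, and the function name `pick_next_candidates`, the `per_side` quota, and the example URLs are assumptions made for illustration only.

```python
def pick_next_candidates(scores, is_in_community, per_side):
    """Return the top `per_side` URLs inside and outside the community.

    scores: dict URL -> importance value (higher means more important).
    is_in_community: function URL -> bool.
    """
    def by_rank(url):
        return (-scores[url], url)

    inside = sorted((u for u in scores if is_in_community(u)), key=by_rank)
    outside = sorted((u for u in scores if not is_in_community(u)), key=by_rank)
    return inside[:per_side], outside[:per_side]


scores = {
    "http://intra.example.com/a": 5,
    "http://intra.example.com/b": 1,
    "http://intra.example.com/c": 3,
    "http://news.example.org/x": 9,
    "http://news.example.org/y": 2,
}
inside, outside = pick_next_candidates(
    scores, lambda url: "intra.example.com" in url, per_side=2
)
print(inside, outside)
```

Ranking the two sides independently means that a few very high scores outside the community cannot crowd out every candidate inside it, and vice versa.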

[0021] The document collection device for a community may further comprise presentation means that presents the results of searching the collected document group separately for inside and outside the community. This allows a client belonging to the community to obtain document search results divided into inside and outside the community.

[0022] The document collection device for a community may further comprise community determination means that determines whether a document is a document within the community based on information indicating the location of the document on the network, for example its URL. Making the determination from the location information allows it to be performed quickly.
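One way to decide community membership from the URL alone is to compare the host name with a list of community domains. The sketch below is an assumption made for illustration: the source states only that the determination is based on location information such as the URL, and the function name `is_community_url` and the domain-suffix rule are hypothetical.

```python
from urllib.parse import urlsplit


def is_community_url(url, community_domains):
    """Return True if the URL's host is a community domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in community_domains
    )


domains = ["example.co.jp"]
print(is_community_url("http://www.example.co.jp/index.html", domains))
print(is_community_url("http://www.example.org/index.html", domains))
```

No document body needs to be fetched or parsed for this check, which is why a URL-based determination can be fast.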

[0023] The above document collection device that collects documents from a network may also be configured as a document collection device for a specific field, which collects documents related to a specific field. To that end, according to yet another aspect of the present invention, in a device that collects documents from a network, a positive example document group, which is a group of documents related to the specific field, and a negative example document group, which is a group of documents related to fields that have little relation to the specific field, are given as the collected document group before collection. The document collection means adds the collected next collection candidates to the positive example document group, and the determination of next collection candidates by the next candidate determination means and the collection by the document collection means are repeated until the number of documents in the positive example document group within the collected document group reaches a certain number or more. This makes it possible to collect documents related to the specific field quickly, based on reference relations and without analyzing the content of the document body.

[0024] The above document collection device for a specific field may further comprise reference degree calculation means that calculates, based on the reference relations of the collected documents, a reference degree, which is the degree to which a document is referenced only from documents in the positive example document group, and the next candidate determination means may determine documents with a high reference degree as next collection candidates. The device may also further comprise co-reference degree calculation means that calculates, based on the reference relations of the collected documents, a co-reference degree for each document that is referenced from collected documents that reference documents in the positive example document group; the co-reference degree indicates the degree to which that document is referenced from the collected document group. The next candidate determination means may then determine documents with a high co-reference degree as next collection candidates. Using the reference degree and the co-reference degree makes it possible to collect documents in the desired field quickly, without examining the content of the document body.
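The two measures can be illustrated with a small sketch. The formulas below are assumptions chosen for illustration, since this paragraph describes the measures only in words: here the reference degree is taken as the share of referring collected documents that are positive examples, and the co-reference degree as the number of collected documents that reference both the candidate and at least one positive example. The function names are hypothetical.

```python
def reference_degree(candidate, links, positive):
    """Share of the collected documents referencing `candidate` that are positive examples."""
    referrers = {src for src, targets in links.items() if candidate in targets}
    if not referrers:
        return 0.0
    return len(referrers & positive) / len(referrers)


def co_reference_degree(candidate, links, positive):
    """Number of collected documents that reference both `candidate` and a positive example."""
    return sum(
        1
        for targets in links.values()
        if candidate in targets
        and any(t in positive for t in targets if t != candidate)
    )


positive = {"p1", "p2"}  # positive example documents; n1 and n2 are negative examples
links = {"p1": ["x", "y"], "p2": ["x", "p1"], "n1": ["y", "z"], "n2": ["p1", "z"]}
for url in ("x", "y", "z"):
    print(url, reference_degree(url, links, positive), co_reference_degree(url, links, positive))
```

Under these assumed definitions, `x` is referenced only by positive examples and so has the highest reference degree, while `z` has a reference degree of zero but is still co-referenced once alongside the positive example `p1`.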

[0025] The above document collection device for a specific field may also target a plurality of fields and collect documents on each field simultaneously. To that end, in the above device, the collected document group given before collection is the union of the document groups for the plurality of fields, and when documents are collected with the document group for one field as the positive example document group, the union of the document groups for the remaining fields is used as the negative example document group.
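The construction of positive and negative example groups for several fields can be sketched as follows. The function name `positive_negative_sets` and the field names in the example are hypothetical.

```python
def positive_negative_sets(field_docs):
    """For each field, pair its documents with the union of all other fields' documents.

    field_docs: dict field name -> set of seed document URLs.
    Returns: dict field name -> (positive set, negative set).
    """
    result = {}
    for field, docs in field_docs.items():
        negative = set().union(
            *(other for name, other in field_docs.items() if name != field)
        )
        result[field] = (set(docs), negative)
    return result


seeds = {"sports": {"s1", "s2"}, "finance": {"f1"}, "travel": {"t1"}}
for field, (pos, neg) in positive_negative_sets(seeds).items():
    print(field, sorted(pos), sorted(neg))
```

With this arrangement no separate negative example group has to be prepared: each field's seed documents serve as negative examples for every other field.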

[0026] Each document collection device may further comprise grouping means that groups the collected document group based on the reference expressions used in the collected documents. Some reference expressions indicate that the referenced document and the referring document have the same content but are stored in separate pieces on the network. Examples of such reference expressions are "次へ" (next), "Next", "前へ" (previous), and "Prev". The grouping means combines two or more documents that are linked by such reference expressions into one.
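Grouping documents that are chained by such reference expressions amounts to finding connected components. The sketch below uses a union-find structure; the function name `group_documents`, the triple representation of anchors, and the exact matching rule (case-insensitive comparison against a fixed set) are assumptions made for illustration.

```python
SEQUENCE_EXPRESSIONS = {"next", "prev", "次へ", "前へ"}


def group_documents(anchors):
    """Group documents linked by sequence expressions such as "Next" or "Prev".

    anchors: iterable of (source_url, anchor_text, target_url) triples.
    Returns a sorted list of groups, each a sorted list of URLs.
    """
    parent = {}

    def find(url):
        parent.setdefault(url, url)
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path halving
            url = parent[url]
        return url

    for source, text, target in anchors:
        find(source)
        find(target)
        if text.strip().lower() in SEQUENCE_EXPRESSIONS:
            parent[find(source)] = find(target)  # merge the two groups

    groups = {}
    for url in parent:
        groups.setdefault(find(url), set()).add(url)
    return sorted(sorted(group) for group in groups.values())


anchors = [
    ("p1", "Next", "p2"),
    ("p2", "Next", "p3"),
    ("p2", "Prev", "p1"),
    ("p1", "Home", "top"),
]
print(group_documents(anchors))
```

Pages `p1`, `p2`, and `p3` end up in one group, while `top` stays separate because "Home" is not treated as a sequence expression.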

[0027] Each document collection device may further comprise keyword assignment means that assigns keywords to collected documents, that is, documents in the collected document group, based on the reference expressions used in the collected documents. This makes it possible to assign keywords without analyzing the meaning of the document body, and also to use the various alternative names of a keyword as keywords.

[0028] The keyword assignment means may refrain from using a reference expression as a keyword when it is an expression that is used regardless of the referenced document. Examples of such reference expressions are "トップへ戻る" (back to top) and "ホームへ" (to home).

[0029] The keyword assignment means may count the number of distinct documents that a reference expression refers to and, when that number is equal to or greater than a certain number, refrain from using the reference expression as a keyword. This is because such a reference expression is likely to be one that is used regardless of the referenced document.
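The distinct-target filter described in this paragraph can be sketched as follows. The function name `assign_keywords`, the pair representation of anchors, and the threshold value in the example are assumptions made for illustration.

```python
from collections import defaultdict


def assign_keywords(anchors, max_distinct_targets):
    """Assign anchor texts as keywords to the documents they reference.

    anchors: iterable of (anchor_text, target_url) pairs.
    An expression that refers to max_distinct_targets or more distinct
    documents is treated as generic and is not used as a keyword.
    """
    targets_by_expression = defaultdict(set)
    for text, target in anchors:
        targets_by_expression[text.strip()].add(target)

    keywords = defaultdict(set)
    for expression, targets in targets_by_expression.items():
        if not expression or len(targets) >= max_distinct_targets:
            continue  # likely used regardless of the referenced document
        for target in targets:
            keywords[target].add(expression)
    return dict(keywords)


anchors = [
    ("Back to top", "a"), ("Back to top", "b"), ("Back to top", "c"),
    ("Fujitsu", "a"), ("富士通", "a"), ("Patent search", "b"),
]
print(assign_keywords(anchors, max_distinct_targets=3))
```

In this example "Back to top" points at three different documents and is dropped, while both "Fujitsu" and "富士通" survive as keywords for the same document, which shows how alternative names of a keyword can be picked up without analyzing the document body.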

[0030] When the number of distinct documents that a reference expression refers to is less than the certain number, the keyword assignment means may further count the reference count, which is the number of times each collected document is referenced by that reference expression, and determine whether to use the reference expression as a keyword based on the number of distinct documents and the reference count.

[0031] The keyword assignment means may combine the keywords based on reference expressions with keywords extracted from the body of the collected document and keywords extracted from the information indicating the location of the collected document on the network. This makes it possible to combine keywords extracted by a variety of methods.

[0032] The problems described above can also be solved by a method consisting of the steps of the processing performed by each configuration of the present invention. They can likewise be solved by a program that causes a computer to perform the same control as the functions of each configuration described above, when that program is executed by the computer, and by a computer-readable recording medium on which the program is recorded, when the program is read from the recording medium into a computer and executed.

[0033]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. The present invention relates to a document collection device that collects documents suited to an application from a network. The following description covers the case in which documents are written in HTML (HyperText Markup Language), but this is not intended to limit the present invention or to restrict the language to HTML. Any markup language that describes the structure of a document may be used, such as XML (eXtensible Markup Language) or XSL (eXtensible Stylesheet Language). Likewise, URLs (Uniform Resource Locators) are used in the description as the information indicating the location of a document on the network, but this is not intended to limit the present invention. Any information that indicates the location of a document on the network may be used in place of a URL. URLs are a subset of URIs (Uniform Resource Identifiers) and are currently in wide use on networks.

[0034] FIG. 1 shows the principle of the present invention. As shown in FIG. 1, the document collection device 1 is connected to a network such as the Internet or an intranet. The document collection device 1 comprises document collection means 2, reference relation extraction means 3, community determination means 4, next candidate determination means 5, ranking means 6, URL determination means 7, reference degree/co-reference degree calculation means 8, grouping means 9, and keyword extraction means 10. The components drawn with dotted lines in FIG. 1, namely the community determination means 4 and the reference degree/co-reference degree calculation means 8, are used in some embodiments and not in others. Similarly, the arrow drawn with a dotted line, that is, the document ranking result produced by the ranking means 6, is used for the determination of next collection candidates by the next candidate determination means 5 in some embodiments and not in others.

[0035] A document collection device according to one embodiment of the present invention collects documents for a community on a network. To that end, the document collection device for a community according to this embodiment comprises the document collection means 2, the reference relation extraction means 3, the community determination means 4, the next candidate determination means 5, the ranking means 6, the grouping means 9, and the keyword assignment means 10. The document collection device for a community first collects documents evenly from within the community and then collects documents that are highly useful to the community from inside and outside the community.

[0036] The reference relation extraction means 3 extracts reference relations from the collected document group 20 and produces the inter-document reference relations 22. At the start of collection, an initial document group is given in advance as the collected document group 20. The community determination means 4 determines whether each uncollected document that is referenced from the collected document group 20 is a document within the community.

[0037] The next candidate determination means 5 determines the uncollected documents within the community that are referenced from the collected document group 20 as the next collection candidates 21. The document collection means 2 collects the documents determined as the next collection candidates 21 and adds the newly collected documents (the newly collected document group) to the collected document group 20 to form a new collected document group 20. The document collection means 2 then determines whether the number of documents in the collected document group 20 is equal to or greater than a prescribed value. If the number is smaller than the prescribed value, the processing of collecting documents from within the community is repeated as described above. Collecting at least the prescribed number of documents evenly from within the community in this way yields information about the diverse fields to which the documents in the community belong. This information is then used to collect documents that are highly useful to the community from inside and outside the community.

[0038] When the number of documents in the collected document group 20 is equal to or greater than the prescribed value, documents that are highly useful to the community are next collected from inside and outside the community. The reference relation extraction means 3 extracts reference relations from the newly collected document group, and the community determination means 4 determines whether each referenced document that has not yet been collected is a document within the community. The ranking means 6 ranks the uncollected documents referenced from the collected documents, separately for inside and outside the community, based on the reference relations and on features of the information indicating the location of each document on the network, for example its URL. The ranking means 6 includes the URL determination means 7, which determines the similarity between the URL character strings of the referenced document and the referring document. The ranking means 6 ranks the documents based on the URL string similarity determined by the URL determination means 7.

[0039] The next candidate determination means 5 determines the uncollected documents ranked highest inside the community and outside it, respectively, as the next collection candidates 21, that is, the documents to be collected from the network next, and the document collection means 2 collects the documents determined to be next collection candidates 21. In this way, the document collection device for a community according to one embodiment of the present invention collects documents that are highly useful to the community in multiple stages. Once at least a prescribed number of documents has been collected from inside and outside the community, the grouping means 9 groups the collected documents 20 based on reference expressions. The keyword assignment means 10 assigns keywords to the collected documents 20 based on the reference expressions and their frequencies of occurrence. The ranking means 6 then ranks the collected documents 20 in the manner described above. The collected documents 20, finally grouped, assigned keywords, and ranked, are stored as a collected document file 23. As described above, because the document collection device for a community does not analyze the content of document bodies, it can quickly collect documents suited to the purpose, independent of language.

[0040] A document collection device according to another embodiment of the present invention collects documents relating to a specific field. For this purpose, the document collection device for a specific field includes the document collection means 2, the reference relation extraction means 3, the next candidate determination means 5, the ranking means 6, a reference degree/co-reference degree calculation means 8, the grouping means 9, and the keyword assignment means 10. Because the document collection device for a specific field does not need to distinguish between documents inside and outside a community, it performs no processing related to community determination.

[0041] In the document collection device for a specific field, before collection begins, a group of documents relating to the specific field is given as a positive example document group, and a group of documents with little relation to the field is given as a negative example document group. The collected document group 20 is the union of the positive and negative example document groups. The reference degree/co-reference degree calculation means 8 calculates the degree to which a document relates to the specific field, as a reference degree and a co-reference degree, based on the reference relations between that document and the positive example document group and between that document and the negative example document group. Instead of using the ranking produced by the ranking means 6, the next candidate determination means 5 determines uncollected documents with a high reference degree or co-reference degree, as calculated by the reference degree/co-reference degree calculation means 8, as next collection candidates. In addition, among the collected documents 20 in the negative example document group, documents with a high reference degree or co-reference degree are removed from the negative example document group and added to the positive example document group. The document collection means 2 collects the documents determined to be next collection candidates 21 and adds them to the positive example document group. The determination of next collection candidates and the collection of documents are repeated until the number of documents in the positive example document group reaches or exceeds a prescribed number. The other operations are as described above.

[0042] The following describes the first embodiment: a document collection device for a community that collects documents highly useful to the community. Examples of the network community described in the first embodiment of the present invention include an in-house site, an industry site, and a network user group devoted to a specific topic. An in-house site is typically an intranet. An industry site is typically an extranet made up of the systems of several companies. A document collection device that collects the documents needed for an in-house site can be applied to an in-company intranet portal, also called a corporate portal or EIP (Enterprise Information Portal).

[0043] A community portal is required to automatically collect documents, giving priority to those that are highly useful to the community. A corporate portal, for example, needs to collect business-related documents automatically. The first embodiment of the present invention realizes this kind of automatic document collection. To that end, the document collection device according to the first embodiment adopts the following idea.
・A document that is highly useful to a particular community is one that is frequently referenced by many of the documents in that community, or one that is referenced by important documents in the community.
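The first half of this idea, that a useful document is one referenced by many community documents, can be illustrated with a simple count of incoming references. This is a minimal sketch and not the importance calculation of the embodiment; the function name `count_community_references`, the `(source, target)` input format, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

def count_community_references(links, community_docs):
    """Count, for each target, how many distinct community documents reference it.

    links: iterable of (source, target) pairs (hypothetical input format).
    community_docs: set of document identifiers inside the community.
    """
    counts = Counter()
    seen = set()
    for source, target in links:
        # Only references made by community documents count, once per pair.
        if source in community_docs and (source, target) not in seen:
            seen.add((source, target))
            counts[target] += 1
    return counts

links = [("a", "x"), ("b", "x"), ("a", "x"), ("c", "y"), ("z", "y")]
counts = count_community_references(links, community_docs={"a", "b", "c"})
```

Here `x` is referenced by two community documents and `y` by one; the reference from `z` is ignored because `z` is outside the community.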

[0044] FIG. 2 shows the configuration of the document collection device according to the first embodiment. As shown in FIG. 1, the document collection device 100 includes a document collection unit 101, a reference relation extraction unit 102, a community determination unit 103, a next candidate determination unit 104, a ranking unit 105, a grouping unit 106, and a keyword assignment unit 107.

[0045] As described above, the document collection device 100 first collects documents within the community over multiple rounds, and then collects documents both inside and outside the community, again over multiple rounds. Collecting documents in multiple stages and multiple rounds in this way is one of the features of the document collection device 100.

[0046] Before collection starts, an initial document group is given as the collected document group S. This initial document group is the starting point of the collection. Examples of an initial document group include the top page of a site and a collection of references (a link collection) on the top page. Concretely, the collected document group S, or the initial document group, is held in the document collection device 100 as a URL table 120.

[0047] Next, the reference relation extraction unit 102 extracts reference relations from the collected document group S, stores the URLs of the documents referenced by the collected document group S (hereinafter, referenced documents) in the URL table 120, and stores the extracted reference relations in a reference relation table 121. The community determination unit 103 determines, based on the URL, whether each referenced document of the collected document group S extracted by the reference relation extraction unit 102 is inside or outside the community, and stores the result in the reference relation table 121.

[0048] The document collection device 100 first collects documents within the community one or more times, and does so evenly. Among the referenced documents of the collected document group S extracted by the reference relation extraction unit 102, the next candidate determination unit 104 determines those that are in the community and have not yet been collected to be candidates for the documents to collect next (hereinafter, next collection candidates N). The document collection unit 101 collects the documents determined to be next collection candidates N and adds them to the collected document group, forming a new collected document group S. Collection within the community continues until a prescribed number of documents has been collected. Not all documents in the community need to be collected; roughly one quarter to one half of them is sufficient. Collecting documents evenly from within the community provides information on the fields of documents that are useful within the community.

[0049] After the document collection unit 101 has collected the prescribed number of documents within the community, the document collection device 100 next collects documents from both inside and outside the community, again one or more times. In this case, the document collection unit 101 collects documents as described above, and the reference relation extraction unit 102 and the community determination unit 103 store information in the URL table 120 and the reference relation table 121. The ranking unit 105 then assigns an importance to each referenced document based on the reference relations and the document URLs, and ranks the referenced documents by that importance.

[0050] Based on the ranking produced by the ranking unit 105, the next candidate determination unit 104 determines the following uncollected referenced documents to be next collection candidates N: the documents ranked in the top n1 among documents in the community, and the documents ranked in the top n2 among documents outside the community. Determining the next collection candidates N separately for inside and outside the community prevents the collected documents from being skewed toward either one.
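The selection of the top n1 in-community and top n2 out-of-community documents can be sketched as follows. This is only an illustration under assumed inputs: the function `select_next_candidates`, the tuple layout, and the sample importance values are hypothetical, and the importance values are taken as given rather than computed.

```python
def select_next_candidates(ranked, n1, n2):
    """Pick the top n1 in-community and top n2 out-of-community documents.

    ranked: list of (url, importance, in_community) tuples for uncollected
    referenced documents (a hypothetical stand-in for rows of the URL table).
    """
    by_importance = sorted(ranked, key=lambda row: row[1], reverse=True)
    inside = [url for url, _, in_comm in by_importance if in_comm][:n1]
    outside = [url for url, _, in_comm in by_importance if not in_comm][:n2]
    return inside + outside

ranked = [("in/a", 0.9, True), ("in/b", 0.5, True), ("in/c", 0.7, True),
          ("out/x", 0.8, False), ("out/y", 0.95, False)]
candidates = select_next_candidates(ranked, n1=2, n2=1)
```

Keeping two separate quotas means a highly ranked outside document such as `out/x` cannot displace in-community candidates, and vice versa.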

[0051] Then, as in the collection within the community, the document collection unit 101 collects the next collection candidates N from inside and outside the community and adds the collected documents to the collected document group, forming a new collected document group S. The document collection device 100 repeats collection from inside and outside the community until a prescribed number of documents has been collected.

[0052] After the document collection unit 101 has collected the prescribed number of documents from inside and outside the community, the collected documents are screened. Screening is performed by the grouping unit 106, the keyword assignment unit 107, and the ranking unit 105. First, based on the character strings used in a document to refer to other documents (also called reference expressions), the grouping unit 106 groups together collected documents that have the same content but have been split across multiple documents.

[0053] The keyword assignment unit 107 determines keywords based on the reference expressions in documents and assigns them to the documents. More specifically, the keyword assignment unit 107 first excludes reference expressions such as "Back to top" and "Home" that are commonly used regardless of the content of the referenced document. It then counts, for each reference expression, the number of distinct documents that the expression refers to, and stores the count in a reference expression table 122 (not shown in FIG. 2). It also counts, for each collected document, how often the document is referred to by each reference expression, and stores the count in a reference count table 123 (not shown in FIG. 2). Based on these counts, the keyword assignment unit 107 calculates the weight of each reference expression for each collected document, and assigns a certain number of reference expressions to each collected document as keywords, in descending order of weight.
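The keyword assignment steps can be sketched as below. The passage above does not state the weight formula, so the TF * log(N / DF) weight used here is an assumption borrowed from conventional term weighting, not the embodiment's formula. The names `assign_keywords` and `STOP_EXPRESSIONS` and the sample counts are likewise hypothetical.

```python
import math

STOP_EXPRESSIONS = {"Back to top", "Home"}  # generic navigation text to exclude

def assign_keywords(tf, df, total_docs, k):
    """Return the k highest-weighted reference expressions per document.

    tf: {doc_id: {expression: times the document is referred to by it}}
    df: {expression: number of distinct documents the expression refers to}
    The TF * log(N / DF) weight below is an assumed formula.
    """
    keywords = {}
    for doc_id, expr_counts in tf.items():
        weights = {
            expr: count * math.log(total_docs / df[expr])
            for expr, count in expr_counts.items()
            if expr not in STOP_EXPRESSIONS
        }
        ranked = sorted(weights, key=weights.get, reverse=True)
        keywords[doc_id] = ranked[:k]
    return keywords

tf = {"doc1": {"patent search": 19, "Home": 40, "news": 2}}
df = {"patent search": 1, "Home": 100, "news": 50}
keywords = assign_keywords(tf, df, total_docs=100, k=1)
```

With these sample counts, "Home" is dropped as a generic expression and "patent search" outweighs "news", so it becomes the single keyword for `doc1`.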

[0054] The ranking unit 105 assigns an importance to each document based on the reference relations and the document URLs, and ranks the documents by that importance. In this way, the document collection device 100 for a community according to the present embodiment collects, groups, assigns keywords to, and ranks documents based on reference relations and URLs, without analyzing the content of document bodies.

[0055] As described above, the document collection device 100 provides the grouped, keyword-assigned, and ranked documents as quality content 130. The quality content 130 may be provided as an index 141 via a search engine 140, provided to a server 160 via the search engine 140, or organized into a directory by a classification engine 150 and provided to the server 160. Clients of the server 160 can view the quality content 130 provided to the server 160 through a browser 170.

[0056] The data structure of each table is described below with reference to FIGS. 3 to 6. FIG. 3 shows an example of the data structure of the URL table 120. As shown in FIG. 3, the URL table stores, for each document, a document ID (identification information) that identifies the document, the URL of the document, a collected flag indicating whether the document has been collected, a community flag indicating whether the document is in the community, and the importance of the document. The document ID and URL are stored when the reference relation extraction unit 102 extracts the referenced documents of a collected document. The collected flag is set to "on (1)" when the document collection unit 101 collects the document. The community flag is set to "on (1)" when the community determination unit 103 determines that the document is in the community. The importance is calculated and stored by the ranking unit 105 based on the document's reference relations and the string features of its URL.
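One way to picture a row of the URL table is as a small record type. The sketch below is illustrative only: the class name `UrlTableRow`, the field names, and the `www.example.com` entry are assumptions, and FIG. 3 itself is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class UrlTableRow:
    doc_id: str
    url: str
    collected: bool = False      # collected flag, "on (1)" once fetched
    in_community: bool = False   # community flag
    importance: float = 0.0      # filled in by the ranking step

url_table = {
    "doc1": UrlTableRow("doc1", "http://www.fujitsu.co.jp/", collected=True, in_community=True),
    "doc2": UrlTableRow("doc2", "http://www.example.com/"),
}

# Counting rows whose collected flag is on gives the size of the collected group.
collected_count = sum(row.collected for row in url_table.values())
```

Defaulting both flags to off mirrors the described behavior that a newly extracted referenced document starts out uncollected.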

[0057] FIG. 4 shows an example of the data structure of the reference relation table 121. As shown in FIG. 4, the reference relation table 121 stores information on the reference relations between documents. More specifically, it stores a referencing document ID, which is the document ID of the referencing document; referenced document ID₁, which is the document ID of a document in the community referenced by the referencing document; and referenced document ID₂, which is the document ID of a document outside the community referenced by the referencing document. This information is stored by the reference relation extraction unit 102.
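The split between in-community and out-of-community referenced documents could be held as two lists per referencing document. This is a hedged sketch: `ReferenceRelationRow`, `add_reference`, and the list-based layout are assumptions, and the actual layout in FIG. 4 may differ.

```python
from dataclasses import dataclass, field

@dataclass
class ReferenceRelationRow:
    source_id: str                                        # referencing document ID
    inside_targets: list = field(default_factory=list)    # referenced document ID1 entries
    outside_targets: list = field(default_factory=list)   # referenced document ID2 entries

def add_reference(table, source_id, target_id, target_in_community):
    """Record one reference, filing the target under the inside or outside column."""
    row = table.setdefault(source_id, ReferenceRelationRow(source_id))
    column = row.inside_targets if target_in_community else row.outside_targets
    if target_id not in column:
        column.append(target_id)

relations = {}
add_reference(relations, "doc1", "doc2", target_in_community=True)
add_reference(relations, "doc1", "doc9", target_in_community=False)
add_reference(relations, "doc1", "doc2", target_in_community=True)
```

Recording the same reference twice leaves a single entry, which is a design choice of this sketch rather than something the text specifies.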

[0058] FIG. 5 shows an example of the data structure of the reference expression table 122. As shown in FIG. 5, the reference expression table 122 stores information on how frequently each reference expression is used in the collected documents. More specifically, for each reference expression it stores an expression ID that identifies the expression, the reference expression itself (a character string), the document frequency DF(w), which is the number of distinct documents the expression refers to, and a necessity flag indicating whether the expression should be used as a keyword. All of this information is stored by the keyword assignment unit 107.

[0059] FIG. 6 shows an example of the data structure of the reference count table 123. As shown in FIG. 6, the reference count table 123 stores the reference expression frequency TF(d, w), which is the number of times each collected document is referred to by each reference expression. All of this information is stored by the keyword assignment unit 107. For example, when following the link embedded in a reference expression rw1 in some document leads to a referenced document doc2, TF(doc2, rw1) for the referenced document doc2 is incremented by 1. FIG. 6 indicates that the document with document ID doci is referred to TF(doci, rwj) times by the reference expression with expression ID rwj. For example, FIG. 6 shows that the document with document ID doc1 is referred to 19 times by the reference expression with expression ID rw1.
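The increment rule for TF(d, w) can be illustrated with a nested counter. This is a minimal sketch; the nested-dictionary layout and the function name `record_reference` are assumptions, while the identifiers doc1, doc2, and rw1 follow the example in the text.

```python
from collections import defaultdict

# tf[doc_id][expression_id] plays the role of TF(d, w).
tf = defaultdict(lambda: defaultdict(int))

def record_reference(tf, target_doc_id, expression_id):
    """Increment TF(target, expression) for one link that uses this expression."""
    tf[target_doc_id][expression_id] += 1

for _ in range(19):
    record_reference(tf, "doc1", "rw1")
record_reference(tf, "doc2", "rw1")
```

After these calls `tf["doc1"]["rw1"]` is 19, matching the FIG. 6 example, and `tf["doc2"]["rw1"]` is 1.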

[0060] The following describes the method, implemented by the document collection device according to the first embodiment, of collecting documents that are highly useful to a particular community. The description uses the following notation.
・LT(S) denotes the group of documents referenced by the document group S.
・X − Y denotes the set difference of set X and set Y.
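The two pieces of notation map directly onto set operations. In the sketch below, the dictionary-based link graph and its contents are hypothetical; only the meaning of LT(S) and of the set difference comes from the text.

```python
def LT(S, links):
    """Return the set of documents referenced by any document in S.

    links: {document: set of documents it references} (hypothetical link graph).
    """
    referenced = set()
    for doc in S:
        referenced |= links.get(doc, set())
    return referenced

links = {"a": {"b", "c"}, "b": {"a", "d"}}
S = {"a", "b"}
uncollected = LT(S, links) - S   # LT(S) − S: referenced but not yet collected
```

Here LT(S) is {a, b, c, d}, so LT(S) − S leaves {c, d}, the referenced documents that are not yet in S.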

[0061] First, the overall flow of the process of collecting documents for a particular community is described with reference to FIG. 7. At the start of collection, documents in the community are given as the initial document group (the document group that serves as the starting point of collection) of the collected document group S.

[0062] Based on the reference relations extracted by the reference relation extraction unit 102 and on the community determination unit 103's determination of whether each referenced document is in the community, the next candidate determination unit 104 extracts the next collection candidates N (step S1). The process of extracting the next collection candidates N is described in detail later.

[0063] Next, the document collection unit 101 collects the next collection candidates N based on the URLs stored in the URL table 120 (step S2) and turns on the collected flag for each collected document. The document collection unit 101 thereby adds the newly collected next collection candidates N to the collected document group S. That is, the document group given by S ∪ N becomes the new collected document group S.

[0064] The document collection unit 101 determines whether the number of documents in the collected document group S is equal to or greater than a prescribed number (step S3). It does so by counting the documents whose collected flag in the URL table 120 is "on (1)". If the number of documents in the collected document group S is less than the prescribed number (step S3: No), the next candidate determination unit 104 determines the next collection candidates again (step S4), and the process returns to step S2. In the second and subsequent determinations, the next candidate determination unit 104 extracts the in-community documents among the uncollected referenced documents as next collection candidates N, based on the reference relations that the reference relation extraction unit 102 extracts from the documents newly collected in the current round (hereinafter, newly collected documents) and on the community determination unit 103's determination of whether the documents referenced by the newly collected documents are in the community. Because step S4 is the same as step S1, it is described together with step S1 below.

[0065] If the number of documents in the collected document group S is equal to or greater than the prescribed number (step S3: Yes), the next candidate determination unit 104 next determines the next collection candidates from documents both inside and outside the community. To do this, the reference relation extraction unit 102 first extracts the reference relations of the newly collected documents, and the community determination unit 103 determines whether each document referenced by the newly collected documents is in the community. The ranking unit 105 then assigns an importance to the collected documents and their referenced documents, that is, to S ∪ LT(S), and ranks the uncollected referenced documents, that is, LT(S) − S, by importance (step S5). Step S5 is described in detail later.

[0066] Next, the next candidate determination unit 104 takes as the next collection candidates N those documents in LT(S) − S that are in the top n1 of the ranking of in-community documents, together with those in the top n2 of the ranking of out-of-community documents (step S6). Extracting the next collection candidates N separately for inside and outside the community prevents the collected documents from being skewed toward either one.

[0067] The document collection unit 101 collects the next collection candidates N based on the URLs stored in the URL table 120 (step S7) and sets the collected flag of each collected document to "on (1)". The document collection unit 101 then determines whether the number of documents in the collected document group S is equal to or greater than a prescribed number by counting the documents whose collected flag in the URL table 120 is "on (1)" (step S8).

[0068] If the number of documents in the collected document group S is less than the prescribed number (step S8: No), the process returns to step S5. If it is equal to or greater than the prescribed number (step S8: Yes), the ranking unit 105, the grouping unit 106, and the keyword assignment unit 107 screen the documents in the collected document group S (step S9). Step S9 is described in detail later.
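The overall flow of steps S1 to S8 can be sketched as a two-stage loop over a toy link graph. This is an illustration under several assumptions, not the embodiment's implementation: the function `collect`, its parameters, the in-memory `links` graph, and the fixed importance scores are hypothetical, and the early exit when no candidates remain is an addition of this sketch that the text does not describe.

```python
def collect(initial, links, in_community, importance, inner_target, total_target, n1, n2):
    """Two-stage collection loop over a toy link graph.

    links: {doc: set of referenced docs}; in_community: predicate on a doc;
    importance: scoring function standing in for the ranking unit.
    """
    S = set(initial)

    def uncollected():
        # LT(S) − S: referenced by the collected group but not yet collected.
        return set().union(*(links.get(d, set()) for d in S)) - S

    # Stage 1 (steps S1 to S4): collect in-community documents only.
    while len(S) < inner_target:
        N = {d for d in uncollected() if in_community(d)}
        if not N:          # assumed early exit: nothing left to fetch
            break
        S |= N             # step S2: S becomes S ∪ N

    # Stage 2 (steps S5 to S8): rank, then take top n1 inside and top n2 outside.
    while len(S) < total_target:
        ranked = sorted(uncollected(), key=importance, reverse=True)
        N = set([d for d in ranked if in_community(d)][:n1]
                + [d for d in ranked if not in_community(d)][:n2])
        if not N:          # assumed early exit
            break
        S |= N             # step S7
    return S               # step S9 (screening) would follow

links = {"in/top": {"in/a", "out/x"}, "in/a": {"in/b", "out/y"}, "out/x": {"out/z"}}
scores = {"out/x": 0.9, "in/b": 0.5, "out/y": 0.4, "out/z": 0.1}
result = collect(["in/top"], links, lambda d: d.startswith("in/"),
                 importance=lambda d: scores.get(d, 0.0),
                 inner_target=2, total_target=4, n1=1, n2=1)
```

In this toy run, stage 1 adds `in/a`, and stage 2 adds `in/b` and `out/x`, one from each side of the community boundary.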

[0069] The process of determining the next collection candidates when collecting documents within the community is described in detail below. This process corresponds to steps S1 and S4 in FIG. 7.

[0070] First, the reference relation extraction unit 102 extracts the referenced documents that are referred to by the newly collected documents (step S11). For each extracted referenced document, if the same URL is not already stored in the URL table 120, the reference relation extraction unit 102 stores the URL of the referenced document in the URL table 120 (step S12); there is no need to store the same URL more than once. When storing this information, the reference relation extraction unit 102 sets the collected flag to "off (0)".
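Steps S11 and S12 amount to inserting each referenced URL once, with its collected flag off. The sketch below assumes a simplified dictionary layout for the URL table; the function name `register_referenced_urls` and that layout are hypothetical.

```python
def register_referenced_urls(url_table, referenced_urls):
    """Add each referenced URL to the table once, with the collected flag off.

    url_table: {url: {"collected": bool}} (a simplified, hypothetical layout).
    """
    added = []
    for url in referenced_urls:
        if url not in url_table:          # step S12: skip URLs already stored
            url_table[url] = {"collected": False}
            added.append(url)
    return added

url_table = {"http://www.fujitsu.co.jp/": {"collected": True}}
added = register_referenced_urls(
    url_table,
    ["http://www.fujitsu.co.jp/", "http://www.fujitsu.co.jp/foo/", "http://www.fujitsu.co.jp/foo/"],
)
```

The already collected top page keeps its flag, and the repeated `/foo/` URL is stored only once.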

[0071] Next, the community determination unit 103 determines, based on the character string of each referenced document's URL stored in the URL table 120, whether the extracted referenced document is in the community. If it is, the community determination unit 103 sets the community flag in the URL table 120 to "on (1)"; otherwise, it sets the community flag to "off (0)" (step S13). The reference relation extraction unit 102 then stores the reference relations in the corresponding columns of the reference relation table 121 based on the determination made by the community determination unit 103.

[0072] In this embodiment, a community is given as a set of documents on the network, that is, as a document group. Whether a document belongs to the community can therefore be determined from the URLs that identify that document group. More specifically, the determination is made from features of the URL character string, as follows.
・When the community is an in-house site, a document whose domain name is the same as the domain name of the in-house site (fujitsu.co.jp, etc.) is normally determined to be a document within the community.
・When the community is an industry site, a document whose domain name is the same as any of the domain names of the sites of the companies belonging to that industry site is determined to be a document within the community.
・When the community is a user group, a document whose URL contains the same character string as any of the URLs of the users' sites (also called home documents), for example http://www.fujitsu.co.jp/foo/, is determined to be a document within the community.
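The three rules above can be sketched as follows. This is only an illustrative sketch: the function names, the `kind` labels, and the choice to treat a host as having "the same domain name" when it equals the given domain or is a subdomain of it are assumptions, not details stated in the text.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased server address of a URL ("" if absent)."""
    return (urlsplit(url).hostname or "").lower()


def in_domain(url: str, domain: str) -> bool:
    """Assumed reading of "same domain name": host equals the domain or is a subdomain of it."""
    host = host_of(url)
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def is_in_community(url: str, kind: str, patterns: list) -> bool:
    """Decide community membership from the URL string alone.

    kind "in_house" / "industry": patterns are domain names.
    kind "user_group": patterns are the URLs of the users' home documents.
    """
    if kind in ("in_house", "industry"):
        return any(in_domain(url, d) for d in patterns)
    if kind == "user_group":
        return any(home in url for home in patterns)
    raise ValueError(f"unknown community kind: {kind}")
```

For example, under these assumptions `is_in_community("http://www.flab.fujitsu.co.jp/a.html", "in_house", ["fujitsu.co.jp"])` is true, because the host is a subdomain of the in-house domain.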

[0073] The next candidate determination unit 104 determines, as the next collection candidates N, the documents within the community among the documents LT(S) − S, that is, the documents that are reference destinations of collected documents and have not yet been collected. Specifically, the next candidate determination unit 104 refers to the URL table 120 and selects as next collection candidates N the documents whose collected flag is "off (0)" and whose community flag is "on (1)" (step S14). The next collection candidates N can be expressed by equation (1) below.

[0074]
N = {d | d ∈ LT(S) − S, d is within the community}  ... (1)
By determining the next collection candidates N in this way and collecting documents evenly across the community, information on the semantically diverse documents needed within the community can be obtained without bias.
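Steps S12 and S14 and equation (1) can be illustrated with a small in-memory stand-in for the URL table 120. The `UrlEntry` class and the function names are hypothetical; the text only specifies that each stored URL carries a collected flag and a community flag.

```python
from dataclasses import dataclass


@dataclass
class UrlEntry:
    url: str
    collected: bool = False      # collected flag: "off (0)" when first stored
    in_community: bool = False   # community flag set in step S13


def register(url_table: dict, url: str, in_community: bool) -> None:
    """Step S12: store a referenced URL only if it is not already in the table."""
    if url not in url_table:
        url_table[url] = UrlEntry(url, collected=False, in_community=in_community)


def next_candidates(url_table: dict) -> list:
    """Step S14 / equation (1): uncollected documents that are within the community."""
    return [e.url for e in url_table.values() if not e.collected and e.in_community]
```

Because every URL in the table is either collected or a reference destination of a collected document, filtering on the two flags selects exactly the community members of LT(S) − S.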

[0075] Next, the process of ranking the collected documents and their reference destination documents is described with reference to FIG. 9. This processing corresponds to step S5 in FIG. 7. The reference relation extraction unit 102 and the community determination unit 103 extract the reference relations of the newly collected documents and store them, together with the community determination results, in the URL table 120 and the reference relation table 121 (steps S21 to S23). Steps S21 to S23 are the same as steps S11 to S13 described with reference to FIG. 8, so a detailed description is omitted.

[0076] Next, the ranking unit 105 calculates an importance for the collected documents and their reference destination documents, that is, for S ∪ LT(S), based on the reference relations stored in the reference relation table 121 and on features of the URL character strings stored in the URL table 120, and stores the calculated importance in the URL table 120 (step S24). Based on the community flag and the importance stored in the URL table 120, the ranking unit 105 then ranks the uncollected reference destination documents, that is, LT(S) − S, separately for documents inside and outside the community (step S25).
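Step S25 can be pictured as two independent sorts over the uncollected entries. The dictionary keys used here (`url`, `collected`, `in_community`, `importance`) are illustrative names for the columns of the URL table 120, not identifiers from the text, and the importance values are taken as already computed in step S24.

```python
def rank_uncollected(entries: list) -> tuple:
    """Rank uncollected documents by importance, separately inside and outside the community.

    Each entry is a dict with the assumed keys
    "url", "collected", "in_community" and "importance".
    Returns (inside_urls, outside_urls), each in descending order of importance.
    """
    pending = [e for e in entries if not e["collected"]]
    inside = sorted((e for e in pending if e["in_community"]),
                    key=lambda e: e["importance"], reverse=True)
    outside = sorted((e for e in pending if not e["in_community"]),
                     key=lambda e: e["importance"], reverse=True)
    return [e["url"] for e in inside], [e["url"] for e in outside]
```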

[0077] The importance calculation in step S24 is described in detail below. As noted above, the ranking unit 105 calculates the importance of a document from the documents' reference relations and URLs, without analyzing the semantic content of the collected documents. The importance assigned to a document on the basis of reference relations is hereinafter called link importance. The basic ideas behind assigning link importance are as follows.
・A document that is referenced many times from URLs with low similarity is important.

[0078] For example, multiple documents on the same site are generally referenced by other documents on that site, and the URLs of those documents are similar to one another. A document referenced from URLs with high similarity can therefore be presumed to have low importance.
・A document referenced from more documents is more important, and a document that is referenced from an important document and whose URL has low similarity is important.

[0079] For example, well-known directory services and government agencies are referenced from many documents, and documents referenced from such important documents are considered to have high importance. Also, documents on a service (site) that holds many documents or mirror sites are often referenced from within that site; because the URLs of documents on the same site are usually similar, adopting the idea that "a document with low URL similarity is important" makes it possible to avoid retrieving many documents from the same site.
・URL similarity is defined from the surface string of the URL so that it is lowest when the server address, path, and file name all differ, and high for mirror sites and documents on the same server.

[0080] By adopting these three ideas, reference relations are not all treated equally; instead, each reference relation is given a weight that reflects link importance. More specifically, the weight is given as the reciprocal of the similarity between the URLs of the reference source and reference destination documents. The calculation of link importance is described in more detail below.

[0081] Let the document set for which link importance is calculated be DOC = {p_1, p_2, ..., p_N}, the link importance of document p be W_p, the set of documents referenced by document p be Ref(p), the set of documents that reference document p be Refed(p), the URL similarity between documents p and q be sim(p,q), and the dissimilarity be diff(p,q) = 1/sim(p,q). When there is a reference from document p to document q, the weight lw(p,q) of that reference is defined by equation (2) below.

[0082]

[Equation 1]

[0083] As can be seen from equation (2), lw(p,q) becomes larger as the URL similarity sim(p,q) between p and q is lower and as the number of references from p is smaller. The link importance of each document is defined, for each p ∈ DOC and with C_q a constant (the lower limit of the importance; a different value may be given for each document), as

[0084]

[Equation 2]

[0085] the solution of this system of simultaneous linear equations. The ranking unit 105 assigns a link importance to each document by solving this system. Many existing algorithms are available for solving simultaneous linear equations, so their description is omitted. Equations (2) and (3) can be seen to embody the ideas described above.

[0086] Next, the URL similarity sim(p,q) between documents p and q used in equations (2) and (3) is described. URL similarity is calculated by a URL determination unit (not shown) of the ranking unit 105. In general, the URL of a document consists of three kinds of information: a server address, a path, and a file name. For example, the WWW document URL http://www.flab.fujitsu.co.jp/hypertext/news/1999/product1.html consists of the server address (www.flab.fujitsu.co.jp), the path (hypertext/news/1999), and the file name (product1.html).

[0087] In this embodiment, the URL similarity of two given documents p and q is defined by a combination of these three kinds of information. Possible choices for the similarity sim(p,q) include, for example, the domain similarity sim_domain(p,q) and the merged similarity sim_merge(p,q) described below.

[0088] The domain similarity sim_domain(p,q) is calculated from the similarity of domains. A domain is the trailing part of a server address and represents a company or organization. For a US server whose address ends in .com, .edu, .org, or the like, the domain is the last two components of the server address; for a server in another country whose address ends in .jp, .fr, or the like, the domain is the last three components.

[0089] The domain similarity between document p and document q is defined as follows.
sim_domain(p,q) = 1/α  (when p and q are in the same domain)
                = 1    (when p and q are in different domains)
Here, α is a constant that takes a real value greater than 0 and less than 1.
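A minimal sketch of the domain rule and of sim_domain follows. The suffix set and the default value of `alpha` are assumptions: the text says ".com, .edu, .org, or the like" without giving a complete list, and it only constrains α to lie strictly between 0 and 1.

```python
# Assumed list of "US-style" endings; the text names only these three as examples.
US_STYLE_SUFFIXES = {"com", "edu", "org"}


def domain_of(server_address: str) -> str:
    """Last two components for US-style addresses, last three otherwise."""
    labels = server_address.lower().split(".")
    depth = 2 if labels[-1] in US_STYLE_SUFFIXES else 3
    return ".".join(labels[-depth:])


def sim_domain(server_p: str, server_q: str, alpha: float = 0.5) -> float:
    """1/alpha when both servers share a domain, 1 otherwise (0 < alpha < 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be greater than 0 and less than 1")
    return 1.0 / alpha if domain_of(server_p) == domain_of(server_q) else 1.0
```

Since 0 < α < 1, the same-domain value 1/α is always greater than 1, so the dissimilarity diff(p,q) = 1/sim(p,q) is smaller for references within one domain than for references across domains.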

[0090] Alternatively, as sim(p,q), a similarity sim_merge(p,q) that merges the three kinds of information described above is defined as follows.
sim_merge(p,q) = (server address similarity) + (path similarity) + (file name similarity)
The calculation of each term on the right-hand side is described below.

[0091] For the server address similarity, the hierarchy levels of the addresses are compared starting from the end, and if they match for n levels, the similarity is 1 + n. For example, www.fujitsu.co.jp and www.flab.fujitsu.co.jp match for three levels, so the similarity is 4. www.fujitsu.co.jp and www.fujitsu.com do not match at even one level (zero matching levels), so the similarity is 1.

[0092] For the path similarity, the paths are compared from the beginning, element by element as delimited by "/", and the number of matching levels is the similarity. For example, /doc/patent/index.html and /doc/patent/1999/2/file.html match for two levels, so the similarity is 2.

[0093] The file name similarity is 1 when the file names match. Using sim_merge(p,q) likewise prevents many documents with similar URLs from being retrieved.
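The three terms of sim_merge can be sketched as below. Two points are assumptions rather than statements from the text: the last path segment is treated as the file name unless the URL path ends in "/", and the file name similarity is taken to be 0 when the names differ or are absent (the text only gives the matching case).

```python
from urllib.parse import urlsplit


def split_url(url: str) -> tuple:
    """Split a URL into (server address, list of path elements, file name)."""
    parts = urlsplit(url)
    server = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        return server, segments[:-1], segments[-1]
    return server, segments, ""


def common_prefix_len(a: list, b: list) -> int:
    """Number of leading elements on which the two sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def server_similarity(server_p: str, server_q: str) -> int:
    """1 + number of address levels that match, compared from the end."""
    return 1 + common_prefix_len(server_p.split(".")[::-1], server_q.split(".")[::-1])


def path_similarity(path_p: list, path_q: list) -> int:
    """Number of path elements that match, compared from the beginning."""
    return common_prefix_len(path_p, path_q)


def file_similarity(file_p: str, file_q: str) -> int:
    """1 when the file names match; 0 otherwise (the 0 case is an assumption)."""
    return 1 if file_p and file_p == file_q else 0


def sim_merge(url_p: str, url_q: str) -> int:
    sp, pp, fp = split_url(url_p)
    sq, pq, fq = split_url(url_q)
    return server_similarity(sp, sq) + path_similarity(pp, pq) + file_similarity(fp, fq)
```

With these definitions the worked examples in the text are reproduced: the two fujitsu.co.jp server addresses give 4, www.fujitsu.co.jp against www.fujitsu.com gives 1, and the two /doc/patent/ paths give 2.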

[0094] In this way, the ranking unit 105 assigns an importance to each document and ranks documents with higher importance higher. Thus, according to this embodiment, the ranking unit 105 can assign importance to documents based on the acquired reference relations and the features of the URL character strings, without analyzing the semantic content of the document body, that is, quickly and accurately, and can rank the documents by that importance.

[0095] The process of selecting collected documents is described in detail below with reference to FIG. 10. This processing corresponds to step S9 in FIG. 7. First, the grouping unit 106 groups the collected document group S based on the reference expressions used in the collected document group S (step S31). A reference expression is, for example in HTML (HyperText Mark-up Language), the part enclosed by an anchor tag.

[0096] More specifically, reference expressions (character strings used when making a reference) such as "next" and "previous" are stored in advance in a grouping reference expression table (not shown). For documents that use such reference expressions, the reference source document and the reference destination document are presumed to be the same content distributed across multiple URLs. The grouping unit 106 extracts the reference expressions stored in the grouping reference expression table from the documents and groups the documents as follows.
・When document doc2 is referenced from document doc1 by an expression such as "next", "continued", or "Next", the grouping unit 106 collapses document doc2 into document doc1. This operation is repeated as many times as possible.
・When document doc2 is referenced from document doc1 by an expression such as "previous", "go back", or "Prev", the grouping unit 106 collapses document doc1 into document doc2. This operation is repeated as many times as possible.
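The two collapsing rules, repeated until nothing more can be merged, can be sketched with a union-find structure. The expression sets, the case-insensitive matching, and the `(source, anchor text, destination)` link format are illustrative assumptions; the text stores its expressions in a grouping reference expression table that is not shown.

```python
# Hypothetical contents of the grouping reference expression table.
NEXT_EXPRESSIONS = {"next", "continued"}
PREV_EXPRESSIONS = {"prev", "previous", "go back"}


def collapse(links: list) -> dict:
    """Map every document to the document it is collapsed into.

    links: iterable of (source, anchor_text, destination) triples.
    A "next" link collapses the destination into the source;
    a "previous" link collapses the source into the destination.
    """
    parent = {}

    def find(doc):
        parent.setdefault(doc, doc)
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]  # path halving
            doc = parent[doc]
        return doc

    for src, text, dst in links:
        label = text.strip().lower()
        if label in NEXT_EXPRESSIONS:
            absorbed, keeper = dst, src
        elif label in PREV_EXPRESSIONS:
            absorbed, keeper = src, dst
        else:
            find(src)
            find(dst)
            continue
        root_absorbed, root_keeper = find(absorbed), find(keeper)
        if root_absorbed != root_keeper:  # also guards against circular next/prev chains
            parent[root_absorbed] = root_keeper

    return {doc: find(doc) for doc in list(parent)}
```

Following parent pointers to the root plays the role of "repeating the operation as many times as possible": a chain page1 → page2 → page3 linked by "Next" ends up represented by page1.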

[0097] Next, the keyword assignment unit 107 assigns keywords to the collected documents S based on reference expressions (step S32). The keyword assignment process is described in detail later. Finally, the ranking unit 105 assigns an importance to the collected documents in the same way as in step S24 of FIG. 9 described above and stores the importance in the URL table 120. The ranking unit 105 then ranks the collected documents by importance (step S33).

[0098] Next, the keyword assignment process of step S32 is described in detail with reference to FIG. 11. First, among the reference expressions used in collected documents, those that are frequently used regardless of the reference destination document, such as "Home" and "Back to top", are stored in advance in an unnecessary word dictionary (not shown). The keyword assignment unit 107 extracts reference expressions from the collected document group S and, for each reference expression w, counts the number DF(w) of distinct documents referenced using w, and stores the count DF(w) in the reference expression table 122 together with an expression ID identifying w and the reference expression itself (a character string) (step S41). At this stage, the necessity flag is set to "off (0)".

[0099] The keyword assignment unit 107 excludes from the keyword candidates those reference expressions w whose DF(w) is equal to or greater than a predetermined number (step S42). In other words, letting N be the total number of documents including reference destination documents, reference expressions w that satisfy the following expression are excluded.

[0100]
DF(w) > αN
Here, α is a constant and may be, for example, 0.1. The keyword assignment unit 107 also excludes from the keyword candidates the specific reference expressions stored in the unnecessary word dictionary (step S43). Because these reference expressions are used regardless of the reference destination document, they are not suitable as keywords.

[0101] The keyword assignment unit 107 takes a document d out of the collected documents S and sets the difference between the collected document group S and d, that is, S − d, as the new collected document group S (step S44).

[0102] The keyword assignment unit 107 counts, for document d, the number of times TF(d, w) that d is referenced by each reference expression w, and calculates the weight W(d, w) of each reference expression w for document d using equation (4) below (step S45).

[0103]
W(d, w) = TF(d, w) log(N / DF(w))  ... (4)
The keyword assignment unit 107 accesses the reference expression table 122 and sets the necessity flag to "on (1)" for at most n reference expressions in descending order of weight W. That is, at most n reference expressions, taken in descending order of weight W, become the keywords of document d.
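Steps S41 to S45 amount to a TF-IDF style weighting over anchor texts, which can be sketched as follows. The function name, the `(reference expression, referenced document)` input format, the use of the natural logarithm, and the alphabetical tie-break are assumptions made for illustration; the text does not specify the base of the logarithm or how ties are ordered.

```python
import math
from collections import Counter, defaultdict


def assign_keywords(references, total_docs, unnecessary_words=frozenset(),
                    alpha=0.1, max_keywords=3):
    """Pick up to max_keywords reference expressions per referenced document.

    references: iterable of (reference_expression, referenced_document) pairs.
    total_docs: N, the total number of documents including reference destinations.
    """
    targets_by_expr = defaultdict(set)   # w -> distinct documents referenced with w
    tf = defaultdict(Counter)            # d -> Counter of expressions referring to d
    for expr, target in references:
        targets_by_expr[expr].add(target)
        tf[target][expr] += 1

    df = {w: len(targets) for w, targets in targets_by_expr.items()}   # step S41
    candidates = {w for w in df
                  if df[w] <= alpha * total_docs        # step S42: drop DF(w) > alpha * N
                  and w not in unnecessary_words}       # step S43

    keywords = {}
    for d, counts in tf.items():                        # steps S44 and S45
        weights = {w: c * math.log(total_docs / df[w])  # equation (4)
                   for w, c in counts.items() if w in candidates}
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        keywords[d] = ranked[:max_keywords]
    return keywords
```

An expression such as "Home" that points at many different documents has a large DF(w) and is dropped in step S42, while an expression used repeatedly for a single document gets a high weight for that document.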

[0104] One feature of keywords obtained from reference expressions in this way is that, unlike keywords based on the words contained in the body of document d, they capture a variety of alternative names. For example, from the reference expressions that point to a company's home page, the various names of that company (official name, abbreviation, common name, English name, and so on) can be obtained. Likewise, for the term "Linux", various alternative renderings such as "リナックス" (Rinakkusu) and "ライナックス" (Rainakkusu) can be obtained as keywords. By contrast, the body of a single document generally uses only one of these names consistently, so alternative names cannot be obtained when keywords are taken from the body.

[0105] In addition to the keywords obtained from reference expressions, keywords taken from the frequently occurring words in the body of document d and keywords obtained from the URL of document d (for example, fujitsu for http://www.fujitsu.com/) may be added. This makes it possible to assign a variety of keywords to document d.

[0106] FIG. 12 shows an example of a screen that provides users with documents collected by the document collection device according to the first embodiment. In the example of FIG. 12, the collected excellent content 130 is divided into directories by a classification engine 150 and provided to clients of a server 160. By entering a keyword or selecting a category on screen 180, a client can display a link, or a collection of links, to the documents it wishes to view.

[0107] When the client enters a keyword, links to the documents retrieved for that keyword are displayed together with their importance, as shown on screen 181. According to this embodiment, the search can also cover alternative names of the entered keyword. When a category is selected, a collection of links to documents related to the selected category is displayed, as shown on screen 182.

[0108] As shown on screens 181 and 182, when the retrieved documents are presented, they may be presented separately as documents inside and outside the community, based on the community flag stored in the URL table 120.

[0109] A document collection device according to the second embodiment is described below. The document collection device according to the second embodiment collects documents relating to a specific field. It adopts the following idea.
・On a network, documents in a parent-child or sibling relationship by reference tend to be similar in content. A document that is frequently in a parent-child or sibling relationship with a certain document group is likely to belong to the same field as that original document group. Among the documents in a parent-child or sibling relationship with the original document group, those with a high reference degree (parent-child relationship) or a high co-reference degree (sibling relationship) are collected and folded into the original document group. Repeating this operation over multiple stages collects documents relating to the field.

[0110] FIG. 13 shows the configuration of the document collection device according to the second embodiment. As shown in FIG. 13, the document collection device 200 according to the second embodiment includes a document collection unit 101, a reference relation extraction unit 102, a candidate determination unit 104, a reference degree/co-reference degree calculation unit 201, a ranking unit 105, a grouping unit 106, and a keyword assignment unit 107. The reference degree/co-reference degree calculation unit 201 calculates, based on the reference relations of documents, the degree to which a document is related to the specific field. The functions of the other units are as described for the first embodiment.

[0111] In the document collection device according to the second embodiment, before collection starts, representative documents of a given field are first collected using existing search engines or link collections and given as a positive example document group PS; documents of arbitrary fields that do not overlap with that field are collected in the same way and given as a negative example document group NS; and PS ∪ NS is taken as the collected document group S. This collected document group S is the starting point of collection.

[0112] The reference relation extraction unit 102 extracts reference relations from the collected document group S, stores the URLs of the reference destination documents of the collected document group S in the URL table 120, and stores the extracted reference relations in the reference relation table 121. In the document collection device according to the second embodiment, the URL table 120 contains, instead of the community flag, a positive example flag column indicating whether the document is included in the positive example document group PS. The positive example flag is "on (1)" when the document is included in the positive example document group PS. Also, when reference relations are stored in the reference relation table 121, there is no need to separate them into inside and outside the community.

[0113] Based on the reference relations extracted by the reference relation extraction unit 102, the reference degree/co-reference degree calculation unit 201 calculates a reference degree and a co-reference degree that indicate the relationship between the positive example document group PS and the negative example document group NS on the one hand and the reference destination documents of the collected documents S on the other. Based on the reference degree and co-reference degree calculated by the reference degree/co-reference degree calculation unit 201, the next candidate determination unit 104 determines as next collection candidates N those documents that satisfy a predetermined condition among the reference destination documents of the collected document group S that are not included in the positive example document group PS. The next candidate determination unit 104 removes from the negative example document group NS any next collection candidates N that are included in it and adds them to the positive example document group PS.

[0114] The document collection unit 101 refers to the URL table 120, collects the not-yet-collected documents among the next collection candidates N, and adds the collected documents to the positive example document group PS. The document collection device 200 according to the second embodiment repeats the process described above, namely extracting the reference relations of the collected documents S, determining the next collection candidates N from those reference relations, and collecting the next collection candidates N, until the number of documents in the positive example document group PS reaches or exceeds a specified number.

[0115] When the number of collected documents S reaches or exceeds the specified number, the grouping unit 106 groups the collected document group S based on reference expressions, and the keyword assignment unit 107 assigns keywords to the collected document group S based on, among other things, how frequently reference expressions are used. The ranking unit 105 calculates the importance of each collected document S based on the reference relations and the features of the URL character strings, and ranks the collected documents S by importance. Field-specific excellent content 210 is created in this way. Thus, the document collection device according to the second embodiment can collect documents relating to a specific field, group them, and assign keywords to them without analyzing the content of the document body.

[0116] The field-specific excellent content 210 is provided to the server 160 via the search engine 140. Clients of the server can receive the search service using the browser 160.

【０１１７】以下、第２実施形態に係わる文書収集装置
が実現する特定分野に関する文書収集方法について説明
する。まず、用いる表記法について説明する。・ＬＴ（Ｂ）は、文書群Ｂの参照先文書集合を示す。・ＬＴ（ｐ）は、文書ｐの参照先文書集合を示す。・ＬＳ（ｄ，Ｘ）＝｛ｃ∈Ｘ|ｃ refers ｄ｝は、文書
集合Ｘのうち文書ｄを参照している文書の集合を示す。・ＬＳ（Ａ，Ｘ）＝｛ｃ∈Ｘ|∃ｄ∈Ａ，ｃrefers ｄ｝
は、文書集合Ｘのうち集合Ａ中の少なくとも１文書を参
照している文書の集合を示す。・ＣＣ（ｄ，Ａ，Ｘ）＝ＬＳ（ｄ，Ｘ）∩ＬＳ（Ａ，
Ｘ）は、文書集合Ｘのうちで、文書ｄ、及び集合Ａの文
書（少なくとも１文書）の両方を参照している文書の集
合を示す。Hereinafter, a document collection method for a specific field, realized by the document collection device according to the second embodiment, will be described. First, the notation used is described.
・LT(B) denotes the set of documents referenced by the document group B.
・LT(p) denotes the set of documents referenced by the document p.
・LS(d, X) = {c ∈ X | c refers to d} denotes the set of documents in the document set X that refer to the document d.
・LS(A, X) = {c ∈ X | ∃d ∈ A, c refers to d} denotes the set of documents in the document set X that refer to at least one document in the set A.
・CC(d, A, X) = LS(d, X) ∩ LS(A, X) denotes the set of documents in the document set X that refer to both the document d and at least one document of the set A.
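
The notation above maps directly onto set operations. The following is a minimal sketch, assuming the reference relations are held in a hypothetical dictionary `links` that maps each referring document to the set of documents it references; the function names `lt`, `ls_doc`, `ls_set`, and `cc` are illustrative and do not appear in the specification. LT(p) for a single document p corresponds to `lt([p], links)`.

```python
from typing import Dict, Iterable, Set

# Hypothetical representation: referring document -> set of referenced documents.
Links = Dict[str, Set[str]]


def lt(sources: Iterable[str], links: Links) -> Set[str]:
    """LT(B): documents referenced by at least one document in B."""
    referenced: Set[str] = set()
    for p in sources:
        referenced |= links.get(p, set())
    return referenced


def ls_doc(d: str, x: Iterable[str], links: Links) -> Set[str]:
    """LS(d, X): documents in X that refer to document d."""
    return {c for c in x if d in links.get(c, set())}


def ls_set(a: Iterable[str], x: Iterable[str], links: Links) -> Set[str]:
    """LS(A, X): documents in X that refer to at least one document in A."""
    targets = set(a)
    return {c for c in x if links.get(c, set()) & targets}


def cc(d: str, a: Iterable[str], x: Iterable[str], links: Links) -> Set[str]:
    """CC(d, A, X) = LS(d, X) ∩ LS(A, X)."""
    members = set(x)
    return ls_doc(d, members, links) & ls_set(a, members, links)
```

Representing each set as a Python `set` keeps the intersection and difference operations used later in the description to one expression each.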

【０１１８】図１４に、ＬＴ（Ｓ）、ＬＴ（ｐ）、ＬＳ
（ｄ，Ｘ）及びＬＳ（Ａ，Ｘ）について、各集合が意味
する文書の参照関係を示す。図１４において黒丸は文書
を示し、矢印は参照関係を示し、矢印の元が参照元、矢
印の先が参照先を示す。図１４に示すように、ＬＴ
（Ｂ）とＬＳ（Ａ，Ｘ）及びＬＴ（ｐ）とＬＳ（ｄ，
Ｘ）は、それぞれ矢印が逆になっている、つまり参照先
文書と参照元文書が入れかわった関係にあることが分か
る。また、図１５に、ＣＣ（ｄ，Ａ，Ｘ）が意味する文
書の参照関係を示す。FIG. 14 shows, for LT(S), LT(p), LS(d, X), and LS(A, X), the reference relations among documents that each set represents. In FIG. 14, a black circle denotes a document and an arrow denotes a reference relation; the tail of the arrow is the referring document and the head of the arrow is the referenced document. As FIG. 14 shows, the arrows are reversed between LT(B) and LS(A, X), and between LT(p) and LS(d, X); that is, the referenced documents and the referring documents are interchanged. FIG. 15 shows the reference relations among documents represented by CC(d, A, X).

【０１１９】以下、図１６を用いて特定分野に関する文
書を収集する処理について説明する。第２実施形態に係
わる文書収集装置によれば、「ＸＭＬ」や「Ｌｉｎｕ
ｘ」といった、特定分野（ジャンル）に関する意味的に
類似した文書を優先的に収集する場合に、文書本文の内
容を解析する処理を行わずに、参照関係に基づいて収集
することが可能である。The processing for collecting documents on a specific field is described below with reference to FIG. 16. When semantically similar documents on a specific field (genre) such as "XML" or "Linux" are to be collected preferentially, the document collection device according to the second embodiment can collect them based on reference relations, without analyzing the content of the document body.

【０１２０】まず、当該分野に属する代表的な文書を、
既存の検索エンジンやリンク集から探し出して収集し、
正例文書群ＰＳとする。同様にして当該分野とは重なら
ない分野に属する文書を、探し出して収集し、負例文書
群ＮＳとする。この正例文書群ＰＳと負例文書群ＮＳが
初期文書群となる。そして、ＰＳ及びＮＳの文書のＵＲ
Ｌ、収集済みフラグ（全て「オン（１）」）、及び正例
フラグ（正例文書の場合「オン（１）」）をＵＲＬテー
ブル１２０に格納する。正例文書群ＰＳと負例文書群Ｎ
Ｓの和集合ＰＳ∪ＮＳを収集済み文書群Ｓとする（ステ
ップＳ５１）。ここで、例えば、当該分野を「コンピュ
ータ」であるとすると、当該分野と重ならない分野の例
として、「手芸」、「料理」、「美容」等が考えられ
る。First, representative documents belonging to the field are found and collected from existing search engines and link collections, and are taken as the positive example document group PS. In the same way, documents belonging to fields that do not overlap with the field are found and collected, and are taken as the negative example document group NS. The positive example document group PS and the negative example document group NS together form the initial document group. The URLs of the documents in PS and NS, their collected flags (all "ON (1)"), and their positive example flags ("ON (1)" for a positive example document) are stored in the URL table 120. The union PS ∪ NS of the positive example document group PS and the negative example document group NS is taken as the collected document group S (step S51). For example, if the field is "computers", fields that do not overlap with it could include "handicrafts", "cooking", and "beauty".

【０１２１】参照関係抽出部１０２は、収集開始時は初
期の収集済み文書群Ｓ（初期文書群）から、それ以降は
新規収集文書から参照関係を抽出し（ステップＳ５
２）、参照先文書のＵＲＬをＵＲＬテーブル１２０に格
納し、参照関係を参照関係テーブル１２１に格納する。
この処理は、第１実施形態と同様である。The reference relation extracting unit 102 extracts reference relations from the initial collected document group S (the initial document group) at the start of collection, and from the newly collected documents thereafter (step S52). It stores the URLs of the referenced documents in the URL table 120 and the reference relations in the reference relation table 121. This processing is the same as in the first embodiment.

【０１２２】参照度／共参照度算出部２０１は、抽出さ
れた参照関係に基づいて、収集済み文書群Ｓの参照先文
書から正例文書群ＰＳに含まれる文書を除いた文書集合
Ｔ（Ｓ）＝ＬＴ（Ｓ）−ＰＳに含まれる文書ｄ∈Ｔ
（Ｓ）について、以下の（５）式を用いて参照度Ｒ
_score（ｄ，ＰＳ，Ｓ）を算出する。次候補判定部１０
５は、参照度Ｒ_score（ｄ，ＰＳ，Ｓ）が上位ｎ１件に
入っている文書群をＮ１とする。（ステップＳ５３）。
なお、収集済み文書が正例文書群ＰＳに含まれるか否か
は、ＵＲＬテーブル１２０の正例フラグを参照すること
により判定できる。Based on the extracted reference relations, the reference degree / co-reference degree calculation unit 201 calculates the reference degree R_score(d, PS, S), using equation (5) below, for each document d ∈ T(S), where T(S) = LT(S) - PS is the set of documents referenced by the collected document group S, excluding the documents included in the positive example document group PS. The next candidate determination unit 105 takes as N1 the group of documents whose reference degree R_score(d, PS, S) ranks in the top n1 (step S53). Whether a collected document is included in the positive example document group PS can be determined by referring to the positive example flag in the URL table 120.

【０１２３】[0123]

【数３】 (Equation 3)

【０１２４】（５）式の第１項は、文書ｄを参照してい
る正例文書群ＰＳの文書数の対数を示す。また、（５）
式の第２項は、文書ｄを参照している収集済み文書数に
対する、文書ｄを参照している正例文書群ＰＳの文書数
の割合を示す。従って、収集済み文書群Ｓのうち正例文
書群ＰＳからのみ多く参照されている文書ｄほど、Ｒ
_score（ｄ，ＰＳ，Ｓ）が大きな値を取ることが分か
る。The first term of equation (5) is the logarithm of the number of documents in the positive example document group PS that refer to the document d. The second term of equation (5) is the ratio of the number of documents in the positive example document group PS that refer to the document d to the number of collected documents that refer to the document d. It follows that R_score(d, PS, S) takes a larger value the more the document d is referenced, within the collected document group S, by documents of the positive example document group PS alone.
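
The two terms described above can be sketched in code. Equation (5) itself is not reproduced in this text, so the way the two terms are combined is not confirmed here: the sketch below assumes they are multiplied, assumes a natural logarithm, and assumes a score of 0 when no positive example document refers to d. The dictionary `links` (referring document to set of referenced documents) and the function name `r_score` are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def r_score(d: str, ps: Iterable[str], s: Iterable[str],
            links: Dict[str, Set[str]]) -> float:
    """Reference degree sketch: log|LS(d, PS)| * |LS(d, PS)| / |LS(d, S)|.

    Assumption: the two terms of equation (5) are combined by multiplication.
    """
    positives = set(ps)
    collected = set(s) | positives  # PS is part of the collected group S
    from_ps = {c for c in positives if d in links.get(c, set())}   # LS(d, PS)
    from_all = {c for c in collected if d in links.get(c, set())}  # LS(d, S)
    if not from_ps:
        return 0.0  # assumption: no positive referrer means no score
    first_term = math.log(len(from_ps))
    second_term = len(from_ps) / len(from_all)
    return first_term * second_term
```

Under the multiplication assumption, a document referenced by exactly one positive example document scores 0 because log 1 = 0, so only documents with two or more positive referrers are distinguished.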

【０１２５】つまり、次候補判定部１０５は、参照度Ｒ
_score（ｄ，ＰＳ，Ｓ）に基づいて、新規収集文書の参
照先文書のうち、特定分野に関係ある正例文書群ＰＳか
ら多く参照され、特定分野とあまり関係ない負例文書群
ＮＳから参照されていない文書をＮ１として決定する。
図１７に、文書ｄについて参照度を算出する際に、
（５）式に含まれる各集合が意味する参照関係を示す。That is, based on the reference degree R_score(d, PS, S), the next candidate determination unit 105 determines as N1 those documents, among the documents referenced by the newly collected documents, that are referenced by many documents of the positive example document group PS, which is related to the specific field, and are not referenced by the negative example document group NS, which has little relation to the specific field. FIG. 17 shows the reference relations represented by each set appearing in equation (5) when the reference degree is calculated for the document d.

【０１２６】続いて、参照度／共参照度算出部２０１
は、文書ｄ∈Ｔ（Ｓ）−Ｎ１について、以下の（６）式
を用いて共参照度Ｃ_score（ｄ，ＰＳ，Ｓ）を算出す
る。次候補判定部１０５は、ｄ∈Ｔ（Ｓ）−Ｎ１のうち
で共参照度Ｃ_score（ｄ，ＰＳ，Ｓ）が上位ｎ２件に入
っている文書群をＮ２とする（ステップＳ５４）。Subsequently, the reference degree / co-reference degree calculation unit 201 calculates the co-reference degree C_score(d, PS, S), using equation (6) below, for each document d ∈ T(S) - N1. The next candidate determination unit 105 takes as N2 the group of documents in T(S) - N1 whose co-reference degree C_score(d, PS, S) ranks in the top n2 (step S54).

【０１２７】[0127]

【数４】 (Equation 4)

【０１２８】（６）式の第１項の対数の中身は、文書ｄ
及び正例文書群ＰＳの文書の両方を参照している収集済
み文書ｐ全てについての、文書ｐの参照先文書であって
正例文書群ＰＳに含まれる文書数の積和を示す。従っ
て、共参照度Ｃ_score（ｄ，ＰＳ，Ｓ）は、文書ｄ及び
正例文書群ＰＳの文書の両方を参照している収集済み文
書ｐの数が多い文書ｄほど、及び、このような文書ｐの
参照先文書であって正例文書群ＰＳに含まれる文書の数
が多いような文書ｄほど、大きな値を取ることが分か
る。言い換えると、正例文書群ＰＳの文書を参照してい
る収集済み文書から参照されている文書ｄについて、そ
の文書ｄを参照している収集済み文書の数が多い文書ｄ
ほど、共参照度Ｃ_score（ｄ，ＰＳ，Ｓ）は、大きな値
を取る。The argument of the logarithm in the first term of equation (6) is the sum, taken over all collected documents p that refer to both the document d and a document of the positive example document group PS, of the number of documents referenced by p that are included in the positive example document group PS. Accordingly, the co-reference degree C_score(d, PS, S) takes a larger value the more collected documents p refer to both the document d and a document of PS, and the more documents of PS each such document p references. In other words, for a document d that is referenced by collected documents that also reference documents of the positive example document group PS, the co-reference degree C_score(d, PS, S) is larger the more such collected documents reference the document d.

【０１２９】（６）式の第２項は、文書ｄの参照元とな
っている収集済み文書の数に対する、文書ｄと共に参照
されている文書ｐの数の割合を示す。共参照度Ｃ_score
（ｄ，ＰＳ，Ｓ）は、この割合が大きいほど大きな値を
取る。図１８に、文書ｄについて共参照度を算出する際
に、（６）式に含まれる各集合が意味する参照関係を示
す。The second term of equation (6) is the ratio of the number of documents p, which reference the document d together with documents of the positive example document group PS, to the number of collected documents that reference the document d. The co-reference degree C_score(d, PS, S) takes a larger value as this ratio increases. FIG. 18 shows the reference relations represented by each set appearing in equation (6) when the co-reference degree is calculated for the document d.
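
The co-reference degree can be sketched in the same way. Equation (6) is not reproduced in this text, so the sketch below rests on stated assumptions: the two terms are multiplied, the logarithm is natural, and the score is 0 when no collected document references both d and a positive example document. The dictionary `links` and the name `c_score` are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def c_score(d: str, ps: Iterable[str], s: Iterable[str],
            links: Dict[str, Set[str]]) -> float:
    """Co-reference degree sketch.

    first term  : log of the sum over p in CC(d, PS, S) of |LT(p) ∩ PS|
    second term : |CC(d, PS, S)| / |LS(d, S)|
    Assumption: the two terms of equation (6) are combined by multiplication.
    """
    positives = set(ps)
    referrers = {c for c in set(s) if d in links.get(c, set())}     # LS(d, S)
    co_referrers = {p for p in referrers if links[p] & positives}   # CC(d, PS, S)
    if not co_referrers:
        return 0.0  # assumption: no co-referring document means no score
    shared = sum(len(links[p] & positives) for p in co_referrers)
    return math.log(shared) * len(co_referrers) / len(referrers)
```

A document d scores highly here when many of the pages that link to it also link to several positive example documents, which matches the behaviour described for hub-like referring documents p.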

【０１３０】次候補判定部１０５は次収集候補Ｎ＝Ｎ１
∪Ｎ２とする（ステップＳ５５）。次候補判定部１０５
は、次収集候補ＮのＵＲＬをキーとしてＵＲＬテーブル
１２０を検索し、次収集候補Ｎの正例フラグを「オン
（１）」する。この処理により、負例文書群ＮＳに含ま
れていたが、次収集候補として判定された文書が、負例
文書群ＮＳから除かれ、正例文書群ＰＳに加えられるこ
ととなる（ステップＳ５６）。The next candidate determination unit 105 sets the next collection candidates to N = N1 ∪ N2 (step S55). The next candidate determination unit 105 searches the URL table 120 using the URLs of the next collection candidates N as keys, and sets the positive example flag of each next collection candidate to "ON (1)". As a result of this processing, any document that was included in the negative example document group NS but has been determined to be a next collection candidate is removed from the negative example document group NS and added to the positive example document group PS (step S56).
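
The selection of N = N1 ∪ N2 in steps S53 to S55 can be sketched as a two-stage top-n selection. The sketch assumes the scores have already been computed and are passed in as dictionaries; the names `top_n` and `next_candidates` are hypothetical, and breaking ties by document name is an assumption made only to keep the result deterministic.

```python
from typing import Dict, Set


def top_n(scores: Dict[str, float], n: int) -> Set[str]:
    """Return the n highest-scoring documents (ties broken by name, an assumption)."""
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return set(ranked[:n])


def next_candidates(r_scores: Dict[str, float], c_scores: Dict[str, float],
                    n1: int, n2: int) -> Set[str]:
    """N = N1 ∪ N2, where N2 is chosen only from documents outside N1."""
    first = top_n(r_scores, n1)                                    # N1 (step S53)
    remaining = {d: v for d, v in c_scores.items() if d not in first}
    second = top_n(remaining, n2)                                  # N2 (step S54)
    return first | second                                          # N (step S55)
```

In the description the co-reference degree is calculated only for documents in T(S) - N1; the sketch reaches the same result by filtering out N1 before ranking.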

【０１３１】文書収集部１０１は、ＵＲＬテーブル１２
０に格納されたＵＲＬに基づいて、次収集候補Ｎのうち
未収集文書をネットワークから収集し、収集した文書に
対応する収集済みフラグを「オン（１）」にする（ステ
ップＳ５７）。この処理により、新規収集文書を正例文
書群ＰＳに加える。文書収集部１０１は、ＵＲＬテーブ
ル１２０を参照し、正例文書群ＰＳの文書数が規定され
た数以上であるか否か判定する（ステップＳ５８）。正
例文書群ＰＳの文書数が規定された数以上でない場合
（ステップＳ５８：Ｎｏ）、ステップＳ５２に戻って処
理を繰り返す。Based on the URLs stored in the URL table 120, the document collection unit 101 collects from the network those next collection candidates N that have not yet been collected, and sets the collected flag of each collected document to "ON (1)" (step S57). This processing adds the newly collected documents to the positive example document group PS. The document collection unit 101 refers to the URL table 120 and determines whether the number of documents in the positive example document group PS is equal to or greater than the specified number (step S58). If it is not (step S58: No), the processing returns to step S52 and is repeated.
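
The loop formed by steps S51 to S58 can be sketched as follows. This is an illustration under stated assumptions, not the device's implementation: `fetch_links` is a hypothetical callable standing in for downloading a document and extracting its references, the scoring functions are passed in as parameters, and the `max_rounds` limit and the early exit when no candidates remain are additions made so that the sketch always terminates.

```python
from typing import Callable, Dict, Iterable, Set

Links = Dict[str, Set[str]]
Score = Callable[[str, Set[str], Set[str], Links], float]


def top_n(scores: Dict[str, float], n: int) -> Set[str]:
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return set(ranked[:n])


def collect(ps: Iterable[str], ns: Iterable[str],
            fetch_links: Callable[[str], Iterable[str]],
            r_score: Score, c_score: Score,
            n1: int, n2: int, target: int, max_rounds: int = 100) -> Set[str]:
    positives = set(ps)
    collected = positives | set(ns)                      # S51: S = PS ∪ NS
    links: Links = {}
    fresh = set(collected)
    for _ in range(max_rounds):
        for doc in fresh:                                # S52: extract references
            links[doc] = set(fetch_links(doc))
        pool = set().union(*links.values()) - positives  # T(S) = LT(S) - PS
        first = top_n({d: r_score(d, positives, collected, links)
                       for d in pool}, n1)               # S53: N1
        second = top_n({d: c_score(d, positives, collected, links)
                        for d in pool - first}, n2)      # S54: N2
        candidates = first | second                      # S55: N = N1 ∪ N2
        if not candidates:
            break                                        # assumption: nothing left
        positives |= candidates                          # S56: positive flag on
        fresh = candidates - collected                   # S57: collect new ones
        collected |= fresh
        if len(positives) >= target:                     # S58: enough documents
            break
    return positives
```

Passing the scoring functions as parameters keeps the loop independent of how equations (5) and (6) are actually defined.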

【０１３２】正例文書群ＰＳの文書数が規定された数以
上である場合（ステップＳ５８：Ｙｅｓ）、正例文書群
ＰＳの文書を選別し（ステップＳ５９）、処理を終了す
る。文書の選別処理は、第１実施形態と同様であるため
説明を省略する。If the number of documents in the positive example document group PS is equal to or greater than the specified number (step S58: Yes), the documents in the positive example document group PS are selected (step S59), and the process is terminated. The document selection process is the same as in the first embodiment, and a description thereof will not be repeated.

【０１３３】このようにして、本実施形態によれば、文
書本文の内容を解析することなく、特定分野に関する文
書を精度よく、かつ迅速に収集することが可能となる。
以下、第２実施形態の変形例について説明する。負例文
書群ＮＳは、集めることも難しいため、収集処理の後に
廃棄することをさけて、有効利用することが望ましい。
そこで、第２実施形態の変形例に係わる文書収集装置に
よれば、上記処理で収集した負例文書群ＮＳを有効に利
用することとする。これにより、なるべく独立な、例え
ば、「Ｊａｖａ（登録商標）言語」と「編物」及び「フ
ランス料理」等、複数分野の文書を並行して収集するこ
とを可能とする。そのために、ある分野の文書を収集す
る際、その分野の文書群を正例文書群ＰＳとし、その分
野以外の他の分野の文書群を負例文書群ＮＳとして扱
う。In this way, the present embodiment makes it possible to collect documents on a specific field accurately and quickly without analyzing the content of the document body. A modified example of the second embodiment is described below. Because the negative example document group NS is itself difficult to gather, it is desirable to make effective use of it rather than discard it after the collection processing. The document collection device according to the modified example of the second embodiment therefore makes effective use of the negative example document group NS collected in the above processing. This makes it possible to collect documents in a plurality of fields in parallel, where the fields are as independent of one another as possible, for example "Java (registered trademark) language", "knitting", and "French cuisine". To this end, when documents in a certain field are collected, the document group of that field is treated as the positive example document group PS, and the document groups of the other fields are treated as the negative example document group NS.

【０１３４】文書収集装置の構成は、図１３を用いて説
明した通りであるため、説明を省略する。以下、図１９
を用いて第２実施形態の変形例に係わる文書収集装置で
行う処理について説明する。The configuration of the document collection device is as described with reference to FIG. 13, so its description is omitted. The processing performed by the document collection device according to the modified example of the second embodiment is described below with reference to FIG. 19.

【０１３５】まず、ｎ個の独立な分野の文書群Ｄ_i（ｉ
＝１，２，・・・ｎ）を、検索エンジンやリンク集等か
ら探し出して収集し、文書群Ｄ_iの文書のＵＲＬ、収集
済みフラグ、及び分野を識別する情報である分野識別情
報をＵＲＬテーブル１２０に格納する。第２実施形態の
変形例に係わる文書収集装置では、正例フラグは不要で
ある。文書群Ｄ_iは、分野ｉの初期文書群となる。収集
済み文書群をＤ＝（Ｄ₁、Ｄ₂、・・・、Ｄ_n）とする
（ステップＳ６１）。First, document groups D_i (i = 1, 2, ..., n) for n independent fields are found and collected from search engines, link collections, and the like, and the URLs of the documents in each document group D_i, their collected flags, and field identification information, which is information identifying the field, are stored in the URL table 120. The document collection device according to the modified example of the second embodiment does not need the positive example flag. The document group D_i is the initial document group for field i. The collected document group is D = (D_1, D_2, ..., D_n) (step S61).

【０１３６】まず、参照関係抽出部１０２は、ｉを与え
る（ステップＳ６２）。なお、収集開始時に、参照関係
抽出部１０２は、ｉを１とする。続いて、参照関係抽出
部１０２は、ｉがｎを超えているか否か判定する（ステ
ップＳ６３）。ｉがｎを超えている場合（ステップＳ６
３：Ｙｅｓ）、ステップＳ７１に進む。そうでない場合
（ステップＳ６３：Ｎｏ）、参照関係抽出部１０２は、
分野ｉに対応する文書群Ｄ_iの新規収集文書から（収集
開始時は初期文書群から）、参照関係を抽出し、参照先
文書のＵＲＬをＵＲＬテーブル１２０に、参照関係を参
照関係テーブル１２１にそれぞれ格納する（ステップＳ
６４）。この処理は、第１実施形態と同様である。First, the reference relation extracting unit 102 sets i (step S62). At the start of collection, the reference relation extracting unit 102 sets i to 1. Next, the reference relation extracting unit 102 determines whether i exceeds n (step S63). If i exceeds n (step S63: Yes), the processing proceeds to step S71. Otherwise (step S63: No), the reference relation extracting unit 102 extracts reference relations from the newly collected documents of the document group D_i corresponding to field i (from the initial document group at the start of collection), and stores the URLs of the referenced documents in the URL table 120 and the reference relations in the reference relation table 121 (step S64). This processing is the same as in the first embodiment.

【０１３７】参照度／共参照度算出部２０１は、文書群
Ｄ_iの参照先文書であって、収集済み文書群Ｄに含まれ
ない文書群Ｔ（Ｄ_i）＝ＬＴ（Ｄ_i）−Ｄを次収集範囲と
し、この次収集範囲Ｔ（Ｄ_i）に含まれる文書ｄ∈Ｔ
（Ｄ_i）について、上述の（５）式を用いて参照度Ｒ
_score（ｄ，Ｄ_i，Ｄ）を算出する。次候補判定部１０５
は、参照度Ｒ_score（ｄ，Ｄ_i，Ｄ）が上位ｎ１件に入っ
ている文書群をＮ１_iとする。（ステップＳ６５）。な
お、収集済み文書が含まれる分野は、ＵＲＬテーブル１
２０の分野識別情報を参照することにより判定できる。[0137] The reference degree / co-reference degree calculation unit 201 takes as the next collection range the document group T(D_i) = LT(D_i) - D, that is, the documents referenced by the document group D_i that are not included in the collected document group D. For each document d ∈ T(D_i) in this next collection range, it calculates the reference degree R_score(d, D_i, D) using equation (5) above. The next candidate determination unit 105 takes as N1_i the group of documents whose reference degree R_score(d, D_i, D) ranks in the top n1 (step S65). The field to which a collected document belongs can be determined by referring to the field identification information in the URL table 120.

【０１３８】参照度／共参照度算出部２０１は、次収集
範囲Ｔ（Ｄ_i）からＮ１_iを除いた集合に含まれる文書ｄ
∈Ｔ（Ｄ_i）−Ｎ１_iついて、上述の（６）式を用いて共
参照度Ｃ_score（ｄ，Ｄ_i，Ｄ）を算出する。次候補判定
部１０５は、共参照度Ｃ_score（ｄ，Ｄ_i，Ｄ）が上位ｎ
２件に入っている文書群をＮ２_iとする。（ステップＳ
６６）。For each document d ∈ T(D_i) - N1_i, that is, each document in the set obtained by removing N1_i from the next collection range T(D_i), the reference degree / co-reference degree calculation unit 201 calculates the co-reference degree C_score(d, D_i, D) using equation (6) above. The next candidate determination unit 105 takes as N2_i the group of documents whose co-reference degree C_score(d, D_i, D) ranks in the top n2 (step S66).

【０１３９】次候補判定部１０５は、Ｎ１_i∪Ｎ２_iを分
野ｉについての次収集候補Ｎ_iとする（ステップＳ６
７）。次候補判定部１０５は、ＵＲＬテーブル１２０に
アクセスし、次収集候補Ｎ_iに現在のｉの値に対応した
分類識別情報を付す。文書収集部１０１は、ネットワー
クから次収集候補Ｎ_iを収集する（ステップＳ６８）。
文書収集部１０１は、ＵＲＬテーブル１２０にアクセス
し、収集された次収集候補Ｎ_i（新規収集文書群）の収
集済みフラグを「オン（１）」とする。これにより、文
書収集部１０１は、文書群Ｄ_iに新規収集文書群を加え
て新たな文書群Ｄ_iとする（ステップＳ６９）。[0139] The next candidate determination unit 105 takes N1_i ∪ N2_i as the next collection candidates N_i for field i (step S67). The next candidate determination unit 105 accesses the URL table 120 and attaches to the next collection candidates N_i the classification identification information corresponding to the current value of i. The document collection unit 101 collects the next collection candidates N_i from the network (step S68). The document collection unit 101 accesses the URL table 120 and sets the collected flag of each collected next collection candidate N_i (the newly collected document group) to "ON (1)". The document collection unit 101 thereby adds the newly collected document group to the document group D_i to form a new document group D_i (step S69).

【０１４０】続いて、参照関係抽出部１０２は、ｉを１
インクリメントし（ステップＳ７０）、ステップＳ６３
に戻る。文書収集装置２００は、上述の処理をｉがｎを
超えるまで、処理を繰り返す。Subsequently, the reference relation extracting unit 102 increments i by 1 (step S70) and returns to step S63. The document collection device 200 repeats the above processing until i exceeds n.

【０１４１】ｉがｎを超えると（ステップＳ６３：Ｙｅ
ｓ）、参照関係抽出部１０２は、ＵＲＬテーブル１２０
を参照し、収集済みフラグ及び分野識別情報に基づい
て、各文書群Ｄ_iの文書数を計数し、各文書群Ｄ_iの文書
数が規定された数以上であるか否か判定する（ステップ
Ｓ７１）。文書数が規定数以上でない文書群Ｄ_k（ｋは
１からｎまでの任意の数）がある場合、ステップＳ６２
に戻り、参照関係抽出部１０２は、ｉ＝ｋとしてステッ
プＳ６３以下の処理を繰り返す。When i exceeds n (step S63: Yes), the reference relation extracting unit 102 refers to the URL table 120, counts the number of documents in each document group D_i based on the collected flags and the field identification information, and determines whether the number of documents in each document group D_i is equal to or greater than the specified number (step S71). If there is a document group D_k (k being any number from 1 to n) whose number of documents is less than the specified number, the processing returns to step S62, and the reference relation extracting unit 102 sets i = k and repeats the processing from step S63 onward.

【０１４２】なお、文書数が規定数以上でない文書群Ｄ
_kが複数ある場合、例えば、Ｄ_k1、Ｄ_k2及びＤ_k3がある
場合、ｉ＝ｋ１、ｋ２及びｋ３である場合について、ス
テップＳ６３以下の処理を繰り返す。Ｄ₁からＤ_nまで全
ての収集済み文書群Ｄ_iについて文書数が規定数以上で
ある場合（ステップＳ７１：Ｙｅｓ）、処理を終了す
る。If there are a plurality of document groups D_k whose number of documents is less than the specified number, for example D_k1, D_k2, and D_k3, the processing from step S63 onward is repeated for i = k1, k2, and k3. If the number of documents is equal to or greater than the specified number for all the collected document groups D_i from D_1 to D_n (step S71: Yes), the processing ends.
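
The multi-field processing of steps S61 to S71 can be sketched as a loop in which each field in turn plays the role of the positive example group while the union of all groups plays the role of the collected group D. As before, `fetch_links` and the scoring callables are hypothetical parameters, and `max_rounds` is an added safeguard so that the sketch terminates even when a field can no longer grow.

```python
from typing import Callable, Dict, Iterable, Set

Links = Dict[str, Set[str]]
Score = Callable[[str, Set[str], Set[str], Links], float]


def top_n(scores: Dict[str, float], n: int) -> Set[str]:
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return set(ranked[:n])


def collect_fields(initial: Dict[str, Iterable[str]],
                   fetch_links: Callable[[str], Iterable[str]],
                   r_score: Score, c_score: Score,
                   n1: int, n2: int, target: int,
                   max_rounds: int = 100) -> Dict[str, Set[str]]:
    groups = {field: set(docs) for field, docs in initial.items()}  # S61
    links: Links = {}
    pending = list(groups)                                          # S62
    for _ in range(max_rounds):
        for field in pending:                                       # S63 to S70
            d_i = groups[field]
            d_all = set().union(*groups.values())                   # D
            for doc in d_i:                                         # S64
                if doc not in links:
                    links[doc] = set(fetch_links(doc))
            pool = set().union(*(links[p] for p in d_i)) - d_all    # T(D_i)
            first = top_n({d: r_score(d, d_i, d_all, links)
                           for d in pool}, n1)                      # S65: N1_i
            second = top_n({d: c_score(d, d_i, d_all, links)
                            for d in pool - first}, n2)             # S66: N2_i
            groups[field] = d_i | first | second                    # S67 to S69
        pending = [f for f in groups if len(groups[f]) < target]    # S71
        if not pending:
            break
    return groups
```

Because every other field's documents are already in D, they serve as negative examples for the field being processed without any separate negative example group.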

【０１４３】これにより、ある分野の文書を収集する際
に、その分野の文書群を正例文書群ＰＳとし、他の残り
の分野の文書群の和集合を負例文書群ＮＳとして用いる
ことができるため、負例文書群ＮＳに関する処理が無駄
にならないこととなる。Thus, when documents in a certain field are collected, it is possible to use a document group in that field as a positive document group PS and use the union of documents in other remaining fields as a negative document group NS. Therefore, the processing related to the negative example document group NS is not wasted.

【０１４４】また、第２実施形態の変形例によれば、あ
る分野の文書群Ｄ₁を正例文書群ＰＳとして、その分野
に関する文書を収集する場合に注目すると、負例文書群
ＮＳとして用いられる他の分野の文書群が、正例文書群
ＰＳと比べ大きくなる。さらにまた、負例文書群ＮＳ自
体も他の分野に関する文書群であるため、意味的に一定
している。変形例ではない第２実施形態においてある程
度以上収集が進むと、正例文書群ＰＳが大きくなる一方
で負例文書群ＮＳから正例文書群ＰＳに文書が移される
ことによって、例えば（５）式に示されるＲ
_score（ｄ，ＰＳ，Ｓ）の第２項が大きくなっていくこ
と態が生じうる。これによって、収集の精度が低下する
る可能性があったが、変形例ではその可能性が低くな
る。[0144] Moreover, in the modified example of the second embodiment, consider the case where the document group D_1 of a certain field is used as the positive example document group PS to collect documents on that field. The document groups of the other fields, which serve as the negative example document group NS, are then larger than the positive example document group PS. Furthermore, because the negative example document group NS itself consists of document groups on other fields, it is semantically stable. In the second embodiment without the modification, once collection has progressed beyond a certain point, the positive example document group PS grows while documents are moved from the negative example document group NS to the positive example document group PS, so that, for example, the second term of R_score(d, PS, S) in equation (5) can keep increasing. This could lower the accuracy of collection, but the modified example makes that less likely.

【０１４５】以下、図２０及び図２１を用いて、第２実
施形態に係わる文書収集装置において特定分野に関する
文書を収集する精度について説明する。図２０に、ネッ
トワークから収集した約６７０万ＵＲＬの文書を全体集
合Ｄとし、ＵＲＬに「Ｌｉｎｕｘ」を含む15,000ＵＲＬ
を正解例Ｌとし、任意に選択した約5,000ＵＲＬを正例
文書群ＰＳそれ以外のＵＲＬ（Ｄ−ＰＳ）を負例文書群
ＮＳを初期文書として、文書収集装置の収集精度を実験
した結果を示す。The accuracy with which the document collection device according to the second embodiment collects documents on a specific field is described below with reference to FIGS. 20 and 21. FIG. 20 shows the result of an experiment on the collection accuracy of the document collection device, in which the documents at about 6.7 million URLs collected from the network were taken as the whole set D, the 15,000 URLs containing "Linux" in the URL were taken as the correct answer set L, and the initial documents consisted of about 5,000 arbitrarily selected URLs as the positive example document group PS and the remaining URLs (D - PS) as the negative example document group NS.

【０１４６】図２０において、横軸に収集のくり返し回
数ｉ、縦軸に適合率又は再現率を示す。再現率を折れ
線、適合率を四角プロットで示す。ここで、ｉ回目の繰
り返しで得られた正例集合Ｓ_iについての適合率及び再
現率は、以下（７）式及び（８）式で示される。In FIG. 20, the horizontal axis shows the number of collection iterations i, and the vertical axis shows precision or recall. Recall is plotted as a line, and precision as square markers. The precision and recall for the positive example set S_i obtained at the i-th iteration are given by equations (7) and (8) below.

【０１４７】適合率＝｜Ｓ_i∩Ｌ｜／｜Ｓ_i｜・・・・（７）再現率＝｜Ｓ_i∩Ｌ｜／｜Ｌ｜・・・・（８）つまり、適合率は、正例集合Ｓ_i中の正例文書群Ｓに含
まれる正解例Ｌの割合であり、対象としている分野に含
まれない文書（いわゆるゴミ）の少なさを示す。再現率
は、正解例Ｌ中の正例文書群Ｓ_iに含まれる正解例Ｌの
割合であり、対象としている分野に含まれる文書が収集
されないこと（いわゆる漏れ）の少なさを示す。図２０
に示すように、繰り返し回数が７３回程度になると、再
現率が急激に低下するが、数十回の繰り返しでは、適合
率、再現率とも良好であることが分かる。なお、繰り返
し回数が７３回程度になると再現率が低下する原因は、
所謂ゴミがゴミをよぶためであると考えられる。
Precision = |S_i ∩ L| / |S_i| ... (7)
Recall = |S_i ∩ L| / |L| ... (8)
That is, precision is the proportion of the positive example set S_i that belongs to the correct answer set L, and indicates how few documents outside the target field (so-called garbage) are included. Recall is the proportion of the correct answer set L that is included in the positive example set S_i, and indicates how few documents in the target field fail to be collected (so-called omissions). As FIG. 20 shows, recall drops sharply when the number of iterations reaches about 73, but both precision and recall are good over several tens of iterations. The drop in recall at about 73 iterations is thought to occur because garbage attracts more garbage.
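
Equations (7) and (8) translate directly into set operations. The following is a minimal sketch; the function names are illustrative, and returning 0.0 for an empty denominator is a convention assumed here rather than something the description specifies.

```python
from typing import Iterable


def precision(collected: Iterable[str], correct: Iterable[str]) -> float:
    """Equation (7): |S_i ∩ L| / |S_i|."""
    s_i, l = set(collected), set(correct)
    return len(s_i & l) / len(s_i) if s_i else 0.0


def recall(collected: Iterable[str], correct: Iterable[str]) -> float:
    """Equation (8): |S_i ∩ L| / |L|."""
    s_i, l = set(collected), set(correct)
    return len(s_i & l) / len(l) if l else 0.0
```

For example, with a collected set of four documents of which two are in a correct answer set of three, precision is 2/4 and recall is 2/3.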

【０１４８】図２１に、ＵＲＬに「What's New」を含む
14,000ＵＲＬを正解例Ｌとした場合に、同様の実験を行
った結果を示す。図２１に示すように、繰り返し回数が
数回程度になると急激に適合率が低下している。これ
は、What's Newのようなコンテンツは、互いにあまり意
味的な関連（つながり）が無いためと考えられる。FIG. 21 shows the result of the same experiment performed with the 14,000 URLs containing "What's New" in the URL as the correct answer set L. As FIG. 21 shows, precision drops sharply after only a few iterations. This is thought to be because content such as What's New pages has little semantic relation (connection) to one another.

【０１４９】図２０に示す実験結果から、本実施形態に
係わる文書収集装置によれば意味的に関連する文書群を
効率よく収集することができることが分かる。上述にお
いて説明した各サーバ及び各端末は、図２２に示すよう
な情報処理装置（コンピュータ）を用いて構成すること
ができる。図２２の情報処理装置３００は、ＣＰＵ３０
１、メモリ３０２、入力装置３０３、出力装置３０４、
外部記憶装置３０５、媒体駆動装置３０６、及びネット
ワーク接続装置３０７を備え、それらはバス３０８によ
り互いに接続されている。The experimental results shown in FIG. 20 indicate that the document collection device according to the present embodiment can efficiently collect a semantically related document group. Each server and each terminal described above can be configured using an information processing device (computer) such as that shown in FIG. 22. The information processing device 300 of FIG. 22 includes a CPU 301, a memory 302, an input device 303, an output device 304, an external storage device 305, a medium drive device 306, and a network connection device 307, which are connected to one another by a bus 308.

【０１５０】メモリ３０２は、例えば、ＲＯＭ（Read O
nly Memory）、ＲＡＭ（Random Access Memory）等を含
み、処理に用いられるプログラムとデータを格納する。
ＣＰＵ３０１は、メモリ３０２を利用してプログラムを
実行することにより、必要な処理を行う。The memory 302 includes, for example, a ROM (Read Only Memory) and a RAM (Random Access Memory), and stores the programs and data used for processing. The CPU 301 performs the necessary processing by executing the programs using the memory 302.

【０１５１】上述の各サーバ及び各端末を構成する各機
器及び各部は、それぞれメモリ３０２の特定のプログラ
ムコードセグメントにプログラムとして格納される。入
力装置３０３は、例えば、キーボード、ポインティング
デバイス、タッチパネル等であり、ユーザからの指示や
情報の入力に用いられる。出力装置３０４は、例えば、
ディスプレイやプリンタ等であり、情報処理装置３００
の利用者への問い合わせ、処理結果等の出力に用いられ
る。The devices and units constituting each server and each terminal described above are each stored as a program in a specific program code segment of the memory 302. The input device 303 is, for example, a keyboard, a pointing device, or a touch panel, and is used to input instructions and information from the user. The output device 304 is, for example, a display or a printer, and is used to output inquiries to the user of the information processing device 300, processing results, and the like.

【０１５２】外部記憶装置３０５は、例えば、磁気ディ
スク装置、光ディスク装置、光磁気ディスク装置等であ
る。この外部記憶装置３０５に上述のプログラムとデー
タを保存しておき、必要に応じて、それらをメモリ３０
２にロードして使用することもできる。The external storage device 305 is, for example, a magnetic disk device, an optical disk device, or a magneto-optical disk device. The above-described programs and data can be stored in the external storage device 305 and loaded into the memory 302 for use as needed.

【０１５３】媒体駆動装置３０６は、可搬記録媒体３０
９を駆動し、その記録内容にアクセスする。可搬記録媒
体３０９としては、メモリカード、メモリスティック、
フロッピー（登録商標）ディスク、ＣＤ−ＲＯＭ（Comp
act Disc Read Only Memory）、光ディスク、光磁気デ
ィスク、ＤＶＤ（Digital Versatile Disk）等、任意の
情報処理装置で読み取り可能な記録媒体が用いられる。
この可搬記録媒体３０９に上述のプログラムとデータを
格納しておき、必要に応じて、それらをメモリ３０２に
ロードして使用することもできる。The medium drive device 306 drives a portable recording medium 309 and accesses its recorded contents. Any recording medium readable by an information processing device can be used as the portable recording medium 309, such as a memory card, a memory stick, a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), an optical disk, a magneto-optical disk, or a DVD (Digital Versatile Disk). The above-described programs and data can be stored in the portable recording medium 309 and loaded into the memory 302 for use as needed.

【０１５４】ネットワーク接続装置３０７は、ＬＡＮ、
ＷＡＮ等の任意のネットワーク（回線）を介して外部の
装置を通信し、通信に伴なうデータ変換を行う。また、
必要に応じて、上述のプログラムとデータを外部の装置
から受け取り、それらをメモリ３０２にロードして使用
することもできる。The network connection device 307 communicates with external devices via an arbitrary network (line) such as a LAN or a WAN, and performs the data conversion that accompanies the communication. The above-described programs and data can also be received from an external device as needed and loaded into the memory 302 for use.

【０１５５】図２３は、図２２の情報処理装置３００に
プログラムとデータを供給することのできる情報処理装
置で読み取り可能な記録媒体及び伝送信号を示してい
る。なお、本発明は、情報処理装置により使用されたと
きに、上述の本発明の実施形態の各構成によって実現さ
れる機能と同様の機能を情報処理装置に行わせるための
情報処理装置で読み出し可能な記録媒体３０９として構
成することもできる。FIG. 23 shows recording media readable by an information processing device, and a transmission signal, that can supply programs and data to the information processing device 300 of FIG. 22. The present invention can also be configured as a recording medium 309 readable by an information processing device which, when used by the information processing device, causes it to perform functions equivalent to those realized by the configurations of the embodiments of the present invention described above.

【０１５６】実施形態において各装置により行なわれる
処理と同様のものを情報処理装置に行なわせるプログラ
ムを、情報処理装置で読み取り可能な記録媒体３０９に
予め記憶させておき、図２３に示すようにしてその記録
媒体３０９からそのプログラムを情報処理装置３００に
読み出させてその情報処理装置３００のメモリ３０２や
外部記憶装置３０５に一旦格納させ、その情報処理装置
３００の有するＣＰＵ３０１にこの格納されたプログラ
ムを読み出させて実行させる。A program that causes an information processing device to perform processing equivalent to that performed by each device in the embodiments is stored in advance in a recording medium 309 readable by the information processing device. As shown in FIG. 23, the information processing device 300 reads the program from the recording medium 309 and temporarily stores it in the memory 302 or the external storage device 305 of the information processing device 300, and the CPU 301 of the information processing device 300 then reads out and executes the stored program.

【０１５７】また、プログラム（データ）提供者３１０
から情報処理装置３００にプログラムをダウンロードす
る際に回線３１１（伝送媒体）を介して伝送される伝送
信号自体も、上述した本発明の実施形態において説明し
た各装置に相当する機能を汎用的な情報処理装置で行な
わせることのできるものである。The transmission signal itself, which is transmitted via the line 311 (transmission medium) when a program is downloaded from the program (data) provider 310 to the information processing device 300, can likewise cause a general-purpose information processing device to perform the functions corresponding to each device described in the above embodiments of the present invention.

【０１５８】以上、本発明の実施形態について説明した
が、本発明は上述した実施形態に限定されるものではな
く、他の様々な変更が可能である。例えば、第１実施形
態に係わる文書収集装置１００と第２実施形態に係わる
文書収集装置２００とを組みせるように構成ことによ
り、コミュニティ向けに分野別に文書を収集させること
としてもよい。Although embodiments of the present invention have been described above, the present invention is not limited to those embodiments, and various other modifications are possible. For example, the document collection device 100 according to the first embodiment and the document collection device 200 according to the second embodiment may be combined so that documents for a community are collected field by field.

【０１５９】また、文書収集装置１００又は２００を構
成する各部及び各ＤＢは、お互いに連携して動作するこ
とにより一連のビジネスプロセスを実現する。これら各
部及び各ＤＢは同じサーバに設けられてもよいし、異な
るサーバに設けられネットワークを介して連携して動作
することとしてもよい。Further, each unit and each DB constituting the document collection device 100 or 200 operate in cooperation with each other to realize a series of business processes. These units and DBs may be provided in the same server, or may be provided in different servers and operate in cooperation via a network.

【０１６０】（付記１）ネットワークから文書を収集
する文書収集方法であって、前記文書の参照関係に基づ
いて、前記ネットワーク上のコミュニティ内から文書を
所定数以上収集し、前記コミュニティから前記第１の所
定数以上の文書を収集した後、収集済み文書の参照関係
に基づいて、前記コミュニティ内外から文書を収集す
る、ことを特徴とする文書収集方法。(Supplementary Note 1) A document collection method for collecting documents from a network, comprising: collecting a predetermined number of documents or more from within a community on the network based on reference relations of the documents; and, after the first predetermined number of documents or more have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

【０１６１】（付記２）前記収集済み文書群の参照関
係及びネットワーク上の場所を示す情報に基づいて前記
収集済み文書の重要さの度合いを示す重要度を算出し、
前記参照関係及び前記重要度に基づいて、収集すべき文
書を決定する、ことを特徴とする付記１記載の文書収集
方法。(Supplementary Note 2) The document collection method according to Supplementary Note 1, wherein an importance indicating the degree of importance of each collected document is calculated based on the reference relations of the collected document group and on information indicating location on the network, and the documents to be collected are determined based on the reference relations and the importance.

[0162] (Supplementary Note 3) The document collection method according to Supplementary Note 2, wherein the documents to be collected are determined separately for inside and outside the community.

[0163] (Supplementary Note 4) The document collection method according to Supplementary Note 3, wherein results of searching the collected document group are presented separately for inside and outside the community.

[0164] (Supplementary Note 5) The document collection method according to Supplementary Note 2, wherein whether a document is a document within the community is determined based on information indicating its location on the network.

[0165] (Supplementary Note 6) A document collection method for collecting documents from a network, comprising: providing a positive example document group, which is a group of documents relating to a certain field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; determining documents to be collected relating to said field based on reference relations of the positive example document group and the negative example document group; and collecting the documents to be collected from the network.

[0166] (Supplementary Note 7) The document collection method according to Supplementary Note 6, wherein a reference degree indicating the degree to which a document is referenced only from documents of the positive example document group is calculated based on the reference relations, and a document with a high reference degree is determined as a document to be collected.
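
As a non-limiting illustration, the selection of Supplementary Note 7 could be sketched as follows. The scoring rule used here (the number of referencing positive example documents minus the number of referencing negative example documents) is an assumption made for this sketch and is not the formula of the embodiments; the names `reference_degrees`, `next_candidates`, `links`, and `top_n` are likewise hypothetical.

```python
from collections import Counter


def reference_degrees(links, positive, negative):
    """Score uncollected documents by who references them.

    links:    dict mapping a source URL to an iterable of target URLs
    positive: set of positive example document URLs
    negative: set of negative example document URLs

    Assumed score: (# positive referrers) - (# negative referrers), so a
    document referenced only from positive examples scores highest.
    """
    pos_count = Counter()
    neg_count = Counter()
    for src in positive:
        for target in set(links.get(src, ())):
            pos_count[target] += 1
    for src in negative:
        for target in set(links.get(src, ())):
            neg_count[target] += 1
    known = set(positive) | set(negative)
    # Only documents referenced by at least one positive example are scored.
    return {t: pos_count[t] - neg_count[t] for t in pos_count if t not in known}


def next_candidates(links, positive, negative, top_n):
    """Return the top_n documents with the highest assumed reference degree."""
    scores = reference_degrees(links, positive, negative)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by URL
    return ranked[:top_n]
```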

[0167] (Supplementary Note 8) The document collection method according to Supplementary Note 6, wherein, based on the reference relations, for each document referenced from collected documents that reference documents of the positive example document group, a co-reference degree indicating the degree to which that document is referenced from collected documents is calculated, and a document with a high co-reference degree is determined as a document to be collected.
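
As a non-limiting illustration, the co-reference degree of Supplementary Note 8 could be sketched as follows. Counting the number of distinct referring collected documents is an assumption made for this sketch, not the formula of the embodiments, and the names `co_reference_degrees`, `links`, and `collected` are hypothetical.

```python
from collections import Counter


def co_reference_degrees(links, collected, positive):
    """Count co-references for documents cited alongside positive examples.

    links:     dict mapping a source URL to an iterable of target URLs
    collected: iterable of collected document URLs
    positive:  set of positive example document URLs

    A collected document that references at least one positive example is
    treated as a referrer; every other document it references gains one
    point per such referrer.
    """
    positive = set(positive)
    collected = set(collected)
    degree = Counter()
    for doc in collected:
        targets = set(links.get(doc, ()))
        if not (targets & positive):
            continue  # this document does not reference any positive example
        for target in targets:
            if target not in positive and target not in collected:
                degree[target] += 1
    return dict(degree)
```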

[0168] (Supplementary Note 9) The document collection method according to Supplementary Note 6, wherein the negative example document group is a union of document groups relating to a plurality of fields.

[0169] (Supplementary Note 10) The document collection method according to Supplementary Note 1, wherein the collected document group is grouped based on reference expressions used in the collected documents.

[0170] (Supplementary Note 11) The document collection method according to Supplementary Note 1, wherein keywords are assigned to the collected documents based on reference expressions used in the collected documents.

[0171] (Supplementary Note 12) The document collection method according to Supplementary Note 11, wherein a reference expression that is used irrespective of the referenced document is not used as a keyword.

[0172] (Supplementary Note 13) The document collection method according to Supplementary Note 11, wherein the number of distinct documents referenced by the reference expression is counted, and when the number of distinct documents is equal to or greater than a certain number, the reference expression is not used as a keyword.

[0173] (Supplementary Note 14) The document collection method according to Supplementary Note 11, wherein, when the number of distinct documents is less than the certain number, a reference count, which is the number of times each collected document is referenced by the reference expression, is counted, and whether the reference expression is to be used as a keyword is determined based on the number of distinct documents and the reference count.
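
As a non-limiting illustration, the keyword decision of Supplementary Notes 12 to 14 could be sketched as follows. The thresholds `max_targets` and `min_refs`, the case-insensitive normalization of reference expressions, and the specific rule "keep the expression for a document when it references that document at least `min_refs` times" are assumptions made for this sketch; the specification states only that the decision is based on the number of distinct documents and the reference count.

```python
from collections import defaultdict


def assign_keywords(references, max_targets=3, min_refs=2):
    """Assign reference expressions to documents as keywords.

    references: iterable of (reference_expression, target_url) pairs, one
                pair per reference found in the collected documents.
    """
    # expression -> target document -> number of references (reference count)
    counts = defaultdict(lambda: defaultdict(int))
    for expression, target in references:
        counts[expression.strip().lower()][target] += 1

    keywords = defaultdict(set)
    for expression, per_target in counts.items():
        # An expression pointing at many distinct documents (such as a
        # generic "click here") is used irrespective of the target: skip it.
        if len(per_target) >= max_targets:
            continue
        for target, count in per_target.items():
            if count >= min_refs:
                keywords[target].add(expression)
    return {target: sorted(words) for target, words in keywords.items()}
```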

[0174] (Supplementary Note 15) The document collection method according to Supplementary Note 11, wherein the keywords based on the reference expressions are combined with keywords extracted from the body of the collected document and keywords extracted from information indicating the location of the collected document on the network.

[0175] (Supplementary Note 16) A search method for searching for documents belonging to a community on a network, comprising: transmitting information for searching for documents to a server; and receiving documents retrieved on the basis of that information, divided into those inside and those outside the community, together with information indicating their degree of importance to the community.
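
As a non-limiting illustration, the division of search results described in Supplementary Note 16 could be sketched as follows. The function name `split_results`, the representation of a result as a `(url, importance)` pair, and the use of an explicit set of community host names are assumptions made for this sketch.

```python
from urllib.parse import urlparse


def split_results(results, community_hosts):
    """Divide search results into those inside and outside the community.

    results:         iterable of (url, importance) pairs
    community_hosts: set of host names assumed to make up the community
    Each group is sorted in descending order of importance.
    """
    grouped = {"inside": [], "outside": []}
    for url, importance in results:
        host = urlparse(url).hostname or ""
        key = "inside" if host in community_hosts else "outside"
        grouped[key].append((url, importance))
    for hits in grouped.values():
        hits.sort(key=lambda hit: -hit[1])
    return grouped
```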

[0176] (Supplementary Note 17) A document collection device for collecting documents from a network, comprising: next candidate determining means for determining, based on reference relations of the documents, a next collection candidate that is a candidate for a document to be collected next; community determining means for determining, based on information indicating the location of a document on the network, whether the document is a document within a community on the network; and document collection means for collecting the next collection candidate from the network, wherein the document collection means collects at least a predetermined number of documents from within the community and thereafter collects documents from inside and outside the community.

[0177] (Supplementary Note 18) A document collection device for collecting documents from a network, comprising: next candidate determining means for determining a next collection candidate, which is a candidate for a document to be collected next, based on reference relations of a positive example document group, which is a group of documents relating to a certain field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; and document collection means for collecting the next collection candidate from the network.

[0178] (Supplementary Note 19) A computer-readable recording medium on which is recorded a program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: collecting at least a predetermined number of documents from within a community on the network based on reference relations of the documents; and, after at least the first predetermined number of documents have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

[0179] (Supplementary Note 20) A computer-readable recording medium on which is recorded a program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: determining documents to be collected relating to a certain field based on reference relations of a positive example document group, which is a group of documents relating to said field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; and collecting the documents to be collected from the network.

[0180] (Supplementary Note 21) A computer data signal embodied in a carrier wave and representing a program that causes a computer to perform control for collecting documents from a network, the program causing the computer to execute: collecting at least a predetermined number of documents from within a community on the network based on reference relations of the documents; and, after at least the first predetermined number of documents have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

(Supplementary Note 22) A computer program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: collecting at least a predetermined number of documents from within a community on the network based on reference relations of the documents; and, after at least the first predetermined number of documents have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

[0181] (Supplementary Note 23) A computer program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: providing a positive example document group, which is a group of documents relating to a certain field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; determining documents to be collected relating to said field based on reference relations of the positive example document group and the negative example document group; and collecting the documents to be collected from the network.

[0182]

[Effects of the Invention] As described in detail above, when collecting documents for a particular application, the present invention determines the documents to be collected based on the reference relations between documents and collects the determined documents. This makes it possible to quickly select and collect documents suited to the application without depending on language.

[0183] In addition, grouping the collected documents and assigning keywords to each collected document based on reference expressions makes the collected documents easier to access. Furthermore, because the content of the document body is not analyzed, keywords can be assigned quickly and independently of language.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the principle of the present invention.

FIG. 2 is a configuration diagram of a document collection device according to the first embodiment.

FIG. 3 is a diagram illustrating an example of the data structure of a URL table.

FIG. 4 is a diagram illustrating an example of the data structure of a reference relation table.

FIG. 5 is a diagram illustrating an example of the data structure of a reference expression table.

FIG. 6 is a diagram illustrating an example of the data structure of a reference count table.

FIG. 7 is a flowchart illustrating the general flow of processing performed by the document collection device according to the first embodiment.

FIG. 8 is a flowchart illustrating the process of determining the next collection candidate when collecting documents within a community.

FIG. 9 is a flowchart illustrating the process of ranking collected documents and referenced documents.

FIG. 10 is a flowchart illustrating the process of selecting collected documents.

FIG. 11 is a flowchart illustrating the keyword assignment process.

FIG. 12 is a diagram illustrating an example of a screen that presents collected documents.

FIG. 13 is a configuration diagram of a document collection device according to the second embodiment.

FIG. 14 is a diagram illustrating the document reference relations denoted by LT(S), LT(p), LS(d, X), and LS(A, X).

FIG. 15 is a diagram illustrating the document reference relations denoted by CC(d, A, X).

FIG. 16 is a flowchart illustrating processing performed by the document collection device according to the second embodiment.

FIG. 17 is a diagram illustrating the reference relations denoted by each set in the expression for calculating the reference degree.

FIG. 18 is a diagram illustrating the reference relations denoted by each set in the expression for calculating the co-reference degree.

FIG. 19 is a flowchart illustrating processing performed by a document collection device according to a modification of the second embodiment.

FIG. 20 is a diagram (part 1) illustrating experimental results for the collection accuracy of the document collection device.

FIG. 21 is a diagram (part 2) illustrating experimental results for the collection accuracy of the document collection device.

FIG. 22 is a configuration diagram of an information processing apparatus.

FIG. 23 is a diagram illustrating a recording medium, a transmission signal, and a transmission medium that supply programs and data to the information processing apparatus.

[Explanation of symbols]

1, 100, 200: Document collection device
2: Document collection means
3: Reference relation extracting means
4: Community determining means
5: Next candidate determining means
6: Ranking means
7: URL determination means
8: Reference degree / co-reference degree calculation means
9: Grouping means
10: Keyword assignment means
20: Collected document group
21: Next collection candidate
22: Inter-document reference relations
23: Collected document file
101: Document collection unit
102: Reference relation extraction unit
103: Community determination unit
104: Candidate determination unit
105: Ranking unit
106: Grouping unit
107: Keyword assignment unit
120: URL table
121: Reference relation table
122: Reference expression table
123: Reference count table
130: High-quality content
140: Search engine
141: Index
150: Classification engine
160: Server
170: Browser
180, 181, 182: Screen
201: Reference degree / co-reference degree table
210: High-quality content by field
300: Information processing apparatus
301: CPU
302: Memory
303: Input device
304: Output device
305: External storage device
306: Medium drive device
307: Network connection device
308: Bus
309: Portable recording medium
310: Program (data) provider
311: Line

Claims

[Claims]

1. A document collection method for collecting documents from a network, comprising: collecting at least a predetermined number of documents from within a community on the network based on reference relations of the documents; and, after at least the first predetermined number of documents have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

2. The document collection method according to claim 1, wherein an importance indicating the degree of importance of each collected document is calculated based on the reference relations of the collected document group and information indicating locations on the network, and a document to be collected is determined based on the reference relations and the importance.

3. The document collection method according to claim 2, wherein the documents to be collected are determined separately for inside and outside the community.

4. The document collection method according to claim 3, wherein results of searching the collected document group are presented separately for inside and outside the community.

5. A document collection method for collecting documents from a network, comprising: providing a positive example document group, which is a group of documents relating to a certain field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; determining documents to be collected relating to said field based on reference relations of the positive example document group and the negative example document group; and collecting the documents to be collected from the network.

6. The document collection method according to claim 5, wherein a reference degree indicating the degree to which a document is referenced only from documents of the positive example document group is calculated based on the reference relations, and a document with a high reference degree is determined as a document to be collected.

7. The document collection method according to claim 5, wherein, based on the reference relations, for each document referenced from collected documents that reference documents of the positive example document group, a co-reference degree indicating the degree to which that document is referenced from collected documents is calculated, and a document with a high co-reference degree is determined as a document to be collected.

8. The document collection method according to claim 5, wherein the negative example document group is a union of document groups relating to a plurality of fields.

9. The document collection method according to claim 1, wherein the collected document group is grouped based on reference expressions used in the collected documents.

10. The document collection method according to claim 1, wherein keywords are assigned to the collected documents based on reference expressions used in the collected documents.

11. A computer program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: collecting at least a predetermined number of documents from within a community on the network based on reference relations of the documents; and, after at least the first predetermined number of documents have been collected from the community, collecting documents from inside and outside the community based on reference relations of the collected documents.

12. A computer program that, when executed by a computer, causes the computer to perform control for collecting documents from a network, the control including: providing a positive example document group, which is a group of documents relating to a certain field, and a negative example document group, which is a group of documents relating to fields having little relevance to said field; determining documents to be collected relating to said field based on reference relations of the positive example document group and the negative example document group; and collecting the documents to be collected from the network.