JP2008217047A

JP2008217047A - Method, system and program for dividing web document

Info

Publication number: JP2008217047A
Application number: JP2007049197A
Authority: JP
Inventors: Toshio Ikeda; 利夫池田
Original assignee: Kansai Electric Power Co Inc
Current assignee: Kansai Electric Power Co Inc
Priority date: 2007-02-28
Filing date: 2007-02-28
Publication date: 2008-09-18
Anticipated expiration: 2027-02-28
Also published as: JP4700637B2

Abstract

PROBLEM TO BE SOLVED: To divide a document on a Web site into appropriate units of block documents according to a type of the document. SOLUTION: A document dividing process part 13 includes: a first dividing part 135 for creating block documents by dividing a document on a Web site into units of blank-line tags; a character counting part 136 for counting the number of all the characters of the block documents and the number of hyperlinked characters; a determination part 137 for calculating the content rate of the number of hyperlinked characters in the block documents and determining whether or not the block documents should be subdivided according to the content rate; and a second dividing part 138 for subdividing the block documents when the determination part 137 determines that they should be subdivided. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a Web document dividing method, system, and program for dividing a document on a Web site into appropriate sections.

Document update determination systems that automatically determine whether a new document has been posted on a WWW (World Wide Web) site on the Internet have been studied. Such a system compares two documents acquired at different times from the same Web site (Web page) and determines whether they are the same document. Known determination methods include full-text comparison, hash value comparison, morphological analysis, time information acquisition, and ciphertext comparison.

For such document comparison, accurate update determination depends on dividing a document on a Web site into appropriate block document units and comparing corresponding block documents with each other. As a conventional method for extracting a document (character string) to be compared, Patent Document 1, for example, discloses a method of extracting a specific character string from character string information, such as e-mail, stored in a memory. Patent Document 2 discloses a method for acquiring a specific WWW document from a WWW site.
Patent Document 1: Japanese Patent Laid-Open No. 11-272703
Patent Document 2: Japanese Patent No. 2867986

However, none of the methods disclosed in the above patent documents gives sufficient consideration to the characteristics of documents on a Web site. Documents on a Web site include text data composed mainly of plain characters, text data that also contains hyperlinked portions, and text data composed almost entirely of hyperlinked portions. No method has been proposed for dividing a document on a Web site into appropriate block document units according to these characteristics, and as a result accurate document update determination has not been possible.

The present invention has been made in view of these circumstances. Its object is to provide a Web document dividing method, system, and program capable of dividing a document on a Web site into appropriate block document units according to the type of the document.

The Web document dividing method according to claim 1 of the present invention is characterized by: creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; obtaining the ratio of the number of link characters, that is, characters to which a second tag for a hyperlink to another Web site is applied, to the number of all characters included in the block document; and, when the ratio of link characters is higher than a predetermined first threshold, subdividing the block document in units of a third tag indicating a line break included in the block document.

According to this configuration, the document on the Web site is first divided at blank lines to create block documents, and the content rate of hyperlink characters in each block document is then obtained. When the content rate is high, the block document is likely to be a list, such as a series of linked news titles. Such a block document is presumed to contain several items with different contents, so subdividing it in units of the third tag indicating a line break separates it into appropriate document units. When the content rate is low, the block document is likely to be a single news article, blog entry, bulletin board post, or the like. Dividing such a block document in units of the third tag would fragment one coherent document, so no subdivision is performed in this case. A document on a Web site can therefore be divided into appropriate units according to its content.
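The method of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular expressions, the function names `link_ratio` and `divide`, and the treatment of two or more consecutive `<br>` tags as the blank-line tag are choices made here, and the 30% default is borrowed from the preferred value stated later for the first threshold. The sketch also counts every `<a>` element as a hyperlink, without checking whether it points to another Web site.

```python
import re

# Assumed tag patterns (hypothetical; the claim only names the tag roles).
BLANK_LINE = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)         # first tag
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)  # second tag
LINE_BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)                    # third tag
ANY_TAG = re.compile(r"<[^>]+>")


def visible_len(html):
    """Number of characters left once all tags are stripped."""
    return len(ANY_TAG.sub("", html))


def link_ratio(block):
    """Ratio of hyperlinked characters to all characters in a block."""
    total = visible_len(block)
    if total == 0:
        return 0.0
    linked = sum(visible_len(m.group(1)) for m in ANCHOR.finditer(block))
    return linked / total


def divide(document, first_threshold=0.30):
    """Split at blank lines, then subdivide link-heavy blocks at line breaks."""
    units = []
    for block in BLANK_LINE.split(document):
        if not block.strip():
            continue
        if link_ratio(block) > first_threshold:
            units.extend(p for p in LINE_BREAK.split(block) if p.strip())
        else:
            units.append(block)  # coherent text: keep the block whole
    return units


doc = ('Storm hits coast.<br>Thousands lose power.<br><br>'
       '<a href="/a">Headline one</a><br><a href="/b">Headline two</a>')
for unit in divide(doc):
    print(unit)
```

In this example the two-line article stays together as one unit, while the block of linked headlines is separated into one unit per line.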

The Web document dividing method according to claim 2 is characterized by: creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; obtaining the ratio of the number of link characters, to which a second tag for a hyperlink to another Web site is applied, to the number of all characters included in the block document; when the ratio of link characters is higher than a predetermined first threshold, subdividing the block document in units of a third tag indicating a line break included in the block document; and, when the ratio of link characters is higher still than a predetermined second threshold that is higher than the first threshold, subdividing the block document in units of the second tag included in the block document.

According to this configuration, in addition to the subdivision condition described above, when the content rate of hyperlink characters exceeds a predetermined second threshold that is higher than the first threshold, the block document is subdivided in units of the second tag for hyperlinks. A document with a very high content rate of hyperlink characters is likely to be a block document in which several link destinations are arranged on a single line. Subdividing such a block document in units of the third tag indicating a line break may fail to separate it into appropriate document units, whereas subdividing it in units of the second tag separates it appropriately.
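The difference between the two subdivision keys can be illustrated with a short Python sketch. The function names and the rule of cutting immediately before each opening `<a>` tag are assumptions made for this example; the text above does not specify what happens to separator characters that sit between links, so here they simply stay attached to the preceding link.

```python
import re

LINE_BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)   # third tag
BEFORE_ANCHOR = re.compile(r"(?=<a\b)", re.IGNORECASE)  # position before each second tag


def split_by_line_break(block):
    """Subdivide in units of the line-break tag."""
    return [p.strip() for p in LINE_BREAK.split(block) if p.strip()]


def split_by_anchor(block):
    """Subdivide in units of the hyperlink tag (one unit per <a> element)."""
    return [p.strip() for p in BEFORE_ANCHOR.split(block) if p.strip()]


# Several link destinations arranged on a single line, with no <br> between them.
nav = '<a href="/a">Home</a> | <a href="/b">News</a> | <a href="/c">Blog</a>'

print(split_by_line_break(nav))  # one unit: the whole line
print(split_by_anchor(nav))      # three units, one per link
```

Splitting on the zero-width lookahead relies on `re.split` accepting empty matches, which Python supports from version 3.7.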

In either of the above configurations, the markup language describing the Web site is preferably HTML (claim 3). With this configuration, a Web site document can be divided simply and appropriately by using the <br> tag, the <a> tag, and the like.

The Web document dividing system according to claim 4 comprises: first document dividing means for creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; counting means for counting the number of all characters included in the block document and the number of link characters to which a second tag for a hyperlink to another Web site is applied; determination means for obtaining the ratio of the number of link characters to the number of all characters and determining, based on that ratio, whether to subdivide the block document; and second document dividing means for subdividing the block document when the determination means determines that it should be subdivided. The determination means determines that the block document is to be subdivided when the ratio of link characters is higher than a predetermined first threshold, and the second document dividing means subdivides the block document in units of a third tag indicating a line break included in the block document.

In this case, the first threshold is preferably 30% (claim 6). With this configuration, lists such as linked news titles can be distinguished from single news articles and the like with high probability.

The Web document dividing system according to claim 5 comprises: first document dividing means for creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; counting means for counting the number of all characters included in the block document and the number of link characters to which a second tag for a hyperlink to another Web site is applied; determination means for obtaining the ratio of the number of link characters to the number of all characters and determining, based on that ratio, whether to subdivide the block document; and second document dividing means for subdividing the block document when the determination means determines that it should be subdivided. The determination means determines that the block document is to be subdivided both when the ratio of link characters is a first content rate, higher than a predetermined first threshold, and when it is a second content rate, higher still than a predetermined second threshold that is higher than the first threshold. The second document dividing means subdivides the block document in units of a third tag indicating a line break included in the block document in the case of the first content rate, and in units of the second tag included in the block document in the case of the second content rate.

In this case, the first threshold is preferably 30% and the second threshold is preferably 85% (claim 7). With this configuration, lists such as linked news titles can be distinguished from single news articles and the like with high probability. Block documents in which several link destinations are arranged on a single line can also be identified with high probability.

In any of the above configurations, the system preferably further comprises character conversion means for converting the second tag and the third tag into special characters (claim 8). With this configuration, erroneous determination can be prevented even when the block document itself contains characters identical to the tag characters.
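One way to picture the character conversion is the following Python sketch. The choice of control characters as the "special characters", the names `BR_MARK`, `A_OPEN`, `A_CLOSE`, and `convert_tags`, and the order of operations are all assumptions made for illustration; the text above states only that the second and third tags are converted to special characters.

```python
import html
import re

# Hypothetical sentinels: control characters that should not occur in page text.
BR_MARK = "\x1e"   # stands for the line-break tag (third tag)
A_OPEN = "\x02"    # stands for the opening hyperlink tag (second tag)
A_CLOSE = "\x03"   # stands for the closing hyperlink tag


def convert_tags(block):
    """Replace real tags with sentinels, then decode escaped text."""
    block = re.sub(r"<br\s*/?>", BR_MARK, block, flags=re.IGNORECASE)
    block = re.sub(r"<a\b[^>]*>", A_OPEN, block, flags=re.IGNORECASE)
    block = re.sub(r"</a\s*>", A_CLOSE, block, flags=re.IGNORECASE)
    block = re.sub(r"<[^>]+>", "", block)  # drop any other markup
    # Text such as "&lt;br&gt;" becomes the literal characters "<br>",
    # which can no longer be mistaken for a real tag.
    return html.unescape(block)


source = 'Use &lt;br&gt; for breaks<br><a href="/x">docs</a>'
converted = convert_tags(source)
print(converted.split(BR_MARK))
```

After conversion, splitting on `BR_MARK` yields two parts, and the literal text `<br>` in the first part is left untouched.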

The Web document dividing program according to claim 9 causes a computer capable of analyzing character information and tag information on a Web site to execute the steps of: creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; obtaining the ratio of the number of link characters, to which a second tag for a hyperlink to another Web site is applied, to the number of all characters included in the block document; determining whether the ratio of link characters is higher than a predetermined first threshold; and, when it is higher than the first threshold, subdividing the block document in units of a third tag indicating a line break included in the block document.

The Web document dividing program according to claim 10 causes a computer capable of analyzing character information and tag information on a Web site to execute the steps of: creating block documents by dividing a document on a Web site in units of a first tag indicating a blank line; obtaining the ratio of the number of link characters, to which a second tag for a hyperlink to another Web site is applied, to the number of all characters included in the block document; determining whether the ratio of link characters is a first content rate, higher than a predetermined first threshold, and whether it is a second content rate, higher still than a predetermined second threshold that is higher than the first threshold; and subdividing the block document in units of a third tag indicating a line break included in the block document in the case of the first content rate, and in units of the second tag included in the block document in the case of the second content rate.

With the Web document dividing method, system, and program described above, documents composed mainly of plain characters, documents that also contain many hyperlinked portions, and documents composed almost entirely of hyperlinked portions can each be divided appropriately. A document on a Web site can thus be divided into appropriate block document units, and document update determination performed by comparing these block documents can be carried out accurately.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a configuration diagram showing the hardware configuration of a document update determination system S to which the Web document dividing method according to the present invention is applied. FIG. 2 is a flowchart schematically showing the overall operation of the document update determination system S. The Web document dividing method according to the present invention can be applied to various uses; incorporation into the document update determination system S is described here as one example.

The document update determination system S comprises a server device 1, which is connected to the Internet line IN and can browse Web sites 3 (WWW sites), and terminal devices 2, such as a personal computer 21 and a mobile phone 22, which can communicate with the server device 1.

A Web site 3 is a location where various Web contents, such as HTML files and image files recorded on a computer (Web server) connected to the Internet line IN, are registered. In response to a user request, such as a keyword search from the server device 1, the Web site 3 provides the corresponding HTML files, image files, and the like. FIG. 1 schematically illustrates a Web site A for "news", a Web site B for "economic news", a Web site C serving as a "bulletin board", and a Web site D serving as a "blog".

The operation of the document update determination system S is described with reference to FIG. 2. An appropriate sampling period is set in the server device 1 (step S1). At each sampling time (YES in step S1), the server device 1 performs a cyclic search of the Web sites 3 on the Internet using predetermined keywords and extracts document data (HTML files and the like) containing the keywords from the various Web contents (step S2). The extracted documents are then divided into appropriate block documents by the method detailed later (step S3), and the server device 1 determines whether an updated document (new document) has been posted on the Web site 3 (step S4).

When an updated document is detected, the server device 1 distributes the update date and time, the URL (information describing the location of the document), and the like to the terminal device 2 (step S5). The terminal device 2 is used to check the content of the updated document, or to analyze it statistically, based on the distributed URL. The holders of the personal computer 21, the mobile phone 22, and other devices constituting the terminal device 2 can thus learn of the existence and content of the updated document immediately and perform statistical analysis and the like.

An outline of the update determination method adopted in the document update determination system S of this embodiment is given here with reference to FIG. 3. As shown in FIG. 3(a), assume that there are a first document 41 and a second document 42 to be compared. For example, if the first document 41 is an existing document and the aim is to determine whether the second document 42 contains any update relative to it (that is, whether it is an updated or new document), the first document 41 is the comparison source and the second document 42 is the comparison target. If the identity of the two documents must be established strictly, all of their constituent characters can be compared (full-text comparison).

Full-text comparison, however, inevitably takes a long time. Characters may therefore be sampled from both documents and compared instead, as follows. As shown in FIG. 3(b), first comparison characters 41a to 41e are extracted from the first document 41, the comparison source, according to a predetermined character extraction condition, and second comparison characters 42a to 42e are extracted from the second document 42, the comparison target, according to the same kind of condition. The character extraction condition is defined by an arithmetic expression for intermittent sampling that depends on the number of constituent characters of the first document 41 and the second document 42 (data length L, a variable) and on a fixed number of characters to extract (sample count C). The first comparison characters 41a to 41e are arranged in order to generate a first comparison character array 410, and the second comparison characters 42a to 42e are arranged in order to generate a second comparison character array 420.

Then, as shown in FIG. 3(c), the comparison characters at the same position in the first comparison character array 410 and the second comparison character array 420 are compared with each other, starting with the first comparison character 41a of the first array and the first comparison character 42a of the second array. If the two arrays are identical throughout, the first document 41 and the second document 42 are determined to be the same document (no update). If there is any difference between the two arrays, the first document 41 and the second document 42 are determined to be different documents (updated: the second document is an updated document containing a changed portion, or a new document).

A simple way to judge the identity of the first document 41 and the second document 42 is to compare their character counts. If the second document 42 was created by adding text to the first document 41, the character counts of the two documents naturally differ. Incorporating a character count comparison into the update determination method shown in FIG. 3 therefore makes the determination more efficient.
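The sampled comparison and the character count check can be combined as in the following Python sketch. The description does not give the arithmetic expression for intermittent sampling, so the evenly spaced positions `i * L // C` used here are an assumption, as are the function names `sample` and `is_updated` and the default sample count of 5 (chosen to match the five comparison characters drawn in FIG. 3).

```python
def sample(text, count=5):
    """Pick `count` evenly spaced characters (assumed sampling rule)."""
    length = len(text)
    if length <= count:
        return text
    # Position depends on the data length L and the fixed sample count C.
    return "".join(text[i * length // count] for i in range(count))


def is_updated(old, new, count=5):
    """Return True when the two block documents are judged to differ."""
    if len(old) != len(new):  # cheap character-count check first
        return True
    return sample(old, count) != sample(new, count)


print(sample("abcdefghij"))                     # acegi
print(is_updated("abcdefghij", "abcdefghijk"))  # True: lengths differ
print(is_updated("abcdefghij", "Xbcdefghij"))   # True: a sampled character differs
print(is_updated("abcdefghij", "aXcdefghij"))   # False: the change was not sampled
```

The last call shows the trade-off of sampling: an edit that preserves the length and falls between sampled positions goes undetected, which is the price paid for avoiding a full-text comparison.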

Here, the Web document dividing method according to the present invention is used to divide a document on a Web site into block document units suitable for comparison, like the first document 41 and the second document 42 described above. However good the update determination method, accurate determination is impossible if the documents to be compared cannot be extracted appropriately. A method that divides Web documents accurately is therefore essential. The Web document dividing method according to an embodiment of the present invention is described in detail below.

FIG. 4 is a functional block diagram showing the functional configuration of the server device 1 (Web document dividing system). The server device 1 comprises a transmission/reception unit 11, a Web site search unit 12, a document division processing unit 13, an update determination processing unit 14, and an overall control unit 100.

The transmission/reception unit 11 is a data communication unit that enables data communication between the server device 1 and the terminal device 2 over a predetermined line such as a LAN, and between the server device 1 and the Web site 3 over the Internet line IN. In this embodiment, the transmission/reception unit 11 transmits search condition information, such as search keywords, to the Web site 3 and receives the search results. When a document update is detected, it transmits the URL indicating the location of the document, and related information, to the terminal device 2.

The Web site search unit 12 sets predetermined search conditions and performs a cyclic search of the Web sites 3 at each sampling time given by the overall control unit 100. For example, if a corporation "XYZ○△" wants to monitor periodically whether articles related to it have been posted on the Internet, the search keywords may be set to "XYZ○△", "XYZ", "○△", "X○", and so on, and the Web sites 3 searched accordingly. Documents containing these keywords are then extracted from documents 1-1 and 1-2, documents 2-1, 2-2, 2-3, and so on, contained in Web sites A, B, C, D, and so on of the Web sites 3 shown in FIG. 1. Such a search allows the corporation to collect evaluations by society and customers promptly and to respond quickly to unexpected rumors.

The document division processing unit 13 mainly performs the following processing, which is described in detail with reference to FIGS. 5 to 15.
(1) First division: a document on the Web site 3 is divided in units of the blank-line tag (first tag) to create block documents.
(2) Content rate determination: the ratio of the number of characters to which a hyperlink tag (second tag) to another Web site is applied (link characters) to the total number of characters, symbols, and other such elements in the block document (hereinafter simply "characters") is obtained.
(3) Second division A: when the ratio of hyperlink characters is higher than a predetermined first threshold (determination 1), the block document is subdivided in units of the line-break tag (third tag) included in the block document.
(4) Second division B: when the ratio of hyperlink characters is higher still than a predetermined second threshold, which is set higher than the first threshold (determination 2), the block document is subdivided in units of the hyperlink tag included in the block document.
(5) Maintenance of the first division: when the ratio of hyperlink characters is lower than the predetermined first threshold (determination 0), the block document is not subdivided.
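Step (2) above, the counting of all characters and of link characters, might be sketched in Python with the standard library's `html.parser`. The class name `LinkCharCounter` and the function `count_characters` are hypothetical, and the sketch treats any `<a>` element with an `href` attribute as a hyperlink without checking that it leads to another Web site. Whitespace between tags is counted as characters, which is also an assumption.

```python
from html.parser import HTMLParser


class LinkCharCounter(HTMLParser):
    """Counts all text characters and those inside <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.total = 0    # all characters in the block
        self.linked = 0   # characters covered by a hyperlink tag
        self._depth = 0   # > 0 while inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        self.total += len(data)
        if self._depth:
            self.linked += len(data)


def count_characters(block):
    """Return (total characters, link characters) for one block document."""
    parser = LinkCharCounter()
    parser.feed(block)
    parser.close()  # flush any buffered text
    return parser.total, parser.linked


block = ('Top stories: <a href="/a">Rates rise</a><br>'
         '<a href="/b">Market falls</a>')
total, linked = count_characters(block)
print(total, linked, f"{100 * linked / total:.1f}%")
```

For this block, 22 of the 35 characters are hyperlinked, a content rate of about 62.9%.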

The update determination processing unit 14 performs document update determination using the algorithm described above with reference to FIG. 3.

The overall control unit 100 comprises a CPU (Central Processing Unit) and related components and governs the operation of the functional units in the server device 1. For example, the overall control unit 100 gives a search instruction signal to the Web site search unit 12 at a predetermined sampling period so that the Web sites 3 are searched via the transmission/reception unit 11, causes the document division processing unit 13 and the update determination processing unit 14 to execute division processing and determination processing in a predetermined sequence, and, when the update determination processing unit 14 detects an updated document, distributes the URL information and related data for that document to the terminal device 2.

FIG. 5 is a block diagram showing the functional configuration of the document division processing unit 13 in detail. The document division processing unit 13 comprises a RAM (Random Access Memory) 131, a division condition setting unit 132, a threshold setting unit 133, a tag conversion unit 134 (character conversion means), a first division unit 135 (first document dividing means), a character counting unit 136 (counting means), a determination unit 137 (determination means), and a second division unit 138 (second document dividing means).

The RAM 131 temporarily holds the document data of the documents hit by the search of the Web site search unit 12, as well as the various data generated during the division processing.

The division condition setting unit 132 stores the definitions of the tag information that serves as the key for the first division, the second division A, and the second division B. FIG. 6 is a table showing an example of these definitions for the case where the markup language describing the Web site 3 is HTML. As shown in FIG. 6, in this embodiment the first division is performed in units of the "<br><br>" tag (first tag), which indicates a blank line; the second division A is performed in units of the "<br>" tag (second tag), which indicates a line break; and the second division B is performed in units of the "<a href>" tag (third tag), which indicates a hyperlink.

The threshold setting unit 133 stores the definitions of the thresholds for the hyperlink character content rate (link character content rate) m, which is the criterion for deciding whether to perform only the first division, or additionally the second division A or the second division B. FIG. 7 is a table showing an example of these threshold definitions. In this example, m = 30% (first threshold) and m = 85% (second threshold) are chosen: "determination 0", under which only the first division is performed, is made when m < 30%; "determination 1", under which the second division A is performed, is made when 30% ≤ m ≤ 85%; and "determination 2", under which the second division B is performed, is made when 85% < m.
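The threshold table of FIG. 7 can be sketched as a small function. This is only an illustrative sketch: the function name `classify` and its parameter names are assumptions introduced here, while the boundary conditions (m < 30%, 30% ≤ m ≤ 85%, 85% < m) follow the example values given above.

```python
def classify(m, first_threshold=30.0, second_threshold=85.0):
    """Map a hyperlink character content rate m (in percent) to a determination.

    Returns 0 (keep the first division only), 1 (second division A, by <br>)
    or 2 (second division B, by <a href>). The name and signature are
    hypothetical; only the boundaries come from the FIG. 7 example.
    """
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be lower than second threshold")
    if m > second_threshold:      # 85% < m
        return 2
    if m >= first_threshold:      # 30% <= m <= 85%
        return 1
    return 0                      # m < 30%
```

Note that both boundary values fall into determination 1: a block with exactly 30% or exactly 85% link characters is re-divided at line breaks.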

The values of the first and second thresholds (30% and 85%) are parameters obtained by analyzing the document structure of various well-known Web sites currently on the Internet (news sites, blog sites, and bulletin board sites). These values are only examples and can be set to other appropriate values, provided that the first threshold is lower than the second threshold. However, as explained later, in order to classify documents accurately by hyperlink character content rate, the first threshold is desirably selected from the range of 20% to 40%, preferably 25% to 35%, and the second threshold from the range of 75% to 95%, preferably 80% to 90%. The first and second thresholds may also be varied according to the type of Web site, the search keyword, the sampling period, and so on.

The tag conversion unit 134 converts the "<br><br>" tag, the "<br>" tag, and the "<a href>" tag into character strings. The "<br><br>" tag is the key for the document division (first division) that is always performed, whatever the document. So that this tag can be recognized reliably, the "<br><br>" tag is converted, for example, into the character string "＿空＿行＿" (roughly "_blank_line_"; hereinafter "special character A") so that it can be used as a delimiter.

The tag conversion unit 134 also converts the "<br>" tag and the "<a href>" tag into special characters, both to prevent misjudgment when the same characters happen to appear in the text of a block document and to allow them to be used later as delimiters where needed. Character strings that essentially never appear in ordinary documents are chosen as these special characters. For example, the "<br>" tag is converted into the character string "＿改＿行＿" (roughly "_line_break_"; hereinafter "special character C"), and the "<a href>" tag into the character string "＿リ＿ン＿ク＿" (roughly "_link_"; hereinafter "special character B").
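As a rough illustration of this tag conversion, the following sketch replaces the three tags with the special character strings named above. The regular expressions, the tolerance for `<br/>` and mixed case, and the function name `convert_tags` are assumptions made for the example; the text specifies only which tag maps to which string.

```python
import re

SPECIAL_A = "＿空＿行＿"       # stands for <br><br> (blank line)
SPECIAL_B = "＿リ＿ン＿ク＿"   # stands for <a href> (hyperlink)
SPECIAL_C = "＿改＿行＿"       # stands for <br> (line break)

def convert_tags(html):
    """Replace the three key tags with their special character strings."""
    # Pairs of <br> must be handled first, otherwise each half of a blank
    # line would be converted as an ordinary line break.
    text = re.sub(r"<br\s*/?>\s*<br\s*/?>", SPECIAL_A, html, flags=re.IGNORECASE)
    text = re.sub(r"<a\s+href[^>]*>", SPECIAL_B, text, flags=re.IGNORECASE)
    text = re.sub(r"<br\s*/?>", SPECIAL_C, text, flags=re.IGNORECASE)
    return text
```

The order of the substitutions matters: converting single `<br>` tags before the pairs would leave no blank-line markers for the first division to split on.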

The first dividing unit 135 divides each document hit by the search of the Web site search unit 12 in units of the "<br><br>" tag, which specifies the display of a blank line, and thereby creates block documents. FIG. 8 is a schematic diagram showing the document dividing operation of the first dividing unit 135. The example shows a Web page 50 carrying economic news, on which a first document 51, a second document 52, a third document 53, and a fourth document 54 are posted, separated by blank lines. The first to fourth documents 51 to 54 are news articles with different content. The first dividing unit 135 reads the "special character A" that is written between the first to fourth documents 51 to 54 as the blank-line display instruction, uses it as a delimiter, and divides the document data in units of this "special character A" so that the first to fourth documents 51 to 54 become first to fourth block documents. The block documents divided in this way are temporarily stored in the RAM 131.

When several self-contained passages with different content are posted with blank lines between them, as with the first to fourth documents 51 to 54 of the Web page 50 shown in FIG. 8, there is no need to subdivide the first to fourth documents 51 to 54 any further. On the contrary, subdividing them in units of the "<br>" tag, which indicates a line break, causes the following problems.

FIG. 9 shows an HTML document 51A, which is the first document 51 of FIG. 8 expressed in HTML. In the HTML document 51A, blank line tags 511 are written at the beginning and end of the first document 51, and line break tags 512 are written at the positions corresponding to the line breaks of the first document 51. If the document is divided at the blank line tags 511 and then subdivided at the line break tags 512, a single coherent document (topic) is fragmented into several block documents.

If such fragmentation occurs, then in the later document update determination, whenever the first document 51 is updated in several of the segments between the line break tags 512 (including when the first document 51 is newly posted), it becomes likely that multiple document updates will be detected even though there is only one coherent document. As a result, the user cannot be notified accurately, and the accuracy of statistical analysis and the like deteriorates.

A problem also arises when the Web documents hit by the search of the Web site search unit 12 are further narrowed down by keyword. Consider, for example, the keyword KE1 "image data" (FIG. 8) contained in the first document 51. Because the keyword KE1 happens to straddle a line break, in the HTML document 51A (FIG. 9) it is split by the <br> of a line break tag 512. Consequently, a search for the keyword KE1 "image data" does not hit the first document 51.

A problem also arises when, for example, an AND search is made with the keyword KE2 "image sensor" and the keyword KE3 "digital camera". If the first document 51 is divided in units of the line break tags 512, the keywords KE2 and KE3 end up belonging to different documents, so the AND condition is not satisfied and the search does not hit the first document 51. For these reasons, a document such as the first document 51 should not be subdivided after the first division in units of the blank line tags 511.
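The AND-search problem can be reproduced with a toy example. The article text below is a hypothetical stand-in for the first document 51, and `and_search` is an illustrative helper, not part of the described system.

```python
def and_search(blocks, keywords):
    """Return the blocks that contain every keyword (AND condition)."""
    return [b for b in blocks if all(k in b for k in keywords)]

# Hypothetical stand-in for the first document 51; "<br>" marks its line breaks.
article = "ABC has developed an image sensor.<br>It targets digital cameras."

whole = [article.replace("<br>", " ")]   # first division only: one block
pieces = article.split("<br>")           # subdivided at every line break

keywords = ["image sensor", "digital camera"]
hits_whole = and_search(whole, keywords)    # the article is found
hits_pieces = and_search(pieces, keywords)  # no single piece has both keywords
```

With the block kept whole, both keywords fall in the same block and the article is returned; after subdivision each keyword lands in a different piece and nothing matches.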

However, division in units of blank line tags alone is not always appropriate. FIG. 10 is a schematic diagram showing a Web page 60 on which several news titles are listed, one per line. The Web page 60 contains first to fourth news titles 61 to 64, and each title carries a hyperlink (the underlined portions in FIG. 10) that, when clicked by the user, jumps to another Web page (for example, a Web page giving the details of the news, as shown in FIG. 8). In the first news title 61, for example, the hyperlink is attached to the text "ABC develops next-generation XYZ technology". No hyperlink is attached to the text "22nd 15:30".

FIG. 11 shows an HTML document page 60A, which is the Web page 60 of FIG. 10 expressed in HTML. The HTML document page 60A contains first to fourth HTML documents 61A to 64A corresponding to the first to fourth news titles 61 to 64. Blank line tags 611 are written at the beginning and end of the group of first to fourth news titles 61 to 64, and line break tags 614 are written at the positions corresponding to the line breaks of the titles 61 to 64. In addition, hyperlink tags 612 and 613 are written in each of the first to fourth HTML documents 61A to 64A.

In this case, the first to fourth news titles 61 to 64 are title documents that each have different content. If such a document is left divided only in units of the blank line tags 611, several unrelated documents are treated as a single block document. Consequently, if any one of the first to fourth news titles 61 to 64 is updated, the later document update determination judges that the whole block has been updated. For example, even where the appearance of documents like the first to third news titles 61 to 63 is subject to update notification but the appearance of a document like the fourth news title 64 is not, the sampling performed after the fourth news title 64 is added to the Web page 60 results in an unwanted "updated" determination.

For these reasons, title documents such as the first to fourth news titles 61 to 64 should be divided in units of the blank line tags 611 and then subdivided in units of the line break tags 614. This subdivision allows several documents with different content to be separated appropriately.

Listings of title documents with different content, such as the first to fourth news titles 61 to 64, are in most cases documents to which hyperlinks are attached on a Web site, that is, documents containing many hyperlink characters. News article documents such as the one shown in FIG. 8, on the other hand, contain almost no hyperlink characters. Whether or not a block document is a title document can therefore generally be judged from the content rate of hyperlink characters relative to the total number of characters in the block document.

If the discrimination threshold (first threshold) for this hyperlink character content rate is set too low, coherent documents are more likely to be subdivided. Consider, for example, the block document 71 shown in FIG. 12, in which hyperlinks HP1 to HP4 are attached only to certain keywords. Documents of this form, in which particular keywords are linked so that clicking them jumps to another Web page showing detailed information about the keyword, are common on Web sites.

In this case, the only characters carrying the hyperlinks HP1 to HP4 (the underlined portions) are the character strings "ABC", "image sensor", "image processing", and "digital camera", which make up a small proportion of the total number of characters in the block document 71. Nevertheless, because the document does contain hyperlink characters, setting the first threshold too low would cause even a block document such as the block document 71 to be subdivided in units of line break tags. To prevent this, the first threshold is desirably about 30%, as stated above.

Even when the hyperlink character content rate is higher than the first threshold, however, uniformly subdividing block documents in units of line break tags can cause problems. FIG. 13 is a schematic diagram showing a Web page 72 on which several link destination sites are listed on each line. The example lists links to a number of newspaper companies: six are listed on the first line 721 and another six on the second line 722.

For a link list document of this kind, in which a single line contains character strings with different content, subdividing the block document in units of line break tags does not divide it into appropriate document units. A link list document such as that of FIG. 13 should instead be subdivided in units of hyperlink tags. Because a hyperlink tag is written for each link destination, a block document consisting of the first line 721 and the second line 722 can in this way be broken down into units such as "AAA Shimbun", "BBB Minpo", and so on.

Link list documents of this kind have a very high hyperlink character content rate. Whether or not a block document is a link list document should therefore be judged using a second threshold set considerably higher than the first threshold, and for this reason the second threshold is desirably about 85%, as stated above.

Returning to FIG. 5, in view of the points above, the character counting unit 136 counts the number of characters in each block document created by the division processing of the first dividing unit 135, and the number of hyperlink characters contained in that block document.
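The two counts can be sketched as follows. Note that this example does not use the special-character approach of the embodiment: as an assumption for illustration it parses the block's HTML directly with Python's standard `html.parser` module, treats text inside an `<a href>` element as hyperlink characters, and ignores whitespace. The names `LinkCharCounter` and `link_ratio` are hypothetical.

```python
from html.parser import HTMLParser

class LinkCharCounter(HTMLParser):
    """Count all text characters and those inside <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while inside an <a href> element
        self.total = 0     # all non-whitespace text characters
        self.linked = 0    # non-whitespace characters inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = sum(1 for ch in data if not ch.isspace())
        self.total += n
        if self.depth > 0:
            self.linked += n

def link_ratio(block_html):
    """Return the hyperlink character content rate m in percent."""
    counter = LinkCharCounter()
    counter.feed(block_html)
    counter.close()
    if counter.total == 0:
        return 0.0         # an empty block contains no link characters
    return 100.0 * counter.linked / counter.total
```

For a title line such as `<a href="n1.html">ABCDEFGH</a>12`, eight of the ten characters are linked, so `link_ratio` gives 80.0, which falls in the range for re-division at line breaks.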

Based on the counts from the character counting unit 136, the determination unit 137 calculates the proportion of hyperlink characters to the total number of characters (the hyperlink character content rate m) and decides, according to that percentage, whether or not to subdivide the block document. In making this decision, the determination unit 137 refers to the first and second thresholds defined in the threshold setting unit 133. That is, after calculating the content rate m of a block document, the determination unit 137 decides among "determination 0", "determination 1", and "determination 2" for that block document according to the criteria shown in FIG. 7.

When the determination unit 137 outputs "determination 1", the second dividing unit 138 subdivides the block document in units of the special character C, which corresponds to the line break tag; when the determination unit 137 outputs "determination 2", it subdivides the block document in units of the special character B, which corresponds to the hyperlink tag. When the determination unit 137 outputs "determination 0", the second dividing unit 138 performs no subdivision. In that case the block document remains as delimited by the special character A, which corresponds to the blank line tag.

The flow of the Web document division processing of this embodiment described above is now explained with reference to the flowcharts shown in FIGS. 14 and 15. First, the tag conversion unit 134 reads from the RAM 131 one of the Web site documents (HTML documents) hit by the search of the Web site search unit 12 (step S11).

Next, the tag conversion unit 134 converts the "<br><br>" tag into "special character A" (step S12). The tag conversion unit 134 then converts the "<a href>" tag into "special character B" (step S13) and the "<br>" tag into "special character C" (step S14). After the other tags contained in the HTML document (comment tags, line feed codes, and the like), tabs, and half-width and full-width spaces are deleted (step S15), the document data is output to the first dividing unit 135.

The first dividing unit 135 then divides the document data in units of the "special character A" produced in step S12, creating one or more block documents (step S16). The first dividing unit 135 also numbers the resulting block documents from 1 to n (step S17), and the data is temporarily stored in the RAM 131 in association with those numbers.

A counter in the computer is set to K = 1 (step S18), the first block document K is read from the RAM 131, and the determination processing shown in FIG. 15 is performed. In this determination processing, the character counting unit 136 first counts the number of characters in the first block document K and the number of hyperlink characters it contains. The determination unit 137 then calculates the hyperlink character content rate m (step S19).

Next, whether subdivision is needed is decided from the calculated content rate m (step S20). When the content rate m satisfies the condition m > 85% (YES in step S20), the determination unit 137 outputs "determination 2" to the second dividing unit 138. In response, the second dividing unit 138 subdivides the block document K in units of "special character B" (step S21). After that, "special character A", "special character B", and "special character C" are deleted from the block document K (step S24), and the data of the block document K is sent to the update determination processing unit 14 in the next stage.

When the content rate m does not satisfy the condition m > 85% (NO in step S20), the determination unit 137 next determines whether the condition m ≥ 30% is satisfied (step S22). When m ≥ 30% is satisfied (YES in step S22), the determination unit 137 outputs "determination 1" to the second dividing unit 138. In response, the second dividing unit 138 subdivides the block document K in units of "special character C" (step S23), and step S24 described above is then executed. When m ≥ 30% is not satisfied (NO in step S22), the determination unit 137 outputs "determination 0" to the second dividing unit 138. In response, the second dividing unit 138 passes the block document K through without subdividing it; that is, the processing skips to step S24.

It is then determined whether the counter has reached K = n (step S25). If K = n does not hold (NO in step S25), other block documents remain for the HTML document that was read, so the counter is incremented by one, K = K + 1 (step S26), the processing returns to step S19, and the same processing is repeated for the next block document K (the second block document).

If K = n holds (YES in step S25), it is determined whether the division processing is to be executed for another Web site document (step S27). If another Web site document exists (YES in step S27), the processing returns to step S11 of FIG. 14 and the same processing is repeated for that Web site document. If no other Web site document exists (NO in step S27), the processing ends.
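Steps S12 to S24 for a single HTML document can be sketched end to end as follows. This is a minimal sketch under several assumptions not stated in the text: the regular expressions used to recognize tags, the function names, and in particular the `END` marker for `</a>`, which is introduced here only so that the extent of each link can be measured (the text does not say how the end of a link is recognized).

```python
import re

A = "＿空＿行＿"       # special character A: <br><br>
B = "＿リ＿ン＿ク＿"   # special character B: <a href>
C = "＿改＿行＿"       # special character C: <br>
END = "＿終＿端＿"     # hypothetical marker for </a>; not in the source

def strip_markers(text):
    """Step S24: delete all marker strings from a piece of text."""
    for marker in (A, B, C, END):
        text = text.replace(marker, "")
    return text

def split_web_document(html, first_threshold=30.0, second_threshold=85.0):
    # Steps S12 to S14: convert the key tags into marker strings.
    text = re.sub(r"(?i)<br\s*/?>\s*<br\s*/?>", A, html)
    text = re.sub(r"(?i)<a\s+href[^>]*>", B, text)
    text = re.sub(r"(?i)</a\s*>", END, text)
    text = re.sub(r"(?i)<br\s*/?>", C, text)
    # Step S15: delete remaining tags, line feeds, tabs and spaces.
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"\s+", "", text)

    link_pattern = re.escape(B) + "(.*?)" + re.escape(END)
    result = []
    for block in text.split(A):                      # step S16
        total = len(strip_markers(block))
        if total == 0:
            continue                                 # skip empty blocks
        linked = sum(len(strip_markers(s))
                     for s in re.findall(link_pattern, block))
        m = 100.0 * linked / total                   # step S19
        if m > second_threshold:                     # steps S20, S21
            parts = block.split(B)
        elif m >= first_threshold:                   # steps S22, S23
            parts = block.split(C)
        else:
            parts = [block]                          # determination 0
        cleaned = (strip_markers(p) for p in parts)  # step S24
        result.extend(p for p in cleaned if p)
    return result
```

Given a page with an unlinked article body, a blank line, and two linked titles separated by a line break, the article stays as one block while each title becomes its own block. Because step S15 removes all spaces, the example is easiest to follow with text that contains none.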

According to the Web document dividing method of this embodiment described above, documents consisting mainly of plain text, documents that contain many hyperlinked portions, and documents that consist almost entirely of hyperlinked portions can each be divided appropriately. Documents on a Web site can therefore be divided into appropriate block document units, and the document update determination performed by comparing these block documents with one another can be carried out accurately.

An embodiment of the present invention has been described above, but the present invention is not limited to it and can, for example, take the modified forms described below.

[1] The above embodiment was described using the example in which the markup language describing the Web site is HTML. The present invention is of course applicable to other markup languages as well, for example XML.

[2] The above embodiment described an example in which the Web document dividing method according to the present invention is incorporated in the document update determination system S. The present invention can also be applied to other uses such as document retrieval, document data analysis, and statistical processing.

[3] The Web document dividing method performed by the document update determination system S described above can also be provided as a program. Such a program can be recorded on a computer-readable recording medium such as a flexible disk, CD-ROM, ROM, RAM, or memory card supplied with a computer and provided as a program product. Alternatively, the program can be provided recorded on a recording medium such as a hard disk built into a computer. The program can also be provided by download over a network.

FIG. 1 is a configuration diagram showing the hardware configuration of the document update determination system S to which the Web document dividing method according to the present invention is applied.
FIG. 2 is a flowchart schematically showing the overall operation of the document update determination system S according to the embodiment.
FIG. 3 is an explanatory diagram outlining the document update determination in the embodiment.
FIG. 4 is a functional block diagram showing the functional configuration of the server device.
FIG. 5 is a block diagram showing the functional configuration of the document division processing unit in detail.
FIG. 6 is a table showing an example of the definition of the tag information.
FIG. 7 is a table showing an example of the threshold definition for the content rate m.
FIG. 8 is a schematic diagram showing an example of the document dividing operation.
FIG. 9 is a schematic diagram showing the HTML document 51A, which is the first document 51 of FIG. 8 expressed in HTML.
FIG. 10 is a schematic diagram showing a Web page on which several news titles are listed, one per line.
FIG. 11 is a schematic diagram showing the HTML document page, which is the Web page of FIG. 10 expressed in HTML.
FIG. 12 is a schematic diagram showing an example of a block document.
FIG. 13 is a schematic diagram showing a Web page on which several link destination sites are listed on each line.
FIG. 14 is a flowchart showing the flow of the Web document division processing according to the embodiment of the present invention.
FIG. 15 is a flowchart showing the flow of the Web document division processing according to the embodiment of the present invention.

Explanation of symbols

1 Server device (Web document dividing system)
11 Transmission/reception unit
12 Web site search unit
13 Document division processing unit
131 RAM
132 Division condition setting unit
133 Threshold setting unit
134 Tag conversion unit (character conversion means)
135 First dividing unit (first document dividing means)
136 Character counting unit (counting means)
137 Determination unit (determination means)
138 Second dividing unit (second document dividing means)
14 Update determination processing unit
2 Terminal device
3 Web site (WWW site)

Claims

1. A Web document dividing method comprising:
creating a block document by dividing a document on a Web site in units of a first tag indicating a blank line;
determining the ratio of the number of link characters, to which a second tag for hyperlinking to another Web site is attached, to the total number of characters included in the block document; and
when the ratio of the number of link characters is higher than a predetermined first threshold, subdividing the block document in units of a third tag indicating a line break included in the block document.

2. A Web document dividing method comprising:
creating a block document by dividing a document on a Web site in units of a first tag indicating a blank line;
determining the ratio of the number of link characters, to which a second tag for hyperlinking to another Web site is attached, to the total number of characters included in the block document;
when the ratio of the number of link characters is higher than a predetermined first threshold, subdividing the block document in units of a third tag indicating a line break included in the block document; and
when the ratio of the number of link characters is higher than a predetermined second threshold that is higher than the first threshold, subdividing the block document in units of the second tag included in the block document.

3. The Web document dividing method according to claim 1, wherein a markup language for describing the Web site is in an HTML format.

4. A Web document dividing system comprising:
first document dividing means for creating a block document by dividing a document on a Web site in units of a first tag indicating a blank line;
counting means for counting the total number of characters included in the block document and the number of link characters, which are characters provided with a second tag for a hyperlink to another Web site;
determining means for obtaining the ratio of the number of link characters to the total number of characters and determining, based on the ratio, whether to subdivide the block document; and
second document dividing means for subdividing the block document when the determining means determines that it is to be subdivided,
wherein the determining means determines that the block document is to be subdivided when the ratio of the number of link characters is higher than a predetermined first threshold, and
the second document dividing means subdivides the block document in units of a third tag indicating a line break included in the block document.
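
The first division step, which cuts a document at blank-line tags, might look like the following Python sketch. The set of tags treated as blank-line tags is not given in this section, so the sketch assumes, purely for illustration, that `<p>` and `</p>` play that role; `first_division` is a hypothetical name.

```python
import re

# Assumption: <p> and </p> stand in for the "first tag indicating a blank line".
_BLANK_LINE_TAG = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)


def first_division(html: str) -> list[str]:
    """Split a document at blank-line tags and drop empty fragments."""
    return [part.strip() for part in _BLANK_LINE_TAG.split(html) if part.strip()]
```

Each returned string corresponds to one block document, which is then passed to the counting and determination stages.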

5. A Web document dividing system comprising:
first document dividing means for creating a block document by dividing a document on a Web site in units of a first tag indicating a blank line;
counting means for counting the total number of characters included in the block document and the number of link characters, which are characters provided with a second tag for a hyperlink to another Web site;
determining means for obtaining the ratio of the number of link characters to the total number of characters and determining, based on the ratio, whether to subdivide the block document; and
second document dividing means for subdividing the block document when the determining means determines that it is to be subdivided,
wherein the determining means determines that the block document is to be subdivided when the ratio of the number of link characters is a first content rate higher than a predetermined first threshold, or a second content rate higher than a predetermined second threshold that is higher than the first threshold, and
the second document dividing means subdivides the block document in units of a third tag indicating a line break included in the block document in the case of the first content rate, and in units of the second tag included in the block document in the case of the second content rate.

6. The Web document dividing system according to claim 4, wherein the first threshold is 30%.

7. The Web document dividing system according to claim 5, wherein the first threshold is 30% and the second threshold is 85%.
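
As a rough illustration of how the two thresholds could drive the determination, the Python sketch below maps a pair of character counts to a subdivision unit. The 30% and 85% values come from the claims. The function name, the return labels, and the choice to let the second threshold take precedence when both conditions hold are assumptions made for this sketch.

```python
FIRST_THRESHOLD_PCT = 30   # first threshold stated in the claims
SECOND_THRESHOLD_PCT = 85  # second threshold stated in the claims


def subdivision_unit(link_chars: int, total_chars: int,
                     first_pct: int = FIRST_THRESHOLD_PCT,
                     second_pct: int = SECOND_THRESHOLD_PCT) -> str:
    """Decide how a block document should be subdivided.

    Returns "hyperlink_tag", "line_break_tag" or "none" (hypothetical labels).
    Integer arithmetic avoids floating-point edge cases at the thresholds.
    """
    if total_chars <= 0:
        return "none"
    scaled = link_chars * 100
    # Assumption: the higher threshold is checked first and takes precedence.
    if scaled > second_pct * total_chars:
        return "hyperlink_tag"
    if scaled > first_pct * total_chars:
        return "line_break_tag"
    return "none"
```

With these values, a block whose link characters make up half of its text would be subdivided at line-break tags, while a block consisting almost entirely of links would be subdivided at each hyperlink tag.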

8. The Web document dividing system according to claim 4, further comprising character conversion means for converting the second tag and the third tag into special characters.
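
One way to picture the character conversion is to replace each hyperlink tag and each line-break tag with a single marker character, so that the later subdivision becomes a plain string split. The Python sketch below is an assumption-laden illustration: the marker characters, the use of `<a>` and `<br>` as the second and third tags, and the names `convert_tags` and `split_on` are all hypothetical and are not taken from the specification.

```python
import re

LINK_MARK = "\u241f"   # hypothetical special character standing for <a ...>
BREAK_MARK = "\u241e"  # hypothetical special character standing for <br>

_A_OPEN = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
_BR = re.compile(r"<br\b[^>]*>", re.IGNORECASE)
_ANY_TAG = re.compile(r"<[^>]+>")


def convert_tags(block_html: str) -> str:
    """Replace hyperlink and line-break tags with markers, drop other tags."""
    text = _A_OPEN.sub(LINK_MARK, block_html)
    text = _BR.sub(BREAK_MARK, text)
    return _ANY_TAG.sub("", text)


def split_on(text: str, mark: str) -> list[str]:
    """Split converted text at one marker and remove the other marker."""
    other = LINK_MARK if mark == BREAK_MARK else BREAK_MARK
    parts = (p.replace(other, "").strip() for p in text.split(mark))
    return [p for p in parts if p]
```

Splitting the converted text on `BREAK_MARK` corresponds to subdivision in units of the line-break tag, and splitting on `LINK_MARK` corresponds to subdivision in units of the hyperlink tag.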

9. A Web document dividing program characterized by causing a computer capable of analyzing character information and tag information on a Web site to execute the steps of:
dividing a document on a Web site in units of a first tag indicating a blank line to create a block document;
obtaining the ratio of the number of link characters, which are characters provided with a second tag for a hyperlink to another Web site, to the total number of characters included in the block document;
determining whether the ratio of the number of link characters is higher than a predetermined first threshold; and
subdividing the block document in units of a third tag indicating a line break included in the block document when the ratio is higher than the first threshold.

10. A Web document dividing program characterized by causing a computer capable of analyzing character information and tag information on a Web site to execute the steps of:
dividing a document on a Web site in units of a first tag indicating a blank line to create a block document;
obtaining the ratio of the number of link characters, which are characters provided with a second tag for a hyperlink to another Web site, to the total number of characters included in the block document;
determining whether the ratio of the number of link characters is a first content rate higher than a predetermined first threshold, and whether it is a second content rate higher than a predetermined second threshold that is higher than the first threshold; and
subdividing the block document in units of a third tag indicating a line break included in the block document in the case of the first content rate, and in units of the second tag included in the block document in the case of the second content rate.