JP5317638B2

JP5317638B2 - Web document main content extraction apparatus and program

Info

Publication number: JP5317638B2
Application number: JP2008291379A
Authority: JP
Inventors: 光正近藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-11-13
Filing date: 2008-11-13
Publication date: 2013-10-16
Anticipated expiration: 2028-11-13
Also published as: JP2010117941A

Description

The present invention relates to a Web document main content extraction apparatus and program, and more particularly to a Web document main content extraction apparatus and program for automatically extracting the portion of a Web document that can be judged to be its main content.

Conventional main content extraction methods rely on extraction rules created manually for each Web document, and extract the main content based on those rules (see, for example, Non-Patent Document 1).
http://fm.goo.ne.jp/

However, when searching Web documents or recommending information based on the Web documents a user has browsed, failing to extract the main content portion of each Web document means that portions unrelated to the main content, such as navigation links and advertisements, become noise. Conventionally, this problem has been addressed by manually creating extraction rules for each Web document and extracting the main content with them, but it is difficult to create rules manually for all Web documents. Moreover, because the structure of a Web document is updated over time, it is difficult to keep using a rule once it has been created.

The present invention has been made in view of the above points, and its object is to provide a Web document main content extraction apparatus and program capable of automatically extracting main content without manually creating extraction rules.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is a Web document main content extraction apparatus for extracting the main content of a Web document, comprising:
document dividing means 120 that, when a Web document is input, divides the Web document into segments based on a predetermined division rule and stores them in storage means 160;
feature amount extraction means 130 that extracts, for each segment produced by the document dividing means 120, feature amounts for main content determination and stores them in the storage means 160 for each segment;
main content determination means 140 that determines, using a machine learning algorithm and based on the feature amounts of each segment, whether the segment is main content; and
main content output means 150 that combines the parts determined to be main content by the main content determination means 140 and outputs the result as the main content,
wherein the feature amount extraction means 130 extracts, as feature amounts of each divided segment, the number of punctuation marks contained in the character string, the amount of the character string, the amount of character strings other than those displayed on the Web, the number of anchor links, the average amount of anchor link character strings, the total amount of anchor link character strings, the character string amount of the URL of the longest anchor link destination, and a numerical value indicating whether an anchor link related to an advertisement is included.
In the present invention (claim 2), the feature amount extraction means 130 further extracts, as feature amounts of each divided segment, one or more of: the number of text-related HTML tags; the number of consecutive occurrences of text-related HTML tags; the number of link list tags; the ratio of the number of text-related HTML tags to the number of their consecutive occurrences; the ratio of the amount of character strings displayed on the Web to the number of tags; and the ratio of the number of anchor links to the number of link list tags.
In the present invention (claim 3), the feature amount extraction means 130 includes special character deletion means that, before extraction is performed, deletes the special characters used to display symbols used in HTML tags in a Web browser, as well as other HTML special characters.

The present invention (claim 4) further includes:
advertisement target area extraction means that extracts the advertisement target area when one exists in the input Web document;
noise removal means that removes tags and areas that constitute noise from the Web document; and
dividing means that divides the Web document output from the noise removal means using a predetermined division rule and stores the result in the storage means.

The present invention (claim 5) further includes normalization means for normalizing the extracted feature amounts.

The present invention (claim 6) is a Web document main content extraction program for causing a computer to function as each of the means constituting the Web document main content extraction apparatus according to any one of claims 1 to 5.

As described above, the present invention makes it possible to extract main content automatically without manually creating extraction rules. Furthermore, because main content extraction is fully automatic, the invention can cope with changes to the content of a Web document.

Embodiments of the present invention are described below with reference to the drawings.

The present invention extracts the main content of a Web document by first dividing the Web document, then extracting feature amounts from the information contained in each divided part, and determining whether each part is main content. The main feature amounts are text information, anchor link information, and the class information and tag information used in HTML, XHTML, and the like. To improve accuracy, processing such as extraction of section-targeted advertisement regions and removal of advertisements is also performed. Examples of main content are shown in FIG. 2 and FIG. 3, where the area inside the broken line is the main content.

FIG. 4 shows the configuration of a Web document main content extraction apparatus according to an embodiment of the present invention.

The Web document main content extraction apparatus 100 shown in FIG. 4 comprises a Web document acquisition/input unit 110, a Web document dividing unit 120, a feature amount extraction unit 130, a main content determination unit 140, a main content output unit 150, and a storage unit 160.

<Web document acquisition/input unit>
The Web document acquisition/input unit 110 inputs the Web document (data) to be processed. Its configuration is shown in FIG. 5. The Web document acquisition/input unit 110 comprises: a data input unit 111 that receives from the user either the URL of the Web document whose main content is to be extracted or the file itself; a URL input unit 113 that acquires the URL when the input is a URL; a Web document acquisition unit 114 that acquires the Web document at that URL; a Web document file input unit 112 that acquires the Web document when the input is the Web document itself; and a document code conversion unit 115 that converts the character encoding of the Web document to UTF-8 so that all documents share one encoding.
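The encoding unification performed by the document code conversion unit 115 can be sketched as follows. This is only an illustration: the function name `to_utf8` and the idea that the source encoding is already known (for example from an HTTP header or a meta tag) are assumptions, not details given in the text.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a fetched Web document as UTF-8.

    `source_encoding` is assumed to be known in advance; how it is
    detected is outside the scope of this sketch.
    """
    # Undecodable bytes are replaced rather than raising, so that one bad
    # byte does not abort processing of the whole document.
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")
```

For example, a page saved as Shift_JIS would be passed as `to_utf8(raw_bytes, "shift_jis")` before any further processing.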

<Web document dividing unit 120>
The Web document dividing unit 120 divides the acquired document. FIG. 6 shows its configuration. The Web document dividing unit 120 comprises an advertisement target area extraction unit 121, a noise tag and area removal unit 122, and a Web document division processing unit 123.

In the Web document dividing unit 120, the advertisement target area extraction unit 121 first checks whether the document contains a region marked with content tags for Internet advertising and, if so, extracts that region. Here, Internet advertising tags are the tags that advertising companies such as Google and Overture use to narrow down the main content for advertisement delivery. For Google advertisements, the region runs from <!--google_ad_section_start--> to <!--google_ad_section_end-->. Because the exact character string of these tags varies slightly between Web documents and may be written in uppercase, the differences in notation are absorbed by, for example, using case-insensitive regular expressions or regular expressions with wildcards. In the following, whenever processing with regular expressions is described, it is assumed that such differences are absorbed in this way.
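A minimal sketch of the advertisement target area extraction, assuming Python's `re` module. The pattern and the function name `extract_ad_target_regions` are illustrative assumptions; the text only states that case-insensitive and wildcard regular expressions are used.

```python
import re

# Hypothetical pattern: tolerates one or more hyphens, optional whitespace,
# extra text inside the marker, and upper/lower case differences.
AD_SECTION_RE = re.compile(
    r"<!-+\s*google_ad_section_start.*?-+>(.*?)<!-+\s*google_ad_section_end.*?-+>",
    re.IGNORECASE | re.DOTALL,
)

def extract_ad_target_regions(html: str) -> list[str]:
    """Return the contents of every marked region, or [] if none exist."""
    return [match.group(1) for match in AD_SECTION_RE.finditer(html)]
```

When the returned list is empty, later stages would fall back to the whole input document, as described for the noise removal unit.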

The noise tag and area removal unit 122 (hereinafter the "noise removal unit") operates on the region extracted as described above when an Internet advertising region exists, and on the originally input Web document when no such region exists. The noise removal unit 122 then removes unnecessary tags, regions, and specific character strings. The removed tags and regions include comment tags that annotate the HTML of the Web document, JavaScript, and form tags. The tags and regions to be removed are listed below.

・Comment tags that begin with "<!--" and end with "-->";
・The region enclosed by a "<script>" tag and a "</script>" tag;
・The region enclosed by a "<style>" tag and a "</style>" tag;
・The region enclosed by a "<select>" tag and a "</select>" tag;
・The region enclosed by a "<noscript>" tag and a "</noscript>" tag;
・The region enclosed by a "<form>" tag and a "</form>" tag;
・Runs of consecutive space characters (a single space is kept)
・Runs of consecutive tab characters (a single tab is kept)
The noise removal unit 122 removes the above tags, regions, and character strings using regular expressions. Because a tag may carry attributes such as alt or class, the regular expressions are written to match tags that include such attributes (e.g., <style class="hoge">).
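The removal rules listed above can be sketched with regular expressions as follows. The patterns and the name `remove_noise` are assumptions made for illustration; in particular, deleting runs of spaces and tabs outright (rather than collapsing them to one character) follows a literal reading of the list.

```python
import re

_FLAGS = re.IGNORECASE | re.DOTALL

# Comment tags, then paired tags whose whole enclosed region is dropped.
# "[^>]*" lets the opening tag carry attributes such as class or alt.
NOISE_PATTERNS = [re.compile(r"<!--.*?-->", _FLAGS)] + [
    re.compile(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", _FLAGS)
    for tag in ("script", "style", "select", "noscript", "form")
]

def remove_noise(html: str) -> str:
    """Strip comments, script/style/select/noscript/form regions, and
    runs of two or more spaces or tabs."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    html = re.sub(r" {2,}", "", html)   # a single space is left untouched
    html = re.sub(r"\t{2,}", "", html)  # a single tab is left untouched
    return html
```

Regular expressions cannot handle every malformed or nested HTML construct, so this is a sketch of the stated rules rather than a robust HTML cleaner.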

The Web document division processing unit 123 divides the Web document. The division rule is to split the document at the following tags.

・<div>
・</div>
・<td>
・</td>
Because a tag may carry attributes such as alt or class, the division uses regular expressions that account for tags with such attributes (e.g., <div class="hoge">). Each piece of the divided Web document is hereinafter called a "segment", and feature amount extraction and main content determination are performed segment by segment.
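The division rule can be sketched as a regular-expression split on the four tags. The name `split_into_segments` and the decision to discard empty or whitespace-only pieces are assumptions for illustration.

```python
import re

# Opening and closing <div> / <td> tags, with or without attributes.
SPLIT_RE = re.compile(r"</?(?:div|td)\b[^>]*>", re.IGNORECASE)

def split_into_segments(html: str) -> list[str]:
    """Split a document at <div>, </div>, <td>, and </td> boundaries."""
    parts = SPLIT_RE.split(html)
    # Assumption: pieces containing only whitespace carry no information.
    return [part.strip() for part in parts if part.strip()]
```

Each returned string corresponds to one segment that would then be stored and passed to feature amount extraction.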

Each segment is stored in the storage unit 160.

<Feature amount extraction unit 130>
The feature amount extraction unit 130 extracts feature amounts from the segments stored in the storage unit 160; these are used to determine the main content portion of the Web document. The configuration of the feature amount extraction unit 130 is shown in FIG. 7.

The feature amount extraction unit 130 shown in FIG. 7 comprises an anchor link information feature amount extraction unit 131, a tag information feature amount extraction unit 132, a feature amount extraction unit for character strings displayed in the Web document (hereinafter the "character string feature amount extraction unit") 133, a feature amount normalization unit 134, and a feature amount ratio extraction unit (hereinafter the "feature amount extraction processing unit") 135.

<Anchor link information feature amount extraction unit 131>
The anchor link information feature amount extraction unit 131 extracts feature amounts related to anchor links from the segments output by the Web document dividing unit 120 and stored in the storage unit 160.

(1) Number of anchor links:
A segment that contains many anchor links is unlikely to be main content. The anchor link information feature amount extraction unit 131 therefore uses the number of anchor links as a feature amount. Specifically, the feature amount is the number of anchor links expressed by <a href=...>...</a> tags.

The feature amount normalization unit 134 described below derives the final feature amount from this count in two ways: by normalizing the value, and by using its absolute value. For example, if the largest number of anchor links in any segment is "10" and a given segment contains "5" anchor links, the normalized feature amount for the number of anchor links in that segment is 0.5. Because anchor link tags may include class or alt attributes, the number of anchor link tags is counted using regular expressions.
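Counting anchor links with a regular expression that tolerates extra attributes might look like the sketch below. The pattern and the name `count_anchor_links` are illustrative assumptions.

```python
import re

# Opening <a ...> tags that carry an href attribute; other attributes
# (class, alt, ...) may appear before or after it.
ANCHOR_OPEN_RE = re.compile(r"<a\b[^>]*\bhref\s*=[^>]*>", re.IGNORECASE)

def count_anchor_links(segment: str) -> int:
    """Return the number of anchor links in one segment."""
    return len(ANCHOR_OPEN_RE.findall(segment))
```

With a per-document maximum of 10, a segment for which this function returns 5 would receive the normalized value 5 / 10 = 0.5, matching the example in the text.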

(2) Average amount of anchor link character strings:
When the anchor links in a segment have long character strings on average, the segment is likely to be a set of navigation links such as related articles. When the anchor link character strings are short on average, they are likely to be keyword search links embedded in the main content. The anchor link information feature amount extraction unit 131 therefore uses the average amount of the anchor link character strings in the segment as a feature amount. The anchor link character string is the ○○○ part of <a href='...'...>○○○</a>. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

(3) Total amount of all anchor link character strings:
When the total amount of anchor link character strings in a segment is large, the segment is likely to consist of navigation links. The anchor link information feature amount extraction unit 131 therefore uses the total amount of the anchor link character strings in the segment as a feature amount. The anchor link character string is the ○○○ part of <a href='...'>○○○</a>. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

(4) Amount of the longest anchor link URL:
When the URL character string of an anchor link destination is very long, the segment is likely to be an advertisement. The length of the longest anchor link destination URL in the segment is therefore used as a feature amount. The anchor link destination URL is the △△△ part of <a href='△△△'...>...</a>. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.
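Features (2) to (4) can be computed in one pass over the anchors of a segment, as in the following sketch. It assumes that href values are quoted and that "amount" means length in characters; the names `anchor_text_features` and the dictionary keys are hypothetical.

```python
import re

# Captures the quoted href value (group 2) and the anchor body (group 3).
ANCHOR_RE = re.compile(
    r"<a\b[^>]*\bhref\s*=\s*(['\"])(.*?)\1[^>]*>(.*?)</a\s*>",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")

def anchor_text_features(segment: str) -> dict[str, float]:
    """Average and total anchor text length, and the longest href length."""
    urls, texts = [], []
    for match in ANCHOR_RE.finditer(segment):
        urls.append(match.group(2))
        # Strip nested tags so only the displayed anchor text is measured.
        texts.append(TAG_RE.sub("", match.group(3)).strip())
    total = sum(len(text) for text in texts)
    return {
        "avg_anchor_text_len": total / len(texts) if texts else 0.0,
        "total_anchor_text_len": float(total),
        "max_url_len": float(max((len(url) for url in urls), default=0)),
    }
```

These raw values would then go through the normalization and absolute-value steps of the feature amount normalization unit 134.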

(5) Whether an advertisement-related anchor link is included:
Anchor links whose URLs point to advertisements tend to contain characteristic character strings, such as "adclick", "adnet", and "banner". The anchor link information feature amount extraction unit 131 therefore extracts a feature amount that is 1 when the segment contains an anchor whose URL includes such an advertisement-prone character string and 0 otherwise, and stores it in the memory 136. Lists of character strings that tend to indicate advertisements are published on sites such as that of the Adblock plugin, a Firefox add-on for removing advertisements, and those lists are used.
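The binary advertisement feature can be sketched as a keyword test on each href value. Only the three keywords named in the text are used here; a fuller list taken from an ad-blocking filter is assumed but not reproduced, and the names below are hypothetical.

```python
import re

# Illustrative keywords from the text; a real list would be much longer.
AD_URL_KEYWORDS = ("adclick", "adnet", "banner")

HREF_RE = re.compile(
    r"<a\b[^>]*\bhref\s*=\s*(['\"])(.*?)\1", re.IGNORECASE | re.DOTALL
)

def has_ad_anchor(segment: str) -> int:
    """Return 1 if any anchor URL contains an ad-related keyword, else 0."""
    for match in HREF_RE.finditer(segment):
        url = match.group(2).lower()
        if any(keyword in url for keyword in AD_URL_KEYWORDS):
            return 1
    return 0
```

Note that only the URL is inspected, so an anchor whose visible text merely mentions "banner" does not set the flag.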

<Tag information feature amount extraction unit 132>
The tag information feature amount extraction unit 132 extracts feature amounts related to tag information such as HTML tags.

(1) Number of text-related HTML tags:
A segment with many character strings displayed in the Web document contains many text-related HTML tags. In CGM such as blogs, the displayed character strings are often short, but users frequently make heavy use of line break tags. The tag information feature amount extraction unit 132 therefore uses the number of text-related HTML tags as a feature amount. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

For example, if the largest number of HTML tags in any segment is "10" and a given segment contains "5" HTML tags, the feature amount for the number of HTML tags in that segment is "0.5". The text-related HTML tags used in this embodiment are the following.

・<p>
・</p>
・<br>
・</br>
・<font>
・</font>
Because a tag may carry attributes such as size or class, the count uses regular expressions that account for tags with such attributes (e.g., <font size="+1">).
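Counting the text-related tags listed above can be sketched as follows; the pattern and the name `count_text_tags` are assumptions for illustration.

```python
import re

# Opening and closing <p>, <br>, and <font> tags, with optional attributes.
# "\b" prevents <pre> or <body> from being mistaken for <p> or <br>.
TEXT_TAG_RE = re.compile(r"</?(?:p|br|font)\b[^>]*>", re.IGNORECASE)

def count_text_tags(segment: str) -> int:
    """Return the number of text-related HTML tags in one segment."""
    return len(TEXT_TAG_RE.findall(segment))
```

Opening and closing tags are counted separately here, since the list in the text names both forms.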

(2) Number of consecutive occurrences of text-related HTML tags:
In a segment where character strings displayed in the Web document are concentrated, text-related tags are not only numerous but also appear consecutively. "Appear consecutively" here means that no other HTML tags, such as anchor links, appear between them. The tag information feature amount extraction unit 132 therefore uses the number of consecutive occurrences of the text-related HTML tags described in (1) as a feature amount. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

For example, if the largest number of consecutive HTML tags in any segment is "10" and a given segment has "5" consecutive HTML tags, the feature amount for that segment is "0.5".
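One possible reading of "number of consecutive occurrences" is the length of the longest run of text-related tags that is not interrupted by any other tag. The sketch below implements that reading; the interpretation and the name `longest_text_tag_run` are assumptions, since the text does not define the count precisely.

```python
import re

ANY_TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
TEXT_TAGS = {"p", "br", "font"}

def longest_text_tag_run(segment: str) -> int:
    """Length of the longest run of text tags with no other tag between."""
    longest = current = 0
    for match in ANY_TAG_RE.finditer(segment):
        if match.group(1).lower() in TEXT_TAGS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other tag (e.g. an anchor) breaks the run
    return longest
```

Plain text between tags does not break a run under this reading; only a non-text tag does.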

(3) Number of link list tags:
When a segment contains many link list tags, it contains many anchor links such as navigation links and is unlikely to be main content. The tag information feature amount extraction unit 132 therefore uses the number of link list tags as a feature amount. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

For example, if the largest number of link list tags in any segment is "10" and a given segment contains "5" link list tags, the feature amount for the number of link list tags in that segment is "0.5". The link list tags used in this embodiment are the following.

・<li>
・<ul>
・<dl>
・<dd>
・<ol>
Because a tag may carry attributes such as alt or class, the count uses regular expressions that account for tags with such attributes (e.g., <font class="hoge">).

(4) Amount of character strings not displayed in the Web document (including HTML tags):
A segment that contains a large amount of character strings not displayed on the Web (including HTML tags) is likely to be something other than main content, such as an advertisement. The amount of character strings other than those displayed in the Web document (including HTML tags) is therefore used as a feature amount. This feature amount is likewise processed by the feature amount normalization unit 134 described below, both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

For example, if the largest amount of non-displayed character strings in any segment is "100" and a given segment contains "50", the feature amount for the amount of non-displayed character strings in that segment is "0.5".
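A simple way to approximate the non-displayed amount is to subtract the length of the visible text from the length of the raw segment. This sketch assumes that everything inside "<...>" is non-displayed and everything else is displayed; the function names are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def visible_text(segment: str) -> str:
    """Segment text with all HTML tags removed."""
    return TAG_RE.sub("", segment)

def non_displayed_amount(segment: str) -> int:
    """Number of characters in the segment that belong to tags."""
    return len(segment) - len(visible_text(segment))
```

The same `visible_text` helper also yields the displayed character string whose length is used as the "amount of character strings" feature.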

<Feature amount extraction unit 133 for character strings displayed in the Web document>
This section describes feature amounts related to character strings displayed in a Web browser. A "character string" here excludes character strings that a Web browser does not display, such as HTML tags.

(1) Amount of character strings:
In general, the main content portion of a Web document contains more text than the portions that are not main content. The same holds for Web documents that contain little text overall. The character string feature amount extraction unit 133 therefore uses the amount of character strings contained in each divided part of the Web document as a feature amount. The feature amount normalization unit 134 described below then processes it both by normalization and by the absolute-value method, and the resulting final feature amount is stored in the memory 136.

<Feature amount normalization unit 134>
The feature amount normalization unit 134 normalizes each of the above feature amounts stored in the memory 136 by the following methods.

(1) Character strings:
a) Normalizing the amount of character strings to obtain a feature amount:
Normalization is performed so that the segment with the largest amount of character strings among all segments has a feature amount of "1". For example, if the largest amount of character strings in any segment is "200" and a given segment's amount is "100", that segment's feature amount is "0.5". This normalization makes it possible to extract main content even from Web documents that contain little text overall.
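Method a) amounts to dividing every segment's value by the largest value in the document. A minimal sketch, with the function name `normalize_by_max` and the zero-maximum handling as assumptions:

```python
def normalize_by_max(values: list[float]) -> list[float]:
    """Scale per-segment values so that the largest one becomes 1.0."""
    peak = max(values, default=0)
    if peak <= 0:
        # Assumption: when no segment has any text, every feature is 0.
        return [0.0 for _ in values]
    return [value / peak for value in values]
```

For the example in the text, `normalize_by_max([200, 100])` gives `[1.0, 0.5]`.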

b) Using the absolute value of the amount of character strings as a feature amount:
The normalization in a) is effective for Web documents with little text overall, but it causes a problem when a document has a lot of text overall and the amount of text differs greatly between the segments of the main content. For example, if the largest amount of character strings in any segment is "1000" and a given segment's amount is "100", that segment's feature amount becomes "0.1". An amount of 100 should count as a lot of text, yet normalization makes it look small. A method that uses the absolute value of the amount of character strings as a feature amount is therefore also needed.

Specifically, a feature amount is set to "1" when the amount of character strings reaches a given threshold. For example, when the amount of character strings in a segment is "100", the feature for "amount of 5 or more" is "1", the feature for "amount of 10 or more" is "1", and so on, while the feature for "amount of 105 or more" is "0", and so on up to the feature for "amount of 200 or more", which is also "0". Because each feature is "1" only when its threshold is reached, the absolute amount of character strings can be expressed while the maximum feature value stays at "1". The threshold interval in this embodiment is "5", but an interval suited to the application is preferable; powers of two such as 8, 16, 32, and 64 may also be used as thresholds. The largest threshold x in "amount of x or more" should likewise be chosen to suit the application. To reduce the amount of computation in the main content determination unit 140, the value of x can be made smaller.
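The threshold scheme can be illustrated with a short sketch. The function name `threshold_features` and the dictionary representation are hypothetical; the interval of 5, the upper threshold of 200, and the powers of two are the values mentioned in the text.

```python
def threshold_features(amount, thresholds):
    """Return one 0/1 feature per threshold: 1 if amount >= threshold."""
    return {t: int(amount >= t) for t in thresholds}


linear = list(range(5, 201, 5))          # 5, 10, ..., 200 (interval of 5)
powers = [2 ** k for k in range(3, 7)]   # 8, 16, 32, 64

features = threshold_features(100, linear)
print(features[5], features[100], features[105], features[200])
print(threshold_features(20, powers))
```

Lowering the largest threshold shortens the feature vector, which is how the text proposes to reduce computation in the determination step.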

(2) Number of punctuation marks:
Segments that tend to be noise, such as Web advertisements, often contain a large amount of text but few punctuation marks. The number of punctuation marks is therefore used as a feature. Specifically, the feature amount normalization unit 134 counts the occurrences of "、", "，", "。", "．", "！", "・", "？", and "…" in the character strings of the segment. As with the amount of character strings, two feature amounts are computed: one by normalization and one from the absolute value, using the same methods described in (1) Character string.
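A minimal sketch of the punctuation count is shown below. The set of marks is the one listed in the text; the function name `count_punctuation` is an assumption made for illustration. The resulting count would then go through the same normalization and threshold steps as the amount of character strings.

```python
# The eight punctuation marks listed in the description.
PUNCTUATION = set("、，。．！・？…")


def count_punctuation(text):
    """Count how many characters of `text` are in the listed punctuation set."""
    return sum(1 for ch in text if ch in PUNCTUATION)


print(count_punctuation("今日は晴れ、明日は雨。"))
```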

<Feature amount ratio feature extraction unit 135>
The feature amount ratio feature extraction unit (hereinafter, "ratio feature amount extraction unit") 135 computes feature amounts from the ratios between the anchor link information feature amounts, the tag information feature amounts, and the feature amounts of character strings displayed in the Web document, all of which are stored in the memory 136.

(1) Ratio of the number of text tags to the number of consecutive occurrences of text tags:
A segment that has many text tags and many consecutive occurrences of text tags contains densely written text and is therefore likely to be main content. In contrast, a segment that has many text tags but few consecutive occurrences is probably just a large segment and is unlikely to be main content. The ratio feature amount extraction unit 135 therefore uses the ratio between the number of text tags and the number of consecutive occurrences of text tags as a feature amount.

Specifically, the feature amount is the number of consecutive occurrences of text tags (numerator) divided by the number of text tags (denominator). When the number of text tags is "0", the denominator would be "0", so the feature amount is set to "0" in that case. This feature amount is also normalized by the feature amount normalization unit 134, and the result is stored in the memory 136 as the final feature amount. The larger this feature amount, the more likely the segment is main content.

(2) Ratio of character strings displayed on the Web to tags:
A segment that contains many character strings displayed on the Web is likely to be main content, but the same segment may also contain many tags such as HTML tags. In that case, as described in (1) "Ratio of the number of text tags to the number of consecutive occurrences of text tags", the segment may simply be large and not be main content. The ratio feature amount extraction unit 135 handles such cases by using the ratio of displayed character strings to tags as a feature amount.

Specifically, the feature amount is the amount of character strings displayed on the Web (numerator) divided by the number of tags (denominator). The larger this feature amount, the more likely the segment is main content. This feature amount is also normalized by the feature amount normalization unit 134, and the result is stored in the memory 136 as the final feature amount. When the number of tags is "0", the denominator would be "0", so the feature amount is set to "1".

(3) Ratio of the number of anchor links to the number of link list tags:
The more anchor links or link list tags a segment contains, the less likely it is to be main content; however, these counts can also be high merely because the segment is large. The ratio feature amount extraction unit 135 therefore uses the ratio between the number of anchor links and the number of link list tags as a feature amount.

Specifically, the feature amount is the number of link list tags (numerator) divided by the number of anchor links (denominator). The larger this feature amount, the more densely links are packed relative to the area of the segment, and the less likely the segment is main content. This feature amount is also normalized by the feature amount normalization unit 134, and the result is stored in the memory 136 as the final feature amount. When the number of anchor links is "0", the denominator would be "0", so the feature amount is set to "0".
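The three ratio features, including their zero-denominator fallbacks, can be summarized in one sketch. The dictionary keys and function names are hypothetical; only the numerators, denominators, and fallback values come from the text. The values returned here are the raw ratios, before normalization by the feature amount normalization unit 134.

```python
def ratio_feature(numerator, denominator, if_zero_denominator):
    """Divide, falling back to a fixed value when the denominator is 0."""
    if denominator == 0:
        return if_zero_denominator
    return numerator / denominator


def ratio_features(seg):
    """Compute the three ratio features for one segment (raw counts in `seg`)."""
    return {
        # (1) consecutive text tags / text tags, 0 when there are no text tags
        "text_tag_density": ratio_feature(
            seg["consecutive_text_tags"], seg["text_tags"], 0.0),
        # (2) displayed characters / tags, 1 when there are no tags
        "chars_per_tag": ratio_feature(
            seg["displayed_chars"], seg["tags"], 1.0),
        # (3) link list tags / anchor links, 0 when there are no anchor links
        "link_list_per_anchor": ratio_feature(
            seg["link_list_tags"], seg["anchor_links"], 0.0),
    }


segment = {"consecutive_text_tags": 3, "text_tags": 4, "displayed_chars": 120,
           "tags": 0, "link_list_tags": 2, "anchor_links": 0}
print(ratio_features(segment))
```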

<Main content determination unit>
The main content determination unit 140 uses the feature amounts extracted by the feature amount extraction unit 130 to determine whether each segment is main content. FIG. 8 shows the configuration of the main content determination unit 140.

The main content determination unit 140 shown in the figure comprises a feature amount input unit 141 that receives the segment feature amounts output from the feature amount extraction unit 130, a text determination unit 142 that determines for each segment whether text is present, and a main content determination processing unit 143 that extracts the segments that are main content based on the parameters for each feature amount.

The following describes how the per-segment feature amounts obtained from the feature amount extraction unit 130 are used to determine whether a segment is main content.

FIG. 9 is a flowchart of the feature amount parameter estimation method according to an embodiment of the present invention.

First, training data is created by manually judging, for each segment from which feature amounts were extracted, whether the segment is main content (steps 101, 102, and 103). A segment with no character string displayed on the Web is not regarded as main content, as described earlier; such segments are nevertheless included in the training data because they are useful as negative examples in machine learning. Learning is then performed on the segment feature amounts to compute a weight for each feature amount (step 104). When speed matters most, the model is trained with the maximum entropy method; when accuracy matters most, it is trained with a Support Vector Machine using a second-order polynomial kernel, and a learning model is created (step 105). The segment feature amounts are then input to the main content determination processing unit 143 together with the learned parameters.
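The weight-learning step can be sketched as below for the speed-oriented option. For a two-class problem, a maximum entropy model has the same form as logistic regression, so the sketch trains a logistic regression by plain stochastic gradient ascent. Everything concrete here is an assumption for illustration: the three-feature layout, the toy training data, the learning rate, and the number of epochs are not taken from the patent, and the Support Vector Machine option is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, lr=0.5):
    """Learn one weight per feature plus a bias by stochastic gradient ascent."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the score
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


def is_main_content(x, w, b):
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5


# Hypothetical features: [normalized text amount, normalized punctuation, link ratio]
X = [[1.0, 0.9, 0.0], [0.8, 1.0, 0.1],                    # manually labeled main content
     [0.1, 0.0, 1.0], [0.0, 0.0, 0.0], [0.3, 0.1, 0.8]]   # negative examples
y = [1, 1, 0, 0, 0]

w, b = train(X, y)
print([is_main_content(x, w, b) for x in X])
```

The all-zero row stands for a segment with no displayed text, which the description keeps in the training data as a negative example.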

The main content determination processing unit 143 determines, for each segment, whether it is main content based on its feature amounts, and outputs only the main content segments to the main content output unit 150.

When this apparatus is built into a user PC or similar device, processing all of the feature amounts is too computationally expensive, so the processing load is reduced by narrowing down the feature amounts that are extracted. In that case, a separate learning model is created by machine learning for each narrowed-down set of feature amounts.

<Main content output unit 150>
After the main content determination unit 140 has judged each segment, the main content output unit 150 joins only the segments that the learner judged to be main content and produces the final output.

FIG. 10 shows the configuration of the main content output unit 150. The main content output unit 150 shown in the figure comprises a tagged text output unit 151, an untagged text output unit 152, and a data output unit 153. When the apparatus is used as pre-processing for information retrieval, the tagged text output unit 151 keeps tags such as HTML tags, and the result is output from the data output unit 153. When the content of the Web document is to be analyzed, for example for information recommendation, the untagged text output unit 152 removes tags such as HTML tags, and the result is output from the data output unit 153.
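The two output modes can be sketched as follows. The function name `render_main_content`, the (html, is_main) pair representation, the newline separator, and the regular expression used to strip tags are all assumptions; the regular expression is a naive stand-in and does not handle every HTML construct (comments or `>` inside attribute values, for instance).

```python
import re

# Naive tag matcher, used here only for illustration.
TAG_PATTERN = re.compile(r"<[^>]+>")


def render_main_content(segments, keep_tags):
    """Join the segments judged to be main content, with or without tags."""
    main = [html for html, is_main in segments if is_main]
    if not keep_tags:
        main = [TAG_PATTERN.sub("", html) for html in main]
    return "\n".join(main)


segments = [("<p>Hello</p>", True),
            ("<ul><li><a href='#'>ad</a></li></ul>", False),
            ("<p>World</p>", True)]
print(render_main_content(segments, keep_tags=True))   # tagged output (unit 151)
print(render_main_content(segments, keep_tags=False))  # untagged output (unit 152)
```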

<Pre-processing for feature amount extraction>
To improve accuracy, it is effective to remove unnecessary character strings as pre-processing before the feature amount extraction unit 130 extracts feature amounts.

Removing the unnecessary character strings listed below in advance improves the accuracy of main content determination.

・ &nbsp;
・ &lt;
・ &gt;
・ &amp;
・ &laquo;
・ &raquo;
These character strings are special characters used to display, in a Web browser, symbols that are otherwise used in HTML tags and the like. HTML special characters other than those listed above are also deleted. Special characters tend to become noise during learning because they take up many more characters in the source than the text actually displayed in the Web browser; for example, &nbsp; is six characters in the source but is displayed as a single space.
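One way to remove such special characters before counting is sketched below. The regular expression is an assumption: it treats any named reference (`&name;`) or numeric reference (`&#160;`, `&#xA0;`) as an HTML special character and deletes it, which matches the text's instruction to delete rather than decode them.

```python
import re

# Named (&nbsp;), decimal (&#160;), and hexadecimal (&#xA0;) character references.
ENTITY_PATTERN = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def strip_special_characters(text):
    """Delete HTML character references so they do not inflate character counts."""
    return ENTITY_PATTERN.sub("", text)


print(strip_special_characters("a&lt;b&gt;c"))
```

A bare ampersand that is not followed by a name and a semicolon, as in "AT&T", is left untouched.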

As described above, the present invention automatically extracts various statistical feature amounts and uses them to automatically extract the main content portion of a Web document. Extracting the main content from Web documents as pre-processing for information retrieval and information recommendation techniques can improve analysis accuracy.

The operation of each component of the Web document main content extraction apparatus shown in FIG. 4 of the above embodiment can be implemented as a program, which can be installed and executed on a computer used as the Web document main content extraction apparatus, or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

The present invention can be used as a basic technique for analyzing Web documents: for example, as pre-processing to improve the accuracy of full-text search engines, for a function that creates an RSS feed from nothing more than a specified URL, for advertisement delivery matched to the main content of a Web document, and as pre-processing for analyzing a user's Web browsing history.

FIG. 1 is a diagram showing the principle configuration of the present invention.
FIG. 2 is main content example (1) in an embodiment of the present invention.
FIG. 3 is main content example (2) in an embodiment of the present invention.
FIG. 4 is a block diagram of the Web document main content extraction apparatus in an embodiment of the present invention.
FIG. 5 is a block diagram of the Web document acquisition/input unit in an embodiment of the present invention.
FIG. 6 is a block diagram of the Web document dividing unit in an embodiment of the present invention.
FIG. 7 is a block diagram of the feature amount extraction unit in an embodiment of the present invention.
FIG. 8 is a block diagram of the main content determination unit in an embodiment of the present invention.
FIG. 9 is a flowchart of the feature amount parameter estimation method in an embodiment of the present invention.
FIG. 10 is a block diagram of the main content output unit in an embodiment of the present invention.

Explanation of symbols

100 Web document main content extraction apparatus
110 Web document acquisition/input unit
111 Data input unit
112 Web document file input unit
113 URL input unit
114 Web document acquisition unit
115 Character code conversion unit
120 Document dividing means, Web document dividing unit
121 Advertisement target area extraction unit
122 Noise tag and region removal unit
123 Web document division processing unit
130 Feature amount extraction means, feature amount extraction unit
131 Anchor link information feature amount extraction unit
132 Tag information feature amount extraction unit
133 Feature amount extraction unit for character strings displayed in the Web document
134 Feature amount normalization unit
135 Feature amount ratio feature extraction unit
136 Memory
140 Main content determination means, main content determination unit
141 Feature amount input unit
142 Text determination unit
143 Main content determination processing unit
150 Main content output means, main content output unit
151 Tagged text output unit
152 Untagged text output unit
153 Data output unit
160 Storage means, storage unit

Claims

1. A Web document main content extraction apparatus for extracting the main content of a Web document, comprising:
document dividing means for, when a Web document is input, dividing the Web document into segments based on a predetermined division rule and storing the segments in a storage unit;
feature amount extraction means for extracting, for each segment divided by the document dividing means, feature amounts for main content determination and storing them in the storage unit for each segment;
main content determination means for determining, using a machine learning algorithm, whether each segment is main content based on its feature amounts; and
main content output means for combining the parts determined to be main content by the main content determination means and outputting them as the main content,
wherein the feature amount extraction means
extracts as feature amounts, for each divided segment, the number of punctuation marks included in the character string, the amount of the character string, the amount of the character string not including the character string displayed on the Web, the number of anchor links, the average amount of the anchor link character strings, the total amount of the anchor link character strings, the character string amount of the URL with the maximum anchor link destination, and a numerical value identifying whether an anchor link related to an advertisement is included.

2. The Web document main content extraction apparatus according to claim 1, wherein the feature amount extraction means
further extracts as feature amounts, for each divided segment, one or more of: the number of HTML tags related to text, the number of consecutive HTML tags related to text, the number of link list tags, the ratio of the number of HTML tags related to text to the number of consecutive HTML tags related to text, the ratio of the amount of character strings displayed on the Web to the number of tags, and the ratio of the number of anchor links to the number of link list tags.

3. The Web document main content extraction apparatus according to claim 1 or 2, wherein the feature amount extraction means
includes special character deletion means for deleting, before the extraction is performed, the special characters used to display in a Web browser the symbols used in HTML tags, as well as HTML special characters other than those special characters.

4. The Web document main content extraction apparatus according to any one of claims 1 to 3, wherein the document dividing means includes:
advertisement target area extraction means for extracting the advertisement target area when one exists in the input Web document;
noise removing means for removing tags or regions that cause noise from the Web document; and
dividing means for dividing the Web document output from the noise removing means using the predetermined division rule and storing the divided document in the storage unit.

5. The Web document main content extraction apparatus according to any one of claims 1 to 3, further comprising
normalizing means for normalizing the extracted feature amounts.

6. A Web document main content extraction program for causing a computer to function as each of the means constituting the Web document main content extraction apparatus according to claim 1.