JP5462591B2

JP5462591B2 - Specific content determination device, specific content determination method, specific content determination program, and related content insertion device

Info

Publication number: JP5462591B2
Application number: JP2009250646A
Authority: JP
Inventors: 志学岩淵
Original assignee: Rakuten Inc
Current assignee: Rakuten Group Inc
Priority date: 2009-10-30
Filing date: 2009-10-30
Publication date: 2014-04-02
Anticipated expiration: 2029-10-30
Also published as: JP2011096078A

Description

The present invention relates to the technical field of extracting the content that constitutes a Web page.

Techniques are conventionally known for acquiring the content that serves as the material of a Web page published on a Web site and generating new content based on the acquired content. For example, Non-Patent Document 1 discloses a technique in which, when a user specifies the URL of image data, the image data corresponding to that URL is acquired from the Web and a banner is automatically created based on the acquired image data.

“Automatic banner creation”, [online], [retrieved October 21, 2009], Internet <URL: http://hyperbannermaker.com/>

Each Web page of a Web site carries content in line with the purpose of that Web site. The contents of the Web pages of a site are therefore basically related to one another, but each page may also have some characteristic of its own. What determines the content of a Web page is the content (for example, text data and image data) of which the page is composed. Accordingly, among the content constituting a Web page there may be content that characterizes that page, that is, content specific to that Web page.

The technique described in Non-Patent Document 1 does extract content specific to a Web page, but not automatically: the user must specify the content by hand, so content specific to the Web page cannot be extracted easily. Consequently, content specific to a Web page cannot be extracted properly when the user cannot judge which content is specific to the page, or when the selection is skewed by the user's preferences. In addition, when the number of target Web pages is large, the user's workload becomes enormous.

It is also possible, for example, to extract all content of a particular type, such as only images or only text, based on the tags written in an HTML (HyperText Markup Language) document. However, the extracted content also includes commonplace content that is not specific to the Web page, so the extraction result is unreliable, and the user has had to search the extraction result for the specific content.

The present invention has been made in view of the above points, and its object is to provide a specific content determination device, a specific content determination method, a specific content determination program, and the like that can easily extract, from the content constituting a Web page, the content specific to that Web page.

To solve the above problem, the invention according to claim 1 comprises: extraction means for extracting the content constituting a designated Web page among a plurality of Web pages included in a predetermined site; calculation means for counting the frequency with which each content constituting the designated Web page is used in the other Web pages among the plurality of Web pages; and determination means for determining that, among the content constituting the designated Web page, content whose frequency of use in the other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.

According to this invention, the frequency with which each content constituting the designated Web page is used in the plurality of Web pages included in the predetermined site is counted. The lower the frequency with which content is used in other Web pages, the less that content appears outside the designated Web page. Therefore, by determining whether the frequency of use in other Web pages is equal to or less than a predetermined value, all content satisfying this condition is identified as content specific to the designated Web page. Content specific to a Web page can thus be extracted easily.
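The counting and threshold determination described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each page has already been reduced to a list of hashable content items (text strings or image URLs) and that two items are the same content when they compare equal. The function name `find_specific_contents`, the `pages` dictionary, and the default threshold of 0 are hypothetical and do not come from the patent.

```python
from collections import Counter


def find_specific_contents(pages, target_url, threshold=0):
    """Return the contents of the target page that are used on at most
    `threshold` other pages of the same site.

    pages: dict mapping a page URL to the list of content items on that page
           (an assumed, simplified representation of a site).
    """
    target_contents = pages[target_url]
    target_set = set(target_contents)

    # Count, for each content of the target page, how many OTHER pages use it.
    usage = Counter()
    for url, contents in pages.items():
        if url == target_url:
            continue
        for item in target_set & set(contents):
            usage[item] += 1

    # Keep page order, drop duplicates, and apply the threshold test.
    specific = []
    seen = set()
    for item in target_contents:
        if item in seen:
            continue
        seen.add(item)
        if usage[item] <= threshold:  # a missing key counts as 0
            specific.append(item)
    return specific


site = {
    "/p1": ["header", "article A", "footer"],
    "/p2": ["header", "article B", "footer"],
    "/p3": ["header", "article C", "footer"],
}
print(find_specific_contents(site, "/p1"))  # ['article A']
```

With a threshold of 0, only content that appears on no other page is treated as specific; a larger threshold tolerates content shared by a few pages.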

The invention according to claim 2 is the specific content determination device according to claim 1, wherein the extraction means extracts the content constituting a Web page in units of content groups each composed of one or more contents, the calculation means counts the frequency with which each content group constituting the designated Web page is used in other Web pages, and the determination means determines that, among the content groups constituting the designated Web page, a content group whose frequency of use in other Web pages is equal to or less than a predetermined value is a content group specific to the designated Web page.

According to this invention, content specific to a Web page is determined in units of content groups. Therefore, when contents that are displayed together as a unit on the Web page, or that are related to one another, are treated as a content group, the groups that constitute content specific to the Web page can be extracted.

The invention according to claim 3 is the specific content determination device according to claim 2, wherein the extraction means extracts content groups based on document data that is written in a predetermined markup language and indicates the content constituting the Web page.

According to this invention, content groups are extracted based on the document data indicating the content constituting the Web page, so content groups can be extracted properly.

The invention according to claim 4 is the specific content determination device according to claim 3, wherein the extraction means defines content groups based on predetermined tags in the document data indicating the content.

According to this invention, content groups are extracted based on predetermined tags. Therefore, when content specific to a Web page and content that is not specific are each grouped by the predetermined tags, the accuracy of determining the content specific to the Web page can be improved.

The invention according to claim 5 is the specific content determination device according to any one of claims 1 to 4, wherein the extraction means extracts, from a Web page on which a posted article is published, the comments posted on the article; the device further comprises classification means for classifying each extracted comment according to the content it expresses, and setting means for setting a threshold of appearance frequency, the setting means making the threshold smaller as the number of contents into which the comments are classified increases; the calculation means calculates the appearance frequency, in the Web page, of each content into which the comments are classified; and the determination means determines that a content whose calculated appearance frequency is equal to or less than the set threshold is content specific to the Web page.
The invention according to claim 6 comprises: an extraction step of extracting the content constituting a designated Web page among a plurality of Web pages included in a predetermined site; a calculation step of counting the frequency with which each content constituting the designated Web page is used in the other Web pages among the plurality of Web pages; and a determination step of determining that, among the content constituting the designated Web page, content whose frequency of use in the other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.
The invention according to claim 7 is the specific content determination method according to claim 6, wherein the extraction step extracts, from a Web page on which a posted article is published, the comments posted on the article; the method further includes a classification step of classifying each extracted comment according to the content it expresses, and a setting step of setting a threshold of appearance frequency, the setting step making the threshold smaller as the number of contents into which the comments are classified increases; the calculation step calculates the appearance frequency, in the Web page, of each content into which the comments are classified; and the determination step determines that a content whose calculated appearance frequency is equal to or less than the set threshold is content specific to the Web page.
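The comment-based variant (classify comments, set a threshold that shrinks as the number of categories grows, then keep the rare categories) might be sketched as follows. The patent text here does not give a formula for the threshold, so the rule `base / number_of_categories` is only one assumed way of making the threshold decrease; `set_threshold`, `specific_comment_categories`, `toy_classify`, and the `base` parameter are all hypothetical names introduced for illustration.

```python
from collections import Counter


def set_threshold(num_categories, base=6.0):
    """Assumed rule: the threshold strictly decreases as categories increase."""
    if num_categories < 1:
        raise ValueError("num_categories must be at least 1")
    return base / num_categories


def specific_comment_categories(comments, classify, base=6.0):
    """Classify comments, then return the categories whose appearance
    frequency on the page is at or below the threshold."""
    counts = Counter(classify(c) for c in comments)
    if not counts:
        return []
    threshold = set_threshold(len(counts), base)
    return sorted(cat for cat, n in counts.items() if n <= threshold)


def toy_classify(comment):
    # Hypothetical stand-in for a real classifier: label by keyword.
    return "photo" if "photo" in comment.lower() else "other"


comments = [
    "Nice photo!",
    "Great photo",
    "Love this photo",
    "The photo is lovely",
    "Where is this cafe?",
]
print(specific_comment_categories(comments, toy_classify))  # ['other']
```

Here two categories give a threshold of 3.0, so the category seen once is kept and the category seen four times is not.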

The invention according to claim 8 causes a computer to function as: extraction means for extracting the content constituting a designated Web page among a plurality of Web pages included in a predetermined site; calculation means for counting the frequency with which each content constituting the designated Web page is used in the other Web pages among the plurality of Web pages; and determination means for determining that, among the content constituting the designated Web page, content whose frequency of use in the other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.
The invention according to claim 9 is the specific content determination program according to claim 8, wherein the extraction means extracts, from a Web page on which a posted article is published, the comments posted on the article; the program further causes the computer to function as classification means for classifying each extracted comment according to the content it expresses, and as setting means for setting a threshold of appearance frequency, the setting means making the threshold smaller as the number of contents into which the comments are classified increases; the calculation means calculates the appearance frequency, in the Web page, of each content into which the comments are classified; and the determination means determines that a content whose calculated appearance frequency is equal to or less than the set threshold is content specific to the Web page.

The invention according to claim 11 comprises: the specific content determination device according to any one of claims 1 to 6; and insertion means for inserting, into the designated Web page, related content related to the content determined by the specific content determination device to be specific content.

According to this invention, content related to the content determined to be specific content is inserted into the designated Web page, so information related to the characteristics of the Web page can be added to that Web page.

The invention according to claim 12 is the related content insertion device according to claim 11, wherein, when the content constituting the designated Web page includes text data of a posted article, the specific content determination device determines that the text data is content specific to the Web page; the related content insertion device further comprises feature word extraction means for extracting a feature word of the designated Web page from the text data of the article determined by the specific content determination device to be specific content, and selection means for selecting, as the related content, content related to the extracted feature word from among a plurality of contents each stored in storage means in association with a word; and the insertion means inserts the selected related content into the designated Web page.

According to this invention, provided that the text data of each article contains content specific to that article, the specific content determination device can extract the text data of each article. Information related to the content of the article published on the Web page can thereby be added to that Web page.

According to the present invention, the lower the frequency with which content is used in other Web pages, the less that content appears on the plurality of Web pages other than the designated Web page. Therefore, by determining whether the frequency of use in other Web pages is equal to or less than a predetermined value, all content satisfying this condition is identified as content specific to the designated Web page. Content specific to a Web page can thus be extracted easily.

A diagram showing an example of the schematic configuration of the blog system S according to one embodiment.
A block diagram showing an example of the schematic configuration of the blog server 1 according to one embodiment.
A diagram showing an outline of the processing from when a blogger is designated until advertising content is inserted into a blog page.
A diagram showing an example of the structure of a Web page.
A diagram showing an example of a DOM tree generated from an HTML document.
A diagram showing an example of the content block correspondence information stored in the storage unit 15.
A flowchart showing an example of the advertising content insertion process performed by the system control unit 20 of the blog server 1 according to one embodiment.
A flowchart showing an example of the single-page extraction process performed by the system control unit 20 of the blog server 1 according to one embodiment.
A flowchart showing an example of the tree search process performed by the system control unit 20 of the blog server 1 according to one embodiment.
A flowchart showing an example of the specific content block determination process performed by the system control unit 20 of the blog server 1 according to one embodiment.
A flowchart showing an example of the blog update process performed by the system control unit 20 of the blog server 1 according to a modification of one embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. The embodiment described below applies the present invention to a server device that transmits blog pages in a blog system providing a blog service.

[1. Configuration and Functional Overview of the Blog System]
First, the configuration and functional overview of the blog system S according to this embodiment are described with reference to FIG. 1.

FIG. 1 is a diagram showing an example of the schematic configuration of the blog system S according to this embodiment.

As shown in FIG. 1, the blog system S includes a blog server 1, which is an example of the specific content determination device and the related content insertion device, a management terminal 2, and a plurality of user terminals 3. The blog server 1 and each user terminal 3 can exchange data with each other via a network NW using, for example, TCP/IP as the communication protocol. The network NW is built from, for example, the Internet, dedicated communication lines (for example, CATV (Community Antenna Television) lines), a mobile communication network (including base stations and the like), gateways, and so on. The blog server 1 and the management terminal 2 are connected via a network such as a LAN (Local Area Network).

In the blog system S configured in this way, the blog server 1 is a Web server that transmits the Web pages constituting a blog service site in response to requests from the user terminals 3. When a user of a user terminal 3 registers as a user of the blog service site, that user can run his or her own blog on the site. A registered user (blogger) can access the blog service site and update the blog, that is, add a blog article (the record of a single blog entry). In response to such an update, the blog server 1 generates or updates, as a Web page of the blog, a blog page on which one or more blog articles are published. The blog server 1 has a blog page DB 101 and registers the blog pages in it.

The blog server 1 also inserts advertising content (an example of related content) into the blog pages of a blogger designated by the system administrator. Advertising content includes, for example, text data of advertising copy, banner image data, video data, and rich Internet applications (RIAs) created with Adobe Flash (trademark), Silverlight (trademark), or the like. The advertising content inserted into each blog page is content presenting an advertisement for a product or service related to the blog articles published on that page. For this purpose, the blog server 1 has an advertisement DB 102 in which a plurality of advertising contents are registered. The blog server 1 extracts the blog articles from a blog page, extracts feature words from the blog articles, and selects advertising content related to the extracted feature words.

The user terminal 3 is a terminal device used by a user acting as a blogger or by a user browsing blogs. A personal computer, a PDA, a mobile phone, or the like is used as the user terminal 3.

The management terminal 2 is a terminal device used by the system administrator of the blog system S. A personal computer or the like is used as the management terminal 2.

[2. Configuration and Functions of the Blog Server]
Next, the configuration and functions of the blog server 1 are described with reference to FIG. 2.

FIG. 2 is a block diagram showing an example of the schematic configuration of the blog server 1 according to this embodiment. FIG. 3 is a diagram showing an outline of the processing from when a blogger is designated until advertising content is inserted into a blog page. FIG. 4 is a diagram showing an example of the structure of a Web page. FIG. 5 is a diagram showing an example of a DOM tree generated from an HTML document. FIG. 6 is a diagram showing an example of the content block correspondence information stored in the storage unit 15.

As shown in FIG. 2, the blog server 1 includes an operation unit 11, a display unit 12, a communication unit 13, a drive unit 14, a storage unit 15 as an example of storage means, an input/output interface unit 16, and a system control unit 20. The system control unit 20 and the input/output interface unit 16 are connected via a system bus 21.

The operation unit 11 consists of, for example, a keyboard and a mouse; it accepts operation instructions from the system administrator or others and outputs the content of each instruction to the system control unit 20 as an instruction signal. The display unit 12 consists of, for example, a CRT (Cathode Ray Tube) display or a liquid crystal display, and displays information such as characters and images. The communication unit 13 connects to the network NW and the like and controls the state of communication with the management terminal 2, the user terminals 3, and so on. The drive unit 14 reads data and the like from a disk DK such as a flexible disk, a CD (Compact Disc), or a DVD (Digital Versatile Disc), and also records data and the like on the disk DK.

The storage unit 15 consists of, for example, a hard disk drive and stores various programs, data, and the like. The blog page DB 101 and the advertisement DB 102 are built in the storage unit 15. In the blog page DB 101, each blog page constituting the blog service site (the HTML document of the blog page (an example of document data), the image data that serves as material for the blog page, and so on) is registered in association with, for example, the URL of the page and the user ID that identifies the blogger. In the advertisement DB 102, a plurality of advertising contents are registered, each in association with keywords related to the product or service it advertises. When an advertising content includes content other than text data, the URL of that content is also registered in association with it. Furthermore, so that a user who selects advertising content displayed on a blog page is taken to the Web page for the advertised product or service, the URL of that Web page is also registered in association with the advertising content.

The input/output interface unit 16 performs interface processing between the operation unit 11 through the storage unit 15 and the system control unit 20. The system control unit 20 consists of a CPU (Central Processing Unit) 17, a ROM (Read Only Memory) 18, a RAM (Random Access Memory) 19, and the like.

The system control unit 20 controls each unit of the blog server 1 by having the CPU 17 read and execute the various programs stored in the ROM 18 and the storage unit 15. By executing advertising content insertion software (an example of the specific content determination program), the system control unit 20 functions as the extraction means, calculation means, determination means, and insertion means. The advertising content insertion software and the like may, for example, be acquired from another server device via the network NW, or may be recorded on a disk DK such as a CD-ROM and read via the drive unit 14.

The advertising content insertion software is a program for inserting advertising content into blog pages. As shown in FIG. 3, it consists of a manager unit, a material extraction engine, a text analysis engine, an advertisement selection unit, and the like. The manager unit controls the execution of the material extraction engine, the text analysis engine, and the advertisement selection unit. The material extraction engine is software that extracts content, as Web material, from the HTML document of a blog page and determines which content is specific to the blog page. Content is extracted in units of content blocks (an example of content groups), described later. In this embodiment, for example, a blog article containing content specific to that article corresponds to a content block specific to the blog page.

The text analysis engine is software that extracts the feature words of a blog page from the blog articles extracted as content specific to that page. The advertisement selection unit is software that selects advertising content related to the blog page, using the extracted feature words as keywords.

An outline of the insertion of advertising content is given below. As shown in FIG. 3, the system administrator designates the user ID of the target blogger (1). The system control unit 20 then acquires the HTML documents of all blog pages corresponding to the designated user ID from the blog page DB 101, analyzes them, and extracts the content serving as Web material in units of content blocks. As the extraction result, it generates content block correspondence information (an example of content information) for each extracted content block (2). Next, the system control unit 20 calculates the appearance frequency of each extracted content block across all blog pages corresponding to the designated user ID. The appearance frequency calculated in this embodiment is, for example, the number of appearances (count). For each blog page, the system control unit 20 then determines that a content block whose appearance frequency is equal to or less than a predetermined threshold is a content block specific to that blog page (3).

The system control unit 20 analyzes each content block determined to be specific, that is, each blog article, by morphological analysis or the like, and extracts a feature word for each blog page (4). Various methods of extracting feature words exist and are well known, so a detailed description is omitted. As one example, the word with the highest appearance frequency is taken as the feature word.
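
As a rough illustration of the "most frequent word" example, the following Python sketch counts words in the text of a specific content block. The function name `extract_feature_word`, the stop-word set, and the regular-expression tokenizer are assumptions made for illustration only; the tokenizer merely stands in for morphological analysis, which this description does not detail.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the description does not define one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def extract_feature_word(texts):
    """Return the most frequent non-stop word across the given text fragments.

    A regex tokenizer stands in for morphological analysis (an assumption).
    Returns None when no candidate word is found.
    """
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    if not counts:
        return None
    # most_common(1) yields a one-element list with the highest-count (word, count) pair.
    return counts.most_common(1)[0][0]
```

For Japanese text, a real morphological analyzer would be needed in place of the regular expression; the counting step would stay the same.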

Next, the system control unit 20 refers to the advertisement DB 102 and selects advertisement content related to the extracted feature word (5). The system control unit 20 then inserts, into the HTML document of the blog page, a specification (a tag, a description of the data itself, or the like) for displaying the selected advertisement content in the blog page (6).

Next, the method of extracting content blocks will be described. In the present embodiment, text data and image data are extracted as Web material.

For example, assume that the blog page has the layout shown in FIG. 4. Each piece of content serving as Web material is displayed on the blog page as part of a group. Each such group corresponds to a content block. Content is divided into content blocks by the DIV tags and TABLE tags (an example of predetermined tags) written in the HTML document. That is, content is blocked (grouped) by DIV tags and TABLE tags (hereinafter referred to as "blocking tags").

FIG. 4 shows content blocks 301 to 306. Content block 301 is, for example, the content block of the page header and consists of text A and image a. Content block 302 is, for example, the content block of a navigation section for moving to Web pages about products in various categories, and consists of text B, text C, and text D, each indicating a link to another Web page. Content block 303 is, for example, the content block corresponding to the blog display area, and consists of text E indicating a heading of the blog or the like, content block 304, and content block 305. Content blocks may thus be nested, that is, arranged hierarchically. In this case, the only content treated as belonging to content block 303 is text E, and content blocks 304 and 305 are treated as independent of content block 303. Content blocks 304 and 305 are each one blog article. Content block 304 consists of texts F and G, which give the title and body of a blog article. Content block 305 consists of texts H, I, and J, which give the title and body of a blog article, and images b and c, which the blogger registered in connection with the article. Content block 306 is, for example, a content block showing a copyright notice and consists of text I.

Of these content blocks, content blocks 301, 302, 303, and 306 also appear relatively frequently on blog pages other than the one shown in FIG. 4. Content blocks 304 and 305, on the other hand, are basically used only on this blog page. Content block 304 or content block 305 is therefore determined to be a content block specific to this blog page.

In the present embodiment, a content block corresponding to a blog article whose content is specific to that article needs to be determined to be a specific content block. One page may contain several blog articles with such specific content. For this reason, every content block whose appearance frequency is equal to or less than the predetermined threshold is treated as a specific content block. For example, the threshold is set to 1. A blog article with specific content is then determined to be a specific content block, whereas a blog article containing only content similar to other blog articles is not. Likewise, content blocks that are common to the blog pages, such as the header, navigation, and copyright notice sections, each appear two or more times, so they are not determined to be specific content blocks either. The threshold is stored in the storage unit 15 in advance.
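
The threshold rule can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `find_specific_blocks`, the representation of a block as a `frozenset` of content items, and the use of exact set equality as the matching rule are all assumptions.

```python
from collections import Counter


def find_specific_blocks(pages, threshold=1):
    """Map each page URL to its blocks whose appearance count is <= threshold.

    `pages` maps a page URL to a list of blocks; each block is a frozenset of
    content items (text strings or image URLs). Exact set equality is the
    matching rule assumed here.
    """
    # Count appearances of every block over all pages of the same blogger.
    counts = Counter(block for blocks in pages.values() for block in blocks)
    return {
        url: [block for block in blocks if counts[block] <= threshold]
        for url, blocks in pages.items()
    }
```

With a threshold of 1, a header block shared by two pages has a count of 2 and is dropped, while an article block that appears on a single page is kept.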

FIG. 5 shows the HTML document of the blog page of FIG. 4 as a DOM (Document Object Model) tree, that is, as a tree structure. In the DOM tree of FIG. 5, nodes for tags that are not needed to explain the present embodiment are omitted.

In the DOM tree, DIV nodes representing DIV tags and TABLE nodes representing TABLE tags are the nodes that group content into content blocks (hereinafter referred to as "blocking nodes"). The system control unit 20 traverses the DOM tree, for example by depth-first search, and determines the content blocks as it goes. Specifically, when the system control unit 20 finds a blocking node, it groups the content defined in the nodes of the subtree rooted at that node into one content block. However, when content blocks are defined hierarchically, so that after finding a blocking node (hereinafter, "upper blocking node") the system control unit 20 finds a further blocking node (hereinafter, "lower blocking node") among its descendants, the content block is divided. For example, taking nodes closer to the root node to be hierarchically higher, the content block corresponding to the subtree rooted at the upper blocking node (hereinafter, "upper subtree") is divided into a content block corresponding to the subtree rooted at the lower blocking node (hereinafter, "lower subtree") and a content block corresponding to the part of the upper subtree that excludes the lower subtree (for example, content block 304 and content block 303). In this case, the former content block is treated as hierarchically lower than the latter. For example, the level of content blocks 301, 302, 303, and 306 is 1, and the level of content blocks 304 and 305 is 2. That is, the smaller the level value, the higher the block is in the hierarchy.

In terms of the tags written in the HTML document, this works as follows. Basically, when a blocking tag is written, the content specified within the range enclosed by that blocking tag is grouped into the content block corresponding to that tag. However, when blocking tags are nested, the content block corresponding to a given blocking tag consists of the content specified within the range enclosed by that tag, excluding any content specified within ranges enclosed by blocking tags nested inside it.
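
The tag-based view of block extraction can be sketched with Python's standard `html.parser` module. This is only an illustrative sketch under stated assumptions: the class name `BlockExtractor`, the function `extract_blocks`, and the dictionary layout of each block are hypothetical, the markup is assumed to be well formed (every DIV and TABLE tag is closed in order), and content outside any blocking tag is simply skipped.

```python
from html.parser import HTMLParser

BLOCKING_TAGS = {"div", "table"}


class BlockExtractor(HTMLParser):
    """Collect text and image URLs into the innermost enclosing DIV/TABLE block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # one dict per block: {"level": int, "contents": [...]}
        self._open = []    # indices of the blocks that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKING_TAGS:
            # A nested blocking tag starts a new, separate block one level down.
            self._open.append(len(self.blocks))
            self.blocks.append({"level": len(self._open), "contents": []})
        elif tag == "img" and self._open:
            src = dict(attrs).get("src")
            if src:
                self.blocks[self._open[-1]]["contents"].append(src)

    def handle_endtag(self, tag):
        if tag in BLOCKING_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            self.blocks[self._open[-1]]["contents"].append(text)


def extract_blocks(html):
    """Return the list of blocks found in the given HTML string."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Because content is always attached to the innermost open block, text that follows a nested block but still sits inside the outer tag goes back to the outer block, which matches the exclusion rule described above.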

When content blocks have been extracted in this way, the system control unit 20 temporarily stores content block correspondence information indicating the extraction result in the storage unit 15. As shown in FIG. 6, content block correspondence information (reference numeral 401) is stored for each content block. It consists of a URL field (reference numeral 402) holding the URL of the source HTML document and block configuration information (reference numeral 403). Each extracted piece of content is set in the block configuration information. For text data, the content of the text node in the DOM tree is set. For image data, instead of the image data itself, the URL of the image data given as the src attribute of the IMG node (the node representing an IMG tag in the DOM tree) is set. Note that in the present embodiment, feature words are extracted from the content block determined to be specific to the blog page, that is, from a blog article, so it suffices to extract text data; image data need not be extracted.

[3. Operation of the blog system]
Next, the operation of the blog system S will be described with reference to FIGS. 7 to 10.

FIG. 7 is a flowchart showing an example of the advertisement content insertion process performed by the system control unit 20 of the blog server 1 according to the present embodiment.

The advertisement content insertion process starts, for example, when a request to execute it is sent from the management terminal 2 in response to an operation by the system administrator.

When the system administrator designates the user ID of the blogger who runs the blog into which advertisement content is to be inserted, the system control unit 20 receives the designated user ID from the management terminal 2, as shown in FIG. 7 (step S1).

Next, the system control unit 20 sets the block count NUM to 0 (step S2). The block count NUM is the number of content blocks found so far. NUM is a global variable and can be accessed from the one-page extraction process and the tree search process described later.

Next, the system control unit 20 acquires the HTML document of the first blog page corresponding to the received user ID from the blog page DB 101 (step S3). It then designates the acquired HTML document and executes the one-page extraction process described later (step S4). In the one-page extraction process, content blocks are extracted from the acquired HTML document and content block correspondence information is stored.

Next, the system control unit 20 determines whether content blocks have been extracted from all blog pages corresponding to the received user ID (step S5). If a blog page remains from which content blocks have not been extracted (step S5: NO), the system control unit 20 acquires the HTML document of the next blog page from the blog page DB 101 (step S6) and returns to step S4. When the system control unit 20 has extracted the content blocks of all blog pages by repeating steps S4 to S6 (step S5: YES), it proceeds to step S7.

In step S7, the system control unit 20 identifies the HTML document of the first blog page corresponding to the received user ID.

Next, the system control unit 20 designates the identified HTML document and executes the specific content block determination process described later (step S8). In this process, content blocks are extracted from the identified HTML document and the content blocks specific to the blog page are determined.

Next, the system control unit 20 extracts the feature words of the blog page from the text data making up the content blocks determined to be specific (step S9). Based on the extracted feature words, the system control unit 20 then inserts advertisement content related to the blog page into that blog page (step S10). Specifically, the system control unit 20 uses an extracted feature word as a keyword and refers to the advertisement DB 102 to select the advertisement content corresponding to that keyword. It then inserts the specification of the selected advertisement content at a predetermined position in the identified HTML document. For example, if the advertisement content includes text data, the system control unit 20 adds the content of that text data to the HTML document. If the advertisement content includes image data, it adds an IMG tag for displaying that image data to the HTML document. It also adds, for example, link information pointing to a Web page about the advertised product or service to the HTML document.
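
Step S10 can be sketched in Python as below. Everything specific here is an assumption made for illustration: the advertisement DB is modeled as a plain dictionary keyed by feature word, the field names `text`, `image_url`, and `link_url` are hypothetical, the `<p class="ad">` wrapper is arbitrary, and "a predetermined position" is taken to mean just before the closing BODY tag.

```python
import re
from html import escape


def build_ad_snippet(ad):
    """Build the HTML fragment (text, IMG tag, link) for one advertisement."""
    parts = []
    if ad.get("text"):
        parts.append(escape(ad["text"]))
    if ad.get("image_url"):
        parts.append('<img src="{}">'.format(escape(ad["image_url"], quote=True)))
    body = " ".join(parts)
    if ad.get("link_url"):
        body = '<a href="{}">{}</a>'.format(escape(ad["link_url"], quote=True), body)
    return '<p class="ad">{}</p>'.format(body)


def insert_ad(html, feature_word, ad_db):
    """Insert the ad selected by feature_word before </body>; no match leaves html as is."""
    ad = ad_db.get(feature_word)
    if ad is None:
        return html
    snippet = build_ad_snippet(ad)
    match = re.search(r"</body\s*>", html, flags=re.IGNORECASE)
    if match is None:
        # No closing BODY tag found: append at the end (an arbitrary fallback).
        return html + snippet
    return html[:match.start()] + snippet + html[match.start():]
```

The wrapper deliberately avoids DIV and TABLE so that the inserted advertisement does not itself become a new content block if the page is analyzed again.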

After inserting the advertisement content specification into the identified HTML document, the system control unit 20 uses that HTML document to update the HTML document registered in the blog page DB 101 (step S11).

Next, the system control unit 20 determines whether advertisement content has been inserted into all blog pages corresponding to the received user ID (step S12). If a blog page remains into which advertisement content has not been inserted (step S12: NO), the system control unit 20 identifies the HTML document of the next blog page (step S13) and returns to step S8. When advertisement content has been inserted into all blog pages by repeating steps S8 to S13 (step S12: YES), the system control unit 20 deletes all the content block correspondence information it had saved in the storage unit 15 (step S14). After this, the system control unit 20 ends the advertisement content insertion process.

FIG. 8 is a flowchart showing an example of the one-page extraction process performed by the system control unit 20 of the blog server 1 according to the present embodiment.

As shown in FIG. 8, the system control unit 20 first generates the DOM tree of the acquired HTML document in the RAM 19 (step S21).

Next, the system control unit 20 sets the level LV to 0 (step S22). The level LV is the level of the content block to which the node currently being searched in the DOM tree belongs. LV is a global variable and can be accessed from the one-page extraction process and the tree search process described later.

Next, the system control unit 20 designates the root node of the DOM tree (step S23) and executes the tree search process (step S24). The tree search process can be called recursively. Through it, all content blocks are extracted from the Web page and content block correspondence information is generated.

Next, the system control unit 20 saves each item of content block correspondence information generated by the tree search process in the storage unit 15 (step S25). After this, the system control unit 20 ends the one-page extraction process.

FIG. 9 is a flowchart showing an example of the tree search process performed by the system control unit 20 of the blog server 1 according to the present embodiment.

As shown in FIG. 9, the system control unit 20 first determines the type of the designated node (step S31). If the designated node is a DIV node or a TABLE node (a blocking node), that is, if a content block has been found (step S31: DIV or TABLE), the system control unit 20 proceeds to step S32.

In step S32, the system control unit 20 adds 1 to the block count NUM and adds 1 to the level LV. Next, it sets the block number BN[LV] to NUM (step S33). BN[LV] is the block number of the content block at level LV to which the node currently being searched belongs. Block numbers are assigned in the order in which content blocks are found. BN[LV] is also a global variable.

Next, the system control unit 20 initializes the content block correspondence information for the content block with block number BN[LV] (step S34). Specifically, it allocates an area in the RAM 19 for storing the content block correspondence information and sets the URL of the acquired HTML document in that area.

Next, the system control unit 20 determines whether any child node of the designated node has not yet been searched (step S35). If an unsearched child node exists (step S35: YES), the system control unit 20 proceeds to step S36.

In step S36, the system control unit 20 designates one of the unsearched child nodes and then executes the tree search process on it (step S37). When that tree search process finishes, the system control unit 20 returns to step S35.

When the system control unit 20 has finished the tree search process for all child nodes by repeating steps S35 to S37 (step S35: NO), it proceeds to step S38. It also proceeds to step S38 when the designated node has no child nodes at all. In step S38, the system control unit 20 subtracts 1 from the level LV and ends the tree search process.

If, in step S31, the designated node is a text node (step S31: text), the system control unit 20 adds the content of the designated node (text data) to the block configuration information in the content block correspondence information for the content block with block number BN[LV] (step S39). After this, the system control unit 20 ends the tree search process.

If, in step S31, the designated node is an IMG node (step S31: IMG), the system control unit 20 acquires the URL of the image data set as the src attribute of the designated node and adds that URL to the block configuration information in the content block correspondence information for the content block with block number BN[LV] (step S40). After this, the system control unit 20 ends the tree search process.

If, in step S31, the designated node is none of a DIV node, a TABLE node, a text node, or an IMG node (step S31: other), the system control unit 20 determines whether any child node of the designated node has not yet been searched (step S41). If an unsearched child node exists (step S41: YES), the system control unit 20 designates one of the unsearched child nodes (step S42) and executes the tree search process on it (step S43). When that tree search process finishes, the system control unit 20 returns to step S41.

On the other hand, when the tree search process has finished for all child nodes of the designated node, or when the designated node has no child nodes at all (step S41: NO), the system control unit 20 ends the tree search process.
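
The recursive tree search of FIG. 9 can be sketched in Python as follows. The `Node` and `TreeSearch` classes are hypothetical stand-ins for a DOM tree and for the global variables NUM, LV, and BN[LV]; the `level` field recorded for each block and the decision to skip text or images found outside any block (LV equal to 0) are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                   # "div", "table", "text", "img", or another tag name
    children: list = field(default_factory=list)
    value: str = ""                             # text content, or the src URL for "img"


class TreeSearch:
    """Holds NUM, LV and BN[LV] as attributes instead of global variables."""

    def __init__(self, page_url):
        self.page_url = page_url
        self.num = 0      # NUM: number of blocks found so far
        self.lv = 0       # LV: level of the block being searched
        self.bn = {}      # BN[LV]: block number of the open block at each level
        self.info = {}    # block number -> content block correspondence information

    def search(self, node):
        if node.kind in ("div", "table"):       # S31: blocking node found
            self.num += 1                       # S32
            self.lv += 1
            self.bn[self.lv] = self.num         # S33
            self.info[self.num] = {             # S34
                "url": self.page_url,
                "level": self.lv,
                "contents": [],
            }
            for child in node.children:         # S35 to S37
                self.search(child)
            self.lv -= 1                        # S38
        elif node.kind in ("text", "img"):      # S39 / S40
            if self.lv > 0:                     # assumption: skip content outside any block
                self.info[self.bn[self.lv]]["contents"].append(node.value)
        else:                                   # S41 to S43
            for child in node.children:
                self.search(child)
```

Because LV is decremented when a blocking node is finished, BN[LV] again refers to the enclosing block, so content that follows a nested block is attributed to the outer block.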

FIG. 10 is a flowchart showing an example of the specific content block determination process performed by the system control unit 20 of the content generation server 1 according to the present embodiment.

As shown in FIG. 10, the system control unit 20 first proceeds as in the one-page extraction process: it generates the DOM tree of the designated HTML document (step S61), sets the block count NUM and the level LV to 0 (step S62), designates the root node of the DOM tree (step S63), and executes the tree search process (step S64).

Next, the system control unit 20 sets the block number i to 1 (step S65). It then calculates the appearance frequency of the content block with block number i (step S66).

Specifically, the system control unit 20 compares the block configuration information of content block correspondence information i (the content block correspondence information for the content block with block number i), generated in the tree search process of step S64, with the block configuration information of each item of content block correspondence information stored in the storage unit 15. When the contents of the block configuration information match, the system control unit 20 counts one appearance. In doing so, it may ignore the order in which content is listed in the block configuration information. The system control unit 20 may also count one appearance when part of the content specified in the block configuration information of stored content block correspondence information matches all of the content specified in the block configuration information of content block correspondence information i. Furthermore, when comparing text data specified in the block configuration information, the system control unit 20 may compare the substantive content expressed by the text rather than determining whether the sentences themselves are identical. For example, it may extract words from each item of text data by morphological analysis or the like and compare the extracted words. It may then judge that the text data match when all words match, or when the words match at or above a predetermined ratio. In this way, the system control unit 20 compares the block configuration information of content block correspondence information i with the block configuration information of all content block correspondence information stored in the storage unit 15 and calculates the appearance frequency.
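
The matching options described for step S66 can be sketched in Python as follows. The function names and parameters are hypothetical, blocks are modeled as plain lists of content items, the regular-expression word split stands in for morphological analysis, and the Jaccard ratio used in `texts_match` is only one possible reading of "a predetermined ratio", which the description does not define.

```python
import re
from collections import Counter


def blocks_match(target, stored, allow_subset=False):
    """Compare two blocks (lists of content items), ignoring item order.

    With allow_subset=True, a stored block also matches when part of its
    content covers all of the target block's content.
    """
    t, s = Counter(target), Counter(stored)
    if allow_subset:
        return all(s[item] >= n for item, n in t.items())
    return t == s


def appearance_count(target, stored_blocks, allow_subset=False):
    """Count how many stored blocks match the target block."""
    return sum(blocks_match(target, s, allow_subset) for s in stored_blocks)


def texts_match(a, b, min_ratio=1.0):
    """Compare two texts by their word sets instead of their exact wording."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    union = wa | wb
    if not union:
        return True  # two texts without any words are treated as equal
    return len(wa & wb) / len(union) >= min_ratio
```

Since the stored information already includes the blocks of the page being examined, a block that is unique to its page yields a count of 1, which is why a threshold of 1 selects it.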

After calculating the appearance frequency, the system control unit 20 determines whether it is equal to or less than the threshold stored in the storage unit 15 (step S67). If the appearance frequency is equal to or less than the threshold (step S67: YES), the system control unit 20 determines that the content block with block number i is one of the specific content blocks (step S68). That is, it adds the content block with block number i to the content blocks specific to the blog page to which the designated HTML document corresponds.

When the appearance frequency is greater than the threshold (step S67: NO), or after finishing step S68, the system control unit 20 adds 1 to the block number i (step S69) and determines whether i is greater than the block count NUM (step S70). If i is equal to or less than NUM (step S70: NO), it returns to step S66. When the appearance frequencies of all content blocks extracted in the tree search process have been calculated (step S70: YES), the system control unit 20 ends the specific content block determination process.

Although the system control unit 20 extracts content blocks by the tree search process in step S64, content blocks have already been extracted for all blog pages corresponding to the received blogger's user ID in the one-page extraction process (step S4 in FIG. 7) executed from the advertisement content insertion process, and the resulting content block correspondence information is stored in the storage unit 15, so the content blocks need not be extracted again. In that case, the content block correspondence information of each content block making up the blog page of the designated HTML document can be acquired from the storage unit 15 based on the URL of that HTML document.

[4. Modification 1]
Next, a modification of this embodiment will be described with reference to FIG. 11.

In the description so far, advertisement content is inserted into the blog pages of a blogger when the system administrator designates that blogger. Alternatively, advertisement content may be inserted at the time the blog is updated.

FIG. 11 is a flowchart showing an example of the blog update processing performed by the system control unit 20 of the blog server 1 according to the modification of this embodiment. In FIG. 11, processes identical to those in FIG. 7 are given the same step numbers.

First, prior to updating the blog, the blogger operates the user terminal 3 to access the blog service site and logs in by entering his or her user ID and password. Upon this login, the blog server 1 issues a session ID to the user terminal 3 and manages the session ID in association with the user ID. Because each request from the user terminal 3 to the blog server 1 includes the session ID, the blog server 1 can identify which blogger sent the request.

When the blogger then performs an operation to register a new blog article, the user terminal 3 transmits the blog article data (text data such as the title and body, image data, and so on) to the blog server 1, and, as shown in FIG. 11, the system control unit 20 of the blog server 1 receives the blog article data (step S71). Next, the system control unit 20 acquires from the blog page DB 101 the HTML document of the blog page to be updated, from among the blog pages corresponding to the blogger's user ID (step S72). Next, the system control unit 20 updates the acquired HTML document based on the received blog article data (step S73). For example, the system control unit 20 adds a TABLE tag or DIV tag for the blog article to the acquired HTML document and adds the title, the body text data, and other data of the received blog article between the opening and closing tags. Next, the system control unit 20 updates the HTML document registered in the blog page DB 101 with the HTML document to which the blog article data has been added (step S74).
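As a rough illustration of step S73, the following sketch wraps a received article in a DIV element and adds it to an HTML document held as a string. The function name `append_article`, the `class="article"` attribute, the H2 and P markup, and the choice to place the block just before the closing BODY tag are all assumptions made for this example; the description only states that the article data is added between block tags.

```python
import re
from html import escape


def append_article(html_doc, title, body):
    """Return html_doc with the article added as its own DIV block,
    placed just before the last </body> tag (or appended if none)."""
    # Escape the text so that article content cannot break the markup.
    block = (
        '<div class="article">'
        f"<h2>{escape(title)}</h2>"
        f"<p>{escape(body)}</p>"
        "</div>"
    )
    # Locate the last closing BODY tag, ignoring case.
    matches = list(re.finditer(r"</body\s*>", html_doc, flags=re.IGNORECASE))
    idx = matches[-1].start() if matches else len(html_doc)
    return html_doc[:idx] + block + html_doc[idx:]


updated = append_article(
    "<html><body><div>nav</div></body></html>", "T", "a<b"
)
```

Because the article sits in its own DIV, a later extraction pass that groups content by DIV or TABLE tags would treat it as a separate content block.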

Next, the system control unit 20 extracts content blocks from all blog pages corresponding to the blogger's user ID (steps S3 to S6).

Next, the system control unit 20 designates the HTML document updated in step S73 and executes the specific content block determination process (step S8), and then extracts the feature words of the blog page from the text data constituting the content blocks determined to be specific (step S9).

Next, the system control unit 20 deletes the existing advertisement content definition from the designated HTML document (step S75) and inserts the definition of related advertisement content, using the extracted feature words as keywords (step S10). In other words, the system control unit 20 changes the advertisement content displayed on the blog page.

The system control unit 20 then updates the HTML document registered in the blog page DB 101 with the HTML document into which the advertisement content definition has been inserted (step S11), and deletes all content block correspondence information from the storage unit 15 (step S14).

Note that the processing performed when a new blog page must be generated as a result of a blog update may be basically the same as the processing described above. However, because no advertisement content has yet been inserted into a newly generated blog page, the deletion of the advertisement content definition in step S75 is not performed.

[5. Modification 2]
In the description so far, the threshold used to determine content specific to a blog page is set to one occurrence, but a value of two or more occurrences may be set as the threshold.

For example, when the threshold is one occurrence, content blocks (blog articles) whose appearance frequency is one are extracted as content specific to the blog page, and feature words are extracted from the text data of the extracted blog articles. If the extracted blog articles contain little text data, few words can be extracted from them. Without a sufficient number of words, it may be impossible to determine which words are feature words, or impossible to determine them accurately. In such cases, raising the threshold relaxes the condition under which content is determined to be specific to the blog page, which increases the number of blog articles from which feature words are extracted. This makes it possible to extract feature words.

Specifically, the system control unit 20 of the blog server 1 initially sets the threshold to one occurrence and determines the content blocks specific to the blog page, thereby extracting the blog articles that appear once, and extracts feature words from them. If the system control unit 20 determines that no feature word can be extracted, it changes the threshold to two occurrences and again extracts blog articles and feature words. If it determines that feature words still cannot be extracted, it changes the threshold to three occurrences and repeats the extraction. The system control unit 20 continues this processing until feature words can be extracted. In other words, the threshold is raised when processing based on the extraction result of the specific content blocks could not be performed normally.

However, if the threshold is raised without limit, items that are not blog articles will also be extracted, so the processing is interrupted once the threshold has risen to a certain level. For example, when the threshold reaches the number of blog pages corresponding to the designated blogger, content blocks used in common on every blog page would be extracted, so the processing may be interrupted when the threshold equals the number of blog pages.
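The escalation described in this modification, starting at one occurrence, raising the threshold while no feature word can be extracted, and stopping when the threshold reaches the page count, might look like the sketch below. All names are hypothetical. In particular, `toy_feature_words` is a stand-in that treats any word occurring at least twice in the selected blocks as a feature word; the description does not specify the feature word extraction method at this point.

```python
from collections import Counter


def toy_feature_words(blocks, min_count=2):
    """Stand-in extractor (assumption): words seen at least min_count times."""
    words = Counter(w for b in blocks for w in b.lower().split())
    return sorted(w for w, c in words.items() if c >= min_count)


def extract_with_escalation(page_blocks, all_pages_blocks, extract_feature_words):
    """Raise the threshold from 1 until feature words are found, giving up
    when the threshold would reach the number of pages."""
    counts = Counter(b for blocks in all_pages_blocks for b in blocks)
    page_count = len(all_pages_blocks)
    threshold = 1
    while threshold < page_count:
        # Blocks at or below the current threshold count as specific.
        specific = [b for b in page_blocks if counts[b] <= threshold]
        words = extract_feature_words(specific)
        if words:
            return threshold, words
        threshold += 1  # relax the condition and try again
    return None, []     # interrupted: threshold reached the page count


pages = [
    ["header", "ramen tips", "ramen shop"],
    ["header", "ramen tips", "curry"],
    ["header", "travel"],
]
result = extract_with_escalation(pages[0], pages, toy_feature_words)
```

In this example the first pass (threshold 1) selects only "ramen shop", which yields no repeated word, so the threshold is raised to 2 and "ramen tips" is included as well. The shared "header" block is never selected because the loop stops before the threshold reaches the page count of 3.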

As another example, the system administrator may define in advance that a content block appearing only once per predetermined number of blog pages is a content block specific to a blog page. In this case, the number of occurrences used as the threshold may be varied in proportion to the number of blog pages corresponding to the designated blogger.
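One way to read the proportional rule above is shown below. This is only a sketch: the function name, the parameter `pages_per_occurrence`, the use of integer division, and the floor of one occurrence are all assumptions, since the description says only that the threshold varies in proportion to the page count.

```python
def proportional_threshold(page_count, pages_per_occurrence):
    """Scale the occurrence threshold with the number of blog pages.

    pages_per_occurrence is the administrator-defined number of pages per
    which a specific block is expected to appear once (hypothetical name).
    """
    if pages_per_occurrence <= 0:
        raise ValueError("pages_per_occurrence must be positive")
    # Integer division rounds down; never go below one occurrence.
    return max(1, page_count // pages_per_occurrence)


small_blog = proportional_threshold(40, 100)
large_blog = proportional_threshold(250, 100)
```

Under these assumptions, a blog with 40 pages keeps the threshold at one occurrence, while a blog with 250 pages uses a threshold of two.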

[6. Modification 3]
In the description so far, the number of occurrences (frequency count) is used as the appearance frequency for determining content specific to a blog page. Alternatively, the ratio of the number of occurrences to all content blocks of the blog pages corresponding to the designated blogger (relative frequency) may be used.

For example, suppose that other users can register comments on a blog article registered by a blogger, and that the comments can be viewed together with the blog article. The text data of such a comment is also one item of content constituting the blog page. When adding the text data of a comment to a blog page, the system control unit 20 of the blog server 1 first adds a blocking tag to the HTML document of the blog page and then adds the text data, so that the text data of the comment forms a content block independent of the text data of the blog article and of other comments. The system control unit 20 then extracts the text data of the comment as a content block and, if the extracted text data has specific content, inserts advertisement content related to that comment into the blog page.

When a plurality of comments are registered on a blog article, their contents may divide into content that appears frequently and content that appears infrequently, for example a majority opinion and a minority opinion. The majority opinion can be regarded as a general opinion whose content is not particularly distinctive. The minority opinion, on the other hand, is an unusual opinion and can be regarded as content specific to the blog page. In such a case, it is desirable to extract the comments expressing the minority opinion as content specific to the blog page.

However, the numbers of majority and minority opinions are relative and vary with the total number of comments. If the frequency count is used as the appearance frequency and the threshold is set to, for example, one occurrence, content that appears infrequently (the minority opinion) may not be extracted properly. The relative frequency is therefore used as the appearance frequency, and the threshold is set to a predetermined ratio. This threshold can be set arbitrarily. For example, when the content of the extracted content blocks can be divided into N patterns (N is an integer of 2 or more), the threshold may be set to a value less than 1/N in order to distinguish minority opinions. In this way, the system control unit 20 may change the threshold according to the circumstances at the time.
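The relative-frequency variant can be sketched as follows, assuming the comments have already been classified and each is represented by a pattern label. The function name `minority_patterns` and the default threshold of 0.5/N are assumptions; the description requires only that the threshold lie below 1/N.

```python
from collections import Counter


def minority_patterns(comment_labels, threshold=None):
    """Return the patterns whose relative frequency among the comments is
    equal to or less than the threshold (a ratio below 1/N by default)."""
    counts = Counter(comment_labels)
    total = len(comment_labels)
    n = len(counts)                 # number of patterns N
    if total == 0 or n < 2:
        return []                   # nothing to separate into majority/minority
    if threshold is None:
        threshold = 0.5 / n         # arbitrary choice below 1/N (assumption)
    return sorted(
        label for label, c in counts.items() if c / total <= threshold
    )


few = minority_patterns(["agree"] * 8 + ["disagree"] * 2)
many = minority_patterns(["agree"] * 80 + ["disagree"] * 20)
```

Both calls select the "disagree" pattern, although it occurs 2 times in one case and 20 times in the other. A fixed count threshold of one occurrence would select it in neither case, which is the situation this modification addresses.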

Systems in which comments and the like can be registered on articles, as with blogs, include, for example, Twitter (trademark), in which other users can register tweets that follow a tweet registered by a certain user, and electronic bulletin boards.

As described above, according to this embodiment, the system control unit 20 of the blog server 1 extracts the content constituting each blog page that is designated in turn by designating its HTML document, calculates the appearance frequency of each item of content constituting the designated blog page, and determines that, among the content constituting the designated blog page, content whose appearance frequency is equal to or less than a predetermined threshold is content specific to that blog page.

The lower the appearance frequency of content, the less that content appears outside the designated blog page. By determining whether the appearance frequency is equal to or less than the threshold, all content satisfying this condition is therefore identified as content specific to the designated blog page. Content specific to a blog page can thus be extracted easily.

In addition, the system control unit 20 of the blog server 1 inserts into the designated blog page advertisement content related to the content specific to that blog page.

Information related to the characteristics of the blog page can therefore be added to the Web page.

In addition, when the content constituting the designated blog page includes text data of a blog article, the system control unit 20 of the blog server 1 determines that the text data is content specific to that blog page, extracts feature words of the blog page from the text data of the blog article, and inserts into the blog page the advertisement content associated in advance with those feature words as keywords.

An advertisement related to the content of the blog posted on the blog page can therefore be added to that blog page.

In addition, the system control unit 20 of the blog server 1 calculates the appearance frequency of each item of content across a plurality of blog pages included in the blog service site.

Because the appearance frequency of each item of content constituting the designated blog page is calculated across a plurality of Web pages included in the blog service site (for example, the plurality of blog pages corresponding to the designated blogger's user ID), content used in common within the blog service site can be determined not to be specific content, which improves the accuracy of the determination.

In addition, the system control unit 20 of the blog server 1 extracts the content constituting a blog page in units of content blocks, each composed of one or more items of content, calculates the appearance frequency of each content block constituting the designated blog page, and determines that, among the content blocks constituting the designated blog page, a content block whose appearance frequency is equal to or less than the threshold is a content block specific to that blog page.

Therefore, when one or more items of content are displayed together as a content block on a blog page, as with a header portion, a navigation portion, a portion in which the blog is displayed, or a copyright notice portion, the content blocks specific to the blog page can be extracted.

In addition, the system control unit 20 of the blog server 1 extracts the content constituting a blog page based on the HTML document of that blog page, and defines the content blocks based on the DIV tags or TABLE tags in the HTML document.

The DIV tag therefore makes it possible to identify one or more items of content that were explicitly grouped into a block when the HTML document was created, and the TABLE tag makes it possible to identify one or more items of content that are grouped and displayed in tabular form. Consequently, when these tags separate content specific to the blog page from content that is not specific into different blocks, the accuracy of determining the content specific to the Web page can be improved.
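To make the role of the DIV and TABLE tags concrete, the sketch below groups the text of an HTML document into blocks using Python's standard `html.parser` module. It is a simplified stand-in for the tree search process, not a reproduction of it: it assumes well-formed nesting, collects text only, and assigns text to the innermost enclosing DIV or TABLE element. The names `BlockExtractor` and `extract_blocks` are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # tags treated as block boundaries


class BlockExtractor(HTMLParser):
    """Collect the text directly inside each DIV or TABLE element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one text buffer per currently open block element
        self.blocks = []   # finished blocks, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            text = " ".join(self.stack.pop())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        data = data.strip()
        if data and self.stack:
            # Text belongs to the innermost open block element.
            self.stack[-1].append(data)


def extract_blocks(html_doc):
    parser = BlockExtractor()
    parser.feed(html_doc)
    parser.close()
    return parser.blocks


sample = (
    "<html><body><div>Header</div>"
    "<div><div>Post one</div>Sidebar</div>"
    "<table><tr><td>Footer</td></tr></table></body></html>"
)
blocks = extract_blocks(sample)
```

Text outside any DIV or TABLE element is ignored in this sketch, and a nested block is reported before the block that contains it because blocks are recorded when their closing tag is reached.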

In the above embodiment, text data and image data are extracted as the content constituting a Web page, but the content to be extracted is not limited to these. For example, any content displayed on a Web page, or any content played back while a Web page is displayed (for example, moving image data, audio data, or an electronic document), may be extracted. Alternatively, only content of predetermined types may be extracted.

In the above embodiment, each appearance frequency is calculated by comparing the content block correspondence information for each content block constituting the designated blog page with the content block correspondence information for the content blocks constituting all blog pages corresponding to the designated blogger's user ID. That is, the appearance frequency of each content block constituting the designated blog page is calculated over a range covering all blog pages corresponding to the designated blogger. However, the range is not limited to this. For example, a predetermined number of blog pages may be targeted, or all blog pages constituting the blog service site may be targeted.

In the above embodiment, content enclosed in DIV tags and content enclosed in TABLE tags are grouped and extracted as content blocks, but the tags used to group content are not limited to these.

In the above embodiment, content specific to a Web page is extracted in units of content blocks, but each item of content may instead be extracted individually.

In the above embodiment, advertisement content showing an advertisement for a product or service is inserted into the Web page as the content related to the content specific to that Web page, but the related content is not limited to advertisement content. For example, image data (a still image or a moving image) related to content such as a blog article determined to be specific content may be inserted as a background image or an inserted image (such as an illustration). Specifically, for example, a database for image data is constructed, and image data and keywords are registered in the database in association with each other. The keywords associated with image data are words that describe the image represented by the image data or words related to that image. Feature words are then extracted from the content determined to be specific content, and related image data is selected from the database using the extracted feature words as keywords. The URL of the selected image data is then inserted as the background attribute of the BODY tag of the target HTML document, or an IMG tag that displays the selected image data is inserted at a predetermined position in the target HTML document. In this way, an image suited to the content, such as a blog article, determined to be specific content can be inserted into the Web page.
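The image selection and insertion described above can be sketched with an in-memory dictionary standing in for the image database. The function names, the example URL, and the decision to place the IMG tag just before the closing BODY tag are assumptions for illustration; the description says only that the tag is inserted at a predetermined position.

```python
import re
from html import escape


def select_image(feature_words, image_db):
    """Return the image URL registered for the first feature word that
    appears as a keyword in image_db, or None if there is no match."""
    for word in feature_words:
        if word in image_db:
            return image_db[word]
    return None


def insert_image(html_doc, image_url):
    """Insert an IMG tag for image_url just before the last </body> tag,
    or append it if the document has no </body> tag."""
    tag = f'<img src="{escape(image_url, quote=True)}" alt="">'
    matches = list(re.finditer(r"</body\s*>", html_doc, flags=re.IGNORECASE))
    idx = matches[-1].start() if matches else len(html_doc)
    return html_doc[:idx] + tag + html_doc[idx:]


# Hypothetical keyword-to-image table standing in for the image database.
image_db = {"ramen": "https://example.com/ramen.png"}

url = select_image(["travel", "ramen"], image_db)
page = insert_image("<html><body><p>x</p></body></html>", url)
```

Setting the background attribute of the BODY tag instead would follow the same selection step and differ only in how the document is modified.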

The uses of content specific to a Web page are not limited to inserting related content into the Web page. For example, new content may be generated based on the content specific to the Web page.

In the above embodiment, the specific content determination device of the present invention is applied to a server device. However, it may also be applied to a terminal device or the like, provided that the device can acquire HTML documents from storage means or over a network.

In the above embodiment, the document data of the present invention is applied to an HTML document. However, the document data may be any data that is described in a markup language and indicates the content constituting a Web page (for example, an XHTML (Extensible HyperText Markup Language) document).

In the above embodiment, the content constituting blog pages on a blog service site is extracted, but the types of target sites and pages are not limited to these.

DESCRIPTION OF SYMBOLS
1 Blog server
2 Management terminal
3 User terminal
11 Operation unit
12 Display unit
13 Communication unit
14 Drive unit
15 Storage unit
16 Input/output interface unit
17 CPU
18 ROM
19 RAM
20 System control unit
21 System bus
101 Blog page DB
102 Advertisement DB
NW Network
S Blog system

Claims

A specific content determination device comprising:
extraction means for extracting content constituting a designated Web page among a plurality of Web pages included in a predetermined site;
calculation means for counting the frequency with which each item of content constituting the designated Web page is used on other Web pages among the plurality of Web pages; and
determination means for determining that, among the content constituting the designated Web page, content whose frequency of use on other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.

The specific content determination device according to claim 1, wherein
the extraction means extracts the content constituting the Web page in units of content groups, each composed of one or more items of content,
the calculation means counts the frequency with which each content group constituting the designated Web page is used on other Web pages, and
the determination means determines that, among the content groups constituting the designated Web page, a content group whose frequency of use on other Web pages is equal to or less than a predetermined value is a content group specific to the designated Web page.

The specific content determination device according to claim 2, wherein
the extraction means extracts the content groups based on document data that is described in a predetermined markup language and indicates the content constituting a Web page.

The specific content determination device according to claim 3, wherein
the extraction means defines the content groups based on predetermined tags in the document data indicating the content.

The specific content determination device according to any one of claims 1 to 4, wherein
the extraction means extracts comments posted on an article from a Web page on which the posted article appears,
the device further comprises:
classification means for classifying each extracted comment according to the content indicated by the comment; and
setting means for setting a threshold for the appearance frequency, the setting means lowering the threshold as the number of contents into which the comments are classified increases,
the calculation means calculates the appearance frequency, within the Web page, of each of the contents into which the comments are classified, and
the determination means determines that content whose appearance frequency calculated by the calculation means is equal to or less than the set threshold is content specific to the Web page.

A specific content determination method comprising:
an extraction step of extracting content constituting a designated Web page among a plurality of Web pages included in a predetermined site;
a calculation step of counting the frequency with which each item of content constituting the designated Web page is used on other Web pages among the plurality of Web pages; and
a determination step of determining that, among the content constituting the designated Web page, content whose frequency of use on other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.

The specific content determination method according to claim 6, wherein
the extraction step extracts comments posted on an article from a Web page on which the posted article appears,
the method further comprises:
a classification step of classifying each extracted comment according to the content indicated by the comment; and
a setting step of setting a threshold for the appearance frequency, the setting step lowering the threshold as the number of contents into which the comments are classified increases,
the calculation step calculates the appearance frequency, within the Web page, of each of the contents into which the comments are classified, and
the determination step determines that content whose appearance frequency calculated in the calculation step is equal to or less than the set threshold is content specific to the Web page.

A specific content determination program causing a computer to function as:
extraction means for extracting content constituting a designated Web page among a plurality of Web pages included in a predetermined site;
calculation means for counting the frequency with which each item of content constituting the designated Web page is used on other Web pages among the plurality of Web pages; and
determination means for determining that, among the content constituting the designated Web page, content whose frequency of use on other Web pages is equal to or less than a predetermined value is content specific to the designated Web page.

The specific content determination program according to claim 8, wherein
the extraction means extracts comments posted on an article from a Web page on which the posted article appears,
the program further causes the computer to function as:
classification means for classifying each extracted comment according to the content indicated by the comment; and
setting means for setting a threshold for the appearance frequency, the setting means lowering the threshold as the number of contents into which the comments are classified increases,
the calculation means calculates the appearance frequency, within the Web page, of each of the contents into which the comments are classified, and
the determination means determines that content whose appearance frequency calculated by the calculation means is equal to or less than the set threshold is content specific to the Web page.

A related content insertion device comprising:
the specific content determination device according to any one of claims 1 to 5; and
insertion means for inserting, into the designated Web page, related content related to the content determined by the specific content determination device to be specific content.

The related content insertion device according to claim 10, wherein
when the content constituting the designated Web page includes text data of a posted article, the specific content determination device determines that the text data is content specific to the Web page,
the related content insertion device further comprises:
feature word extraction means for extracting feature words of the designated Web page from the text data of the article determined by the specific content determination device to be specific content; and
selection means for selecting, as the related content, content related to the extracted feature words from among a plurality of items of content stored in storage means in association with respective words, and
the insertion means inserts the selected related content into the designated Web page.