JP2010231508A

JP2010231508A - Device, method and program for determining significance

Info

Publication number: JP2010231508A
Application number: JP2009078383A
Authority: JP
Inventors: Masanori Hara; 正憲原; Akira Yamada; 山田　　明; Masaru Miyake; 優三宅
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-03-27
Filing date: 2009-03-27
Publication date: 2010-10-14
Anticipated expiration: 2029-03-27
Also published as: JP5216654B2

Abstract

PROBLEM TO BE SOLVED: To provide a device that determines the significance of a target article without needing to collect data in advance.

SOLUTION: A determination server 10 for determining the significance of article data displayed on a Web page includes: a URL extraction part 12 for extracting link data and article data included in the Web page; a quoted file acquisition part 13 for acquiring the file at the link destination pointed to by the extracted link data; and a quotation analyzing part 14 for determining the significance of the article data to be low when at least a portion of the article data is included in the acquired file.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an importance determination device, an importance determination method, and a program for determining the importance of article data displayed on a Web page.

Conventionally, many Web pages exist on the Internet as venues for publishing information to the general public. Besides corporate advertising pages, many of these pages are created with so-called blog services, which let individual users write diaries and similar content.

Because a blog service lets users create and edit Web pages (blogs), recent years have seen many spam blogs (splogs) created for purposes other than publishing information. A typical example is a page that quotes text verbatim from other Web pages in order to increase its access count. Such splogs have no original articles and are of low importance. They therefore degrade the accuracy of search services, and when generated in large numbers they strain the resources of the servers that provide the blog service.

Methods for detecting such spam blogs have therefore been proposed. For example, Non-Patent Document 1 describes extracting in advance blog articles that contain specific keywords and investigating their spam rate. Non-Patent Document 2 describes determining that, among collected documents, a document whose proportion of copied content is equal to or greater than a threshold is a splog.

Non-Patent Document 1: "Collecting and analyzing spam blogs using keyword characteristics", 22nd Annual Conference of the Japanese Society for Artificial Intelligence, 2008
Non-Patent Document 2: "Current Status and Countermeasures of Japanese Splog", IEICE Tokyo Branch Student Conference, 2007

However, the method of Non-Patent Document 1 requires keywords to be selected in advance and cannot detect splogs that do not contain those keywords. The method of Non-Patent Document 2 requires a large number of blogs to be prepared in advance. With more than one million posts per day at present, sampling a sufficient quantity of them is not realistic. A method is therefore desired that can easily detect low-importance articles such as splogs without collecting data in advance.

An object of the present invention is to provide an importance determination device, an importance determination method, and a program that can determine the importance of a target article without requiring prior data collection.

The present invention provides the following solutions.

(1) An importance determination device that determines the importance of article data displayed on a Web page, comprising:
extraction means for extracting link data and article data included in the Web page;
acquisition means for acquiring the link destination file indicated by the link data extracted by the extraction means; and
determination means for determining the importance of the article data to be low when the file acquired by the acquisition means contains at least a part of the article data.

With this configuration, the importance determination device extracts the link data described in a Web page (blog), specifically a URL (Uniform Resource Locator), and acquires the file at the link destination of that URL. If the acquired file contains an article from the Web page, the Web page can be judged to be quoting the linked file, so the device determines the importance of that article to be low.

Accordingly, by determining the importance of the articles displayed on a Web page, the importance determination device can determine that a Web page displaying low-importance articles is a splog. Because the device makes this determination by referring only to the Web page under determination and the data at the URL link destinations, it requires no prior data collection and can determine importance easily.

(2) The importance determination device according to (1), wherein the extraction means extracts the article data in the vicinity of the link data by dividing it at predetermined character strings.

With this configuration, the importance determination device makes its determination using the article data in the vicinity of the link data (URL), so it can efficiently extract articles that are likely to have been quoted. In addition, because the article data is divided at predetermined character strings such as tags, line breaks, punctuation marks, and "...", the presence or absence of quotation can be determined for each small divided unit. As a result, the degree of quotation in the article as a whole can be determined easily.

(3) The importance determination device according to (2), wherein the determination means determines the importance of the article data based on the amount of the article data that is included in the file acquired by the acquisition means.

With this configuration, the importance determination device determines importance based on the amount of the Web page (blog) article data that is included in the linked file. That is, the larger the quoted amount, the lower the importance assigned to the article data, so splogs may be detected accurately on the basis of importance.

(4) The importance determination device according to (2), wherein the determination means determines the importance of the Web page based on the proportion of the article data that is included in the file acquired by the acquisition means.

With this configuration, the importance determination device determines importance based on the proportion of the Web page (blog) article data that is included in the linked file. That is, the larger the quoted proportion, the lower the importance assigned to the article data, so splogs may be detected accurately on the basis of importance.

(5) The importance determination device according to any one of (1) to (4), wherein the determination means determines the importance of the article data based on the distance between the position at which the link data is described in the Web page and the position at which the article data is described.

With this configuration, the importance determination device determines importance based on the distance between the link data (URL) and the article data. The closer an article is to the link data, the more strongly it is considered to be related to that link data and the more likely it is to have been quoted. The importance determination device can determine the importance of such likely-quoted articles to be low.

(6) The importance determination device according to any one of (1) to (5), wherein the determination means determines the importance of the article data in a predetermined area of the Web page based on the determination results for each of a plurality of pieces of link data included in that predetermined area.

With this configuration, the importance determination device can determine, based on the determination results for a plurality of pieces of link data, the importance of a predetermined area of a Web page (blog), for example the blog articles posted in a predetermined period or the blog as a whole. This keeps a page from being judged low in importance as a whole, including its other parts, merely because it contains a local quotation, and so improves the accuracy of splog detection.

(7) The importance determination device according to any one of (1) to (6), further comprising receiving means for receiving update information of a Web page,
wherein the receiving means receives, based on the update information, the article data whose importance is to be determined.

With this configuration, the importance determination device receives information indicating that a Web page has been updated, so it can receive the article data of newly created or updated Web pages (blogs). It can therefore efficiently determine whether Web pages that have not yet been examined are splogs.

(8) The importance determination device according to (7), wherein the determination means determines the importance of the article data for Web pages updated in a predetermined time period, based on the update information received by the receiving means.

With this configuration, the importance determination device performs splog determination on Web pages (blogs) updated in a predetermined time period. It can therefore select Web pages that are likely to have been updated automatically, such as Web pages updated late at night or Web pages updated at a fixed interval. As a result, the importance determination device may be able to detect splogs efficiently.

(9) An importance determination method by which a computer determines the importance of article data displayed on a Web page, comprising:
an extraction step of extracting link data and article data included in the Web page;
an acquisition step of acquiring the link destination file indicated by the link data extracted in the extraction step; and
a determination step of determining the importance of the article data to be low when the file acquired in the acquisition step contains at least a part of the article data.

With this configuration, executing the method can be expected to provide the same effect as (1).

(10) A program that causes a computer to determine the importance of article data displayed on a Web page, the program causing the computer to execute:
an extraction step of extracting link data and article data included in the Web page;
an acquisition step of acquiring the link destination file indicated by the link data extracted in the extraction step; and
a determination step of determining the importance of the article data to be low when the file acquired in the acquisition step contains at least a part of the article data.

With this configuration, causing a computer to execute the program can be expected to provide the same effect as (1).

According to the present invention, the importance of a target article can be determined without requiring prior data collection.

FIG. 1 is a diagram showing the overall configuration of a system including a determination server and related elements according to an embodiment of the present invention.
FIG. 2 is a diagram showing the hardware configuration of the determination server according to the embodiment of the present invention.
FIG. 3 is a diagram showing the functional configuration of the determination server according to the embodiment of the present invention.
FIG. 4 is a diagram showing a splog determination table according to the embodiment of the present invention.
FIG. 5 is a diagram showing an example of a Web page determined to be a splog by the determination server according to the embodiment of the present invention.
FIG. 6 is a flowchart showing processing in the control unit of the determination server according to the embodiment of the present invention.

An example of an embodiment of the present invention is described below with reference to the drawings. The target of importance determination is assumed to be blog article data. In this embodiment, whether the blog is a splog is determined according to the importance.

[Overall system configuration]
FIG. 1 is a diagram showing the overall configuration of a system including a determination server 10 (importance determination device) according to the present embodiment and related elements. The determination server 10, Web servers 20 and 21, and a user terminal 30 are connected via a predetermined network such as the Internet.

The user terminal 30 receives a Web page (blog) from the Web server 20 and displays it in response to an instruction input, such as a selection from the search results of a predetermined search service. This Web page is assumed to contain a URL pointing to a file stored on a different Web server 21 and to quote (copy) that file. The link destination may also be within the same Web server 20.

The determination server 10 determines the importance of the blog article displayed on the user terminal 30 and determines whether the blog is a spam blog. To do so, the determination server 10 receives the cited file from the Web server 21 and determines the importance based on the degree of match with the blog article (the processing is described in detail later).

[Hardware configuration]
FIG. 2 is a diagram showing the hardware configuration of the determination server 10 according to the present embodiment. The determination server 10 includes a control unit 110, a storage unit 120, an input unit 130, a display unit 140, and a communication unit 150, and these hardware components are connected via a bus 160.

The control unit 110 controls the determination server 10 as a whole. By reading and executing the various programs stored in the storage unit 120 as needed, it cooperates with the hardware described above to realize the various functions according to the present invention. The control unit 110 may be a CPU (Central Processing Unit).

The storage unit 120 stores various programs for causing the hardware group to function as the determination server 10, programs for causing the control unit 110 to execute the functions of the present invention, databases, and the like. The storage unit 120 may be any of various storage devices such as a hard disk, an optical disk drive, or a semiconductor memory.

The input unit 130 is an interface device that accepts instruction input to the determination server 10 from a user (the administrator of the determination server 10). The input unit 130 consists of, for example, a keyboard and a mouse.

The display unit 140 displays screens that accept data input from the user (the administrator of the determination server 10) and screens showing the results of processing by the determination server 10. The display unit 140 may be a display device such as a cathode ray tube (CRT) display or a liquid crystal display (LCD).

The communication unit 150 is a network adapter through which the determination server 10 exchanges information with the Web servers 20 and 21, other information terminals, and the like via the network 40 (FIG. 1).

The term "computer" in the present invention refers to an information processing apparatus that includes a control device, a storage device, and the like. The determination server 10 is an information processing apparatus that includes the control unit 110, the storage unit 120, and the like, and it falls within the concept of a computer in the present invention.

[Functional configuration]
FIG. 3 is a diagram showing the functional configuration of the determination server 10 according to the present embodiment. The control unit 110 of the determination server 10 includes a blog receiving unit 11 (receiving means), a URL extraction unit 12 (extraction means), a citation file acquisition unit 13 (acquisition means), a citation analysis unit 14 (determination means), and a splog determination unit 15. The storage unit 120 includes a blog DB 16.

The blog receiving unit 11 receives from the Web server 20 the page data (HTML file) of a blog that is the target of splog determination. Here, the blog receiving unit 11 receives blogs that have been newly created or updated. That is, the blog receiving unit 11 receives the target blog in response to receiving blog update information distributed by RSS or the like.
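As an illustration of how update information might be turned into a list of pages to fetch, the following is a minimal sketch that assumes the update information is an RSS 2.0 feed whose items carry `link` and `pubDate` elements. The function name `updated_entries` and the sample feed are hypothetical and do not come from the source.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def updated_entries(rss_xml: str) -> list[tuple[str, Optional[datetime]]]:
    """Return (page URL, update time) pairs for every item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        if link:
            # pubDate uses the RFC 822 date format; it may be absent.
            updated = parsedate_to_datetime(pub_date) if pub_date else None
            entries.append((link.strip(), updated))
    return entries


# Hypothetical feed used only to exercise the function.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example blog</title>
<item><title>Entry 1</title><link>http://blog.example.com/entry1</link>
<pubDate>Fri, 27 Mar 2009 02:30:00 +0900</pubDate></item>
<item><title>Entry 2</title><link>http://blog.example.com/entry2</link></item>
</channel></rss>"""

for url, updated in updated_entries(SAMPLE_RSS):
    print(url, updated)
```

Feeds in other formats (for example RSS 1.0 or Atom, which use XML namespaces) would need different element lookups.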

The URL extraction unit 12 extracts URLs as link data from the blog page data received by the blog receiving unit 11. Specifically, it extracts each character string that starts with "http" and runs up to a double quotation mark ("), a ">" character, or a line break. This lets the URL extraction unit 12 also extract URLs that are not tagged as actual link items. URLs whose file extension is not one such as ".html" or ".htm" may be excluded.
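The extraction rule described above can be sketched with a regular expression. This is only an illustrative reading of the rule, not the implementation in the source; the names `URL_PATTERN`, `extract_urls`, and the `html_only` switch are hypothetical.

```python
import re

# "http" followed by any run of characters that is not a double quote,
# ">", or a line break, mirroring the rule described in the text.
URL_PATTERN = re.compile(r'http[^">\r\n]*')

# Optional filter from the text: keep only URLs that look like HTML files.
HTML_EXTENSIONS = (".html", ".htm")


def extract_urls(page_data: str, html_only: bool = False) -> list[str]:
    """Return URL strings found in raw page data, in document order."""
    urls = URL_PATTERN.findall(page_data)
    if html_only:
        urls = [u for u in urls if u.lower().endswith(HTML_EXTENSIONS)]
    return urls


sample = '<a href="http://example.com/a.html">news</a>\nsee http://example.org/b\n'
print(extract_urls(sample))
print(extract_urls(sample, html_only=True))
```

Because the rule stops only at a quotation mark, ">", or a line break, an untagged URL followed by more text on the same line would be captured together with that text; a practical implementation would probably also stop at whitespace.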

The URL extraction unit 12 further extracts the article data in the vicinity of each extracted URL. Specifically, it takes a predetermined amount of article data before and after the URL and divides it at predetermined character strings such as tags, line breaks, punctuation marks, and "...". The URL extraction unit 12 stores the extracted URLs and article data in the blog DB 16.
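One way to picture this splitting step is the sketch below. It assumes a fixed character window around the URL and a particular delimiter set (tags, line breaks, "...", and the Japanese punctuation marks "。" and "、"); the window size, the delimiter set, and the name `split_nearby_text` are assumptions made for illustration.

```python
import re

# Delimiters named in the text: tags, line breaks, punctuation marks and "...".
# The exact punctuation set is an assumption made for this sketch.
BASE_DELIMITERS = r'<[^>]*>|\r?\n|\.{3}|．{3}|[。、]'


def split_nearby_text(page_data: str, url: str, window: int = 200) -> list[tuple[str, int]]:
    """Split the text within `window` characters of `url` into fragments.

    Returns (fragment, distance) pairs, where distance is the number of
    characters between the fragment and the URL.
    """
    pos = page_data.find(url)
    if pos < 0:
        return []
    url_end = pos + len(url)
    start = max(0, pos - window)
    end = min(len(page_data), url_end + window)

    # Treat the URL itself as a delimiter so it never ends up inside a fragment.
    delimiter = re.compile(BASE_DELIMITERS + "|" + re.escape(url))
    spans = [m.span() for m in delimiter.finditer(page_data, start, end)]
    spans.append((end, end))  # sentinel so the trailing text is also emitted

    fragments = []
    cursor = start
    for d_start, d_end in spans:
        text = page_data[cursor:d_start].strip()
        if text:
            # Character gap between this fragment and the URL.
            distance = pos - d_start if d_start <= pos else max(0, cursor - url_end)
            fragments.append((text, distance))
        cursor = d_end
    return fragments


page = ('<p>Original intro。</p>\n<a href="http://example.com/a.html">Title</a>'
        '<br>Copied sentence one。Copied sentence two。')
for fragment, distance in split_nearby_text(page, "http://example.com/a.html"):
    print(distance, fragment)
```

Measuring distance in characters is only one possible choice; counting fragments or lines between the URL and the text would fit the description equally well.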

FIG. 4 is a diagram showing the splog determination table stored in the blog DB 16 according to the present embodiment. The splog determination table stores the extracted URLs and the article data in the vicinity of each URL, together with the update date and time of the target blog. It also stores the distance between each piece of article data and the URL, and a citation determination value (described later) that indicates how low the importance is.
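For orientation, one row of such a table could be modelled as the record below. The source lists only what is stored, so the class name, field names, and types are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SplogTableRow:
    """One row of the splog determination table (field names are hypothetical)."""
    updated_at: datetime         # update date and time of the target blog
    url: str                     # URL extracted from the page
    fragment: str                # article data found near the URL
    distance: int                # distance between the article data and the URL
    citation_value: float = 0.0  # larger means lower importance; set after analysis


row = SplogTableRow(datetime(2009, 3, 27, 2, 30),
                    "http://example.com/a.html", "Copied sentence", 15)
print(row)
```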

The citation file acquisition unit 13 acquires from the Web server 21 the cited file at the link destination indicated by each URL stored in the splog determination table, that is, each URL described in the blog under splog determination.

The citation analysis unit 14 compares the cited file acquired by the citation file acquisition unit 13 with the article data stored in the splog determination table, and if the cited file contains a portion that matches the article data, it judges that the article data has been quoted (copied). It also takes the distance between the URL and the article data into account, setting a larger citation determination value the shorter the distance, and stores the value in the splog determination table. A larger citation determination value indicates that the article data is less important and that the Web page is more likely to be a splog.
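The source does not give a formula for the citation determination value, only that it grows as the distance shrinks. The sketch below uses an exact substring match and an inverse-distance weighting chosen purely for illustration; the function name `citation_value` and the `scale` parameter are assumptions.

```python
def citation_value(fragment: str, linked_text: str, distance: int,
                   scale: float = 100.0) -> float:
    """Return a citation determination value for one fragment.

    0.0 means the fragment was not found in the linked file. Otherwise the
    value lies in (0, 1] and is larger the closer the fragment is to the URL.
    """
    if not fragment or fragment not in linked_text:
        return 0.0
    # Hypothetical weighting: 1.0 at distance 0, decaying with distance.
    return 1.0 / (1.0 + distance / scale)


linked = "Title\nCopied sentence one。Copied sentence two。"
print(citation_value("Copied sentence one", linked, distance=15))
print(citation_value("My own comment", linked, distance=15))
```

A tolerant comparison (for example ignoring whitespace differences) may be preferable to exact substring matching in practice, but that goes beyond what the text states.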

The splog determination unit 15 performs splog determination by statistically processing the importance of the article data determined by the citation analysis unit 14, that is, the citation determination values in the splog determination table. Specifically, it calculates a citation degree from, for example, the sum or average of the citation determination values, or the amount or proportion of article data whose citation determination value is at or above a predetermined value, and determines that the blog is a splog when this degree is equal to or greater than a predetermined threshold.
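The statistical processing described here can be sketched as follows. The three methods correspond to the sum, the average, and the proportion mentioned in the text; the function names and the default thresholds are hypothetical values chosen for the example.

```python
from statistics import mean


def citation_degree(values: list[float], method: str = "mean",
                    per_item_threshold: float = 0.5) -> float:
    """Aggregate citation determination values into one citation degree."""
    if not values:
        return 0.0
    if method == "sum":
        return float(sum(values))
    if method == "mean":
        return float(mean(values))
    if method == "ratio":
        # Proportion of fragments whose value is at or above a given level.
        return sum(v >= per_item_threshold for v in values) / len(values)
    raise ValueError(f"unknown method: {method}")


def is_splog(values: list[float], threshold: float = 0.5,
             method: str = "mean") -> bool:
    """Judge the page a splog when the citation degree reaches the threshold."""
    return citation_degree(values, method) >= threshold


print(is_splog([0.9, 0.8, 0.0]))   # mostly quoted fragments
print(is_splog([0.1, 0.0, 0.2]))   # mostly original fragments
```

Which statistic and threshold work well would have to be tuned on real data; the source leaves both open.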

FIG. 5 is a diagram showing an example of a Web page that is determined to be a splog by the determination server 10 according to the present embodiment.

On the blog page 50, article data 52 is arranged together with link data 51 containing a URL. The link data 51 is a hyperlink to another Web page 60. The Web page 60 at the link destination indicated by the URL contains a title character string 61 identical to the link data 51 and article data 62 whose text is identical to the article data 52.

Because the character strings in the vicinity of the link data 51 are contained in the Web page 60, the determination server 10 determines that the article data in this area is of low importance. If the article data near the link data in the other areas is likewise found to exist on other Web pages, the importance of the blog page 50 as a whole becomes low. As a result, the determination server 10 determines that the blog page 50 is a splog.

The article data whose importance is determined is not limited to text. It may be, for example, a moving image 53, a still image, or audio data. When the same data exists at the link destination, the determination server 10 treats the data as quoted (copied) and sets its importance low.

[Processing flow]
FIG. 6 is a flowchart showing the processing performed by the control unit 110 of the determination server 10 according to the present embodiment.

In step S1, the control unit 110 acquires the page data of the blog to be subjected to splog determination, based on Web page update information obtained via RSS or the like.

In step S2, the control unit 110 extracts the URL descriptions from the page data acquired in step S1.

In step S3, the control unit 110 extracts the article data near the URLs extracted in step S2. The extracted URLs and article data are stored in the splog determination table (FIG. 4) in the storage unit 120.

In step S4, the control unit 110 acquires the cited file indicated by each URL extracted in step S2.

In step S5, the control unit 110 analyzes whether the cited file acquired in step S4 contains a portion that matches the article data extracted in step S3. As the analysis result, it sets a citation determination value for the article data and stores it in the splog determination table (FIG. 4) in the storage unit 120. The control unit 110 then calculates the overall citation degree by statistical processing, based on the citation determination values for the plurality of URLs included in the determination area.

In step S6, the control unit 110 determines whether the citation degree calculated in step S5 is equal to or greater than a predetermined threshold. If the determination is YES, the process proceeds to step S7; if NO, it proceeds to step S8.

In step S7, the control unit 110 judges that the citation degree of the articles is high and the importance of the page is low, and determines that the blog acquired in step S1 is a splog.

In step S8, the control unit 110 judges that the citation degree of the articles is low and the importance of the page is high, and determines that the blog acquired in step S1 is not a splog.
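Steps S2 to S8 can be condensed into a single function for illustration. The sketch below is a simplification under stated assumptions: it treats the whole page as the determination area, ignores distance weighting, uses the proportion of quoted fragments as the citation degree, and takes the page fetcher as a parameter so that it runs without network access. The names `judge_blog` and `fetch` are hypothetical.

```python
import re
from typing import Callable

URL_RE = re.compile(r'http[^">\r\n]*')           # S2: URL extraction rule
SPLIT_RE = re.compile(r'<[^>]*>|\r?\n|。|、')     # S3: assumed delimiter set


def judge_blog(page_data: str, fetch: Callable[[str], str],
               threshold: float = 0.5) -> bool:
    """Return True if the page is judged to be a splog (steps S2 to S8)."""
    # S2: extract URL descriptions from the page data.
    urls = URL_RE.findall(page_data)
    # S3: extract article fragments (here: the whole page, split on delimiters).
    fragments = [f.strip() for f in SPLIT_RE.split(page_data) if f.strip()]
    fragments = [f for f in fragments if not f.startswith("http")]
    if not urls or not fragments:
        return False
    # S4: acquire the file that each URL points to.
    linked = [fetch(u) for u in urls]
    # S5: a fragment counts as quoted if any linked file contains it.
    quoted = sum(any(f in text for text in linked) for f in fragments)
    degree = quoted / len(fragments)
    # S6 to S8: compare the citation degree with the threshold.
    return degree >= threshold


# Hypothetical in-memory "Web server" used instead of real HTTP requests.
files = {"http://src.example/a.html": "Title\nCopied one。Copied two。"}
page = ('<a href="http://src.example/a.html">Title</a>\n'
        'Copied one。Copied two。My own comment')
print(judge_blog(page, lambda u: files.get(u, "")))
```

Passing the fetcher in as a function keeps the judgment logic separate from network access, which also makes the threshold easy to experiment with on stored pages.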

As described above, this embodiment determines the importance of a blog by analyzing, based on the URLs described in its articles, the degree to which the articles match the link destinations. As a result, low-importance splogs can be detected. Because no data needs to be collected in advance, whether a blog is a splog can be determined easily.

This embodiment can detect splogs that quote other articles. That is, in addition to blogs whose content has been copied in order to attract accesses, it can accurately detect, for example, blogs that merely copy product descriptions for the purpose of affiliate income.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment.

In the above-described embodiment, importance is determined for newly created or updated blogs, but the present invention is not limited to this. For example, an instruction input from an administrator may be accepted, and the importance may be determined for the specified Web page or for a specified area within a page.

Various indicators of importance may also be combined to determine the importance comprehensively. For example, a Web page updated in a predetermined time slot (for example, late at night) or a Web page updated at a fixed period (at a fixed time) is likely to be generated and updated automatically rather than by hand. Such Web pages may be given priority as targets of importance determination, or their importance may be weighted lower in the determination.
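
The update-time indicators described above could be checked along the following lines. This is a rough sketch under assumptions: the function names, the late-night window of 00:00 to 05:00, the one-minute tolerance, and the minimum of three updates are all hypothetical choices that the text does not specify.

```python
from datetime import datetime, timedelta


def updated_late_at_night(timestamps, start_hour=0, end_hour=5):
    """True if every update falls inside the assumed late-night window."""
    return bool(timestamps) and all(
        start_hour <= t.hour < end_hour for t in timestamps
    )


def updated_at_fixed_period(timestamps, tolerance=timedelta(minutes=1)):
    """True if consecutive updates are spaced by a nearly constant interval."""
    if len(timestamps) < 3:
        return False  # too few updates to infer a period
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return max(gaps) - min(gaps) <= tolerance


def looks_auto_generated(timestamps):
    """Flag a page whose update history matches either indicator."""
    return updated_late_at_night(timestamps) or updated_at_fixed_period(timestamps)
```

A page flagged by `looks_auto_generated` could then be queued first for importance determination, or have its importance weighted lower, as the paragraph above suggests.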

Although the determination server 10 has been described in the above embodiment, the configuration of the importance determination device of the present invention is not limited to this. The functions of the determination server 10 may be distributed across a plurality of servers. The determination server 10 may also be integrated with another server such as the Web server 20.

10 Determination server (importance determination device)
11 Blog receiving unit (receiving means)
12 URL extraction unit (extraction means)
13 Citation file acquisition unit (acquisition means)
14 Citation analysis unit (determination means)
15 Splog determination unit
16 Blog DB

Claims

1. An importance determination device for determining the importance of article data displayed on a Web page, comprising:
extraction means for extracting link data and article data included in the Web page;
acquisition means for acquiring a link destination file indicated by the link data extracted by the extraction means; and
determination means for determining the importance of the article data to be low when the file acquired by the acquisition means includes at least a part of the article data.

2. The importance determination device according to claim 1, wherein the extraction means divides the article data at a predetermined character string and extracts the article data in the vicinity of the link data.

3. The importance determination device according to claim 2, wherein the determination means determines the importance of the article data based on the amount of the article data contained in the file acquired by the acquisition means.

4. The importance determination device according to claim 2, wherein the determination means determines the importance of the Web page based on the ratio of the article data contained in the file acquired by the acquisition means.

5. The importance determination device according to any one of claims 1 to 4, wherein the determination means determines the importance of the article data based on the distance between the position where the link data is written in the Web page and the position where the article data is written.

6. The importance determination device according to any one of claims 1 to 5, wherein the determination means determines the importance of the article data in a predetermined area of the Web page based on the determination results for each of a plurality of link data included in the predetermined area.

7. The importance determination device according to claim 1, further comprising receiving means for receiving update information of the Web page,
wherein the receiving means receives, based on the update information, the article data whose importance is to be determined.

8. The importance determination device, wherein the determination means determines the importance of the article data for a Web page updated in a predetermined time slot, based on the update information received by the receiving means.

9. An importance determination method in which a computer determines the importance of article data displayed on a Web page, the method comprising:
an extraction step of extracting link data and article data included in the Web page;
an acquisition step of acquiring a link destination file indicated by the link data extracted in the extraction step; and
a determination step of determining the importance of the article data to be low when the file acquired in the acquisition step includes at least a part of the article data.

10. A program for causing a computer to determine the importance of article data displayed on a Web page, the program causing the computer to execute:
an extraction step of extracting link data and article data included in the Web page;
an acquisition step of acquiring a link destination file indicated by the link data extracted in the extraction step; and
a determination step of determining the importance of the article data to be low when the file acquired in the acquisition step includes at least a part of the article data.