JP2009230663A

JP2009230663A - Apparatus for detecting abnormal condition in web page, program, and recording medium

Info

Publication number: JP2009230663A
Application number: JP2008078069A
Authority: JP
Inventors: Keisuke Takemori; 敬祐竹森; Akira Baba; 昭馬場
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-03-25
Filing date: 2008-03-25
Publication date: 2009-10-08

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus for detecting an abnormal condition in a web page, a program, and a recording medium for reducing burdens on monitoring the web page.

SOLUTION: A link information extraction part 11d extracts link information indicating a link to other web pages from web page information stored in a web page information storage part 12. A web page information acquisition part 11a accesses a web server indicated by the link information to acquire the web page information. A track back spam determination part 11f calculates feature quantity of the web page on the basis of the web page information acquired by the web page information acquisition part 11a, and detects whether the web page is in an abnormal condition or not on the basis of the feature quantity.

COPYRIGHT: (C)2010, JPO&INPIT

Description

本発明は、ウェブページの異常を検知するウェブページの異常検知装置に関する。また、本発明は、ウェブページの異常検知装置としてコンピュータを機能させるためのプログラム、およびこのプログラムを記録した記録媒体にも関する。 The present invention relates to a web page abnormality detection device that detects a web page abnormality. The present invention also relates to a program for causing a computer to function as a web page abnormality detection device, and a recording medium on which the program is recorded.

A diary published on the Web (hereinafter referred to as a blog) may offer a function that lets readers append the URL (Uniform Resource Locator) of a related blog (hereinafter referred to as a trackback) and a function that lets readers append comments. In recent years, attacks on blogs have become a problem: trackback spam, which appends unwanted URLs unrelated to the article, and comment spam, which appends unwanted comments.

The following five methods are generally used to defend against trackback spam and comment spam.
(1) Assuming attacks from overseas, reject trackbacks and comments that consist only of single-byte alphanumeric characters (see Non-Patent Document 1).
(2) Reject a trackback if the trackback source blog (the blog linked one-way from the blog to which the trackback was added) contains no link to the trackback destination blog (the blog to which the trackback was added) (see Non-Patent Document 1).
(3) Assuming attacks by tools that post comments automatically, perform pictogram authentication when a comment is posted (see Non-Patent Document 1).
(4) Reject trackbacks and comments from prohibited IP addresses and prohibited URLs (see Non-Patent Document 1 and Patent Documents 1 and 2).
(5) Reject trackbacks and comments that contain prohibited keywords (see Non-Patent Document 1 and Patent Documents 1 and 2).

Non-Patent Document 1: “au one net Internet Guide”, [online], [retrieved March 12, 2008], Internet <URL: http://www.auone-net.jp/netguide/feature/020/0200208.html>
Patent Documents 1 and 2: JP 2007-265368 A; JP 2007-115173 A

However, blog management is generally left to individual users, and trackback spam and comment spam never cease on the pages of users who have not applied the above settings. In particular, few users enable the pictogram authentication of (3). When an attack is launched using various PCs as stepping stones, the attack may not match the prohibited IP addresses or prohibited URLs of (4). Trackback spam and comment spam that use terms not matching the prohibited keywords of (5) cannot be blocked either. Furthermore, the present inventors' investigation found attackers who place a link to the trackback destination blog in the spam page, so some trackback spam also slips past the link check of (2).

こうしたスパムを完全に防御できないことを前提に、攻撃を受けたことをいち早く検知する必要がある。ブログサービスを提供する企業では、人の目でトラックバックスパムやコメントスパムが発生していないか検知する作業を余儀なくされており、その監視運用に莫大なコストを要しているという問題がある。 It is necessary to detect the attack as soon as possible, assuming that such spam cannot be completely prevented. Companies that provide blog services are forced to detect whether trackback spam or comment spam is generated by human eyes, and there is a problem that the monitoring operation requires enormous costs.

The present invention has been made in view of the above problems, and its object is to provide a web page abnormality detection device, a program, and a recording medium that can reduce the burden of web page monitoring.

The present invention has been made to solve the above problems and is a web page abnormality detection device comprising: information storage means for storing web page information (corresponding to the web page information storage unit 12 in FIG. 1); link information extraction means (corresponding to the link information extraction unit 11d in FIG. 1) for extracting, from the web page information stored in the information storage means, link information indicating a link to another web page; information acquisition means (corresponding to the communication unit 10 and the web page information acquisition unit 11a in FIG. 1) for connecting to the web server indicated by the link information and acquiring web page information; feature amount calculation means (corresponding to the trackback spam determination unit 11f in FIG. 1) for calculating a feature amount of the web page based on the web page information acquired by the information acquisition means; and abnormality detection means (corresponding to the trackback spam determination unit 11f in FIG. 1) for detecting the presence or absence of an abnormality in the web page based on the feature amount calculated by the feature amount calculation means.

Further, in the web page abnormality detection device of the present invention, the feature amount calculation means calculates the feature amount of the web page, based on the web page information acquired by the information acquisition means, using a plurality of conditions indicating features of the web page as criteria.

また、本発明のウェブページの異常検知装置において、前記特徴量算出手段は、ウェブページの特徴を示す複数の条件に対して重み付けを行って前記特徴量を算出することを特徴とする。 In the web page abnormality detection apparatus according to the present invention, the feature amount calculating means calculates the feature amount by weighting a plurality of conditions indicating features of the web page.

Further, the present invention is a web page abnormality detection device comprising: first information storage means for storing web page information (corresponding to the web page information storage unit 12 in FIG. 1); information extraction means (corresponding to the link information extraction unit 11d and the comment extraction unit 11e in FIG. 1) for extracting, from the web page information stored in the first information storage means, additional information added in the past; second information storage means (corresponding to the link information storage unit 13 and the comment storage unit 14 in FIG. 1) for storing the additional information in association with time information indicating the update time of the web page information; histogram generation means (corresponding to the trackback spam determination unit 11f and the comment spam determination unit 11g in FIG. 1) for generating, based on the additional information and the time information stored in the second information storage means, a histogram indicating the appearance frequency of the additional information at each time; and abnormality detection means (corresponding to the trackback spam determination unit 11f and the comment spam determination unit 11g in FIG. 1) for detecting the presence or absence of an abnormality in the web page based on the histogram.

Further, in the web page abnormality detection device of the present invention, the information extraction means (corresponding to the link information extraction unit 11d in FIG. 1) extracts, as the additional information, link information indicating a link to another web page from the web page information stored in the first information storage means.

Further, in the web page abnormality detection device of the present invention, the information extraction means (corresponding to the comment extraction unit 11e in FIG. 1) extracts, as the additional information, a comment added to a web page from the web page information stored in the first information storage means.

Further, the present invention is a web page abnormality detection device comprising: information storage means for storing web page information (corresponding to the web page information storage unit 12 in FIG. 1); comment extraction means (corresponding to the comment extraction unit 11e in FIG. 1) for extracting a comment added to a web page from the web page information stored in the information storage means; comparison means (corresponding to the comment spam determination unit 11g in FIG. 1) for comparing words contained in the comment extracted by the comment extraction means with words contained in the parts of the displayed web page other than the comment; and abnormality detection means (corresponding to the comment spam determination unit 11g in FIG. 1) for detecting the presence or absence of an abnormality in the web page based on the result of the comparison by the comparison means.

また、本発明は、上記のウェブページの異常検知装置としてコンピュータを機能させるためのプログラムである。 The present invention is also a program for causing a computer to function as the above-described web page abnormality detection device.

また、本発明は、上記のプログラムを記録したコンピュータ読み取り可能な記録媒体である。 The present invention is a computer-readable recording medium on which the above program is recorded.

In the above, the descriptions in parentheses merely associate, for convenience, the embodiment of the present invention described later with the components of the present invention; they do not limit the content of the present invention.

According to the present invention, the presence or absence of an abnormality in a web page is detected based on a feature amount of the web page, based on a histogram indicating the appearance frequency of additional information at each time, or based on the result of determining whether words contained in a comment added to the web page match words contained in the parts of the page other than that comment. Human visual monitoring of web pages thereby becomes unnecessary, so the burden of web page monitoring can be reduced.

以下、図面を参照し、本発明の実施形態を説明する。図１は、本発明の一実施形態によるウェブ監視装置（本発明のウェブページの異常検知装置に対応）の構成を示している。図１において、ウェブ監視装置１は、ブログのウェブページを管理しているウェブサーバ２と、ネットワーク３を介して接続されている。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 shows the configuration of a web monitoring device (corresponding to the web page abnormality detection device of the present invention) according to an embodiment of the present invention. In FIG. 1, a web monitoring device 1 is connected to a web server 2 that manages a web page of a blog via a network 3.

In the web monitoring device 1, the communication unit 10 communicates with the web server 2 via the network 3. The monitoring processing unit 11 periodically acquires web page information from the web server 2 and executes a monitoring process that detects abnormalities in web pages. Web page information is the information contained in the various files needed to display a web page; in the present embodiment, it is assumed to be the information contained in text-based files with extensions such as “.html”, “.htm”, and “.txt”. Specifically, the web page information consists of the tags of a web page description language (HTML or the like) and the various information (text, URLs, and so on) contained in those tags.

ウェブページ情報記憶部１２は、ウェブサーバ２から取得されたウェブページ情報を記憶する。また、ウェブページ情報記憶部１２は、異なる２つの時点で取得された２つのウェブページ情報の差分を示す差分情報も記憶する。リンク情報記憶部１３は、トラックバック元のウェブページへのリンクを示すリンク情報とトラックバック先のウェブページの更新日時を示す時刻情報とを関連付けて記憶する。コメント記憶部１４は、ウェブページに追記されたコメントとウェブページの更新日時を示す時刻情報とを関連付けて記憶する。係数記憶部１５は、後述する重み付け処理に用いる係数を記憶する。 The web page information storage unit 12 stores web page information acquired from the web server 2. The web page information storage unit 12 also stores difference information indicating a difference between two web page information acquired at two different times. The link information storage unit 13 stores link information indicating a link to the trackback source web page and time information indicating the update date and time of the trackback destination web page in association with each other. The comment storage unit 14 stores a comment added to the web page and time information indicating the update date and time of the web page in association with each other. The coefficient storage unit 15 stores coefficients used for weighting processing described later.

監視処理部１１において、ウェブページ情報取得部１１ａは、例えばgetコマンドによる処理を実行し、通信部１０による通信処理を介してウェブサーバ２にアクセスし、ウェブページ情報をウェブサーバ２から取得する。ページ変化検出部１１ｂは、ウェブページ情報取得部１１ａによって取得されたウェブページ情報の変化の有無を検出する。ウェブページ情報のハッシュ値の変化を検出することによって、ウェブページ情報の変化が検出される。 In the monitoring processing unit 11, the web page information acquisition unit 11 a executes, for example, processing by a get command, accesses the web server 2 through communication processing by the communication unit 10, and acquires web page information from the web server 2. The page change detection unit 11b detects whether there is a change in the web page information acquired by the web page information acquisition unit 11a. A change in web page information is detected by detecting a change in the hash value of the web page information.

差分抽出部１１ｃは、ページ変化検出部１１ｂによってウェブページ情報の変化が検出された場合に、変化前と変化後のウェブページ情報から、差分の情報を抽出する。抽出された情報は差分情報としてウェブページ情報記憶部１２に格納される。リンク情報抽出部１１ｄは、差分情報から、トラックバック元のウェブページへのリンクを示すリンク情報を抽出する。コメント抽出部１１ｅは、差分情報から、ウェブページに表示されるコメントを抽出する。 The difference extraction unit 11c extracts difference information from the web page information before and after the change when the change in the web page information is detected by the page change detection unit 11b. The extracted information is stored in the web page information storage unit 12 as difference information. The link information extraction unit 11d extracts link information indicating a link to the trackback source web page from the difference information. The comment extraction unit 11e extracts comments displayed on the web page from the difference information.

The trackback spam determination unit 11f determines whether a trackback added to a web page is trackback spam. The comment spam determination unit 11g determines whether a comment added to a web page is comment spam. When trackback spam or comment spam is detected, the alarm processing unit 11h generates alarm information (for example, an e-mail carrying a warning) that alerts the administrator of the web server 2.

Next, the operation of the web monitoring device 1 according to the present embodiment will be described. FIG. 2 shows the operation flow of the web monitoring device 1. After the process starts, the web page information acquisition unit 11a, in order to acquire web page information from the monitored blog, accesses the web server 2 through communication processing by the communication unit 10 and acquires a file containing the web page information. The web page information acquisition unit 11a stores the acquired web page information in the web page information storage unit 12 (step S100).

Subsequently, the web page information acquisition unit 11a determines whether the web page information acquired in step S100 has been acquired for the first time (step S101). If no web page information for the same web page has been acquired before, the process ends. If web page information for the same web page has been acquired before, the process proceeds to step S102.

If web page information for the same web page has been acquired before, the page change detection unit 11b reads from the web page information storage unit 12 the previously acquired web page information that corresponds to the newly acquired web page information, and calculates a hash value for each (step S102). The page change detection unit 11b compares the hash value of the newly acquired web page information with the hash value of the previously acquired web page information (step S103) and, based on the comparison result, detects whether the web page information has changed (step S104). The calculated hash value may also be stored in one of the storage units and used for the next hash comparison.

ハッシュ値を比較した結果、２つのハッシュ値が同じ値であった場合には、ウェブページ情報は変化していない。この場合には、処理が終了する。また、２つのハッシュ値が異なる値であった場合には、ウェブページ情報が変化している。この場合には、差分抽出部１１ｃは、前回取得したウェブページ情報と、今回取得したウェブページ情報との差分である差分情報を抽出し、ウェブページ情報記憶部１２に格納する（ステップＳ１０５）。 As a result of comparing the hash values, if the two hash values are the same, the web page information has not changed. In this case, the process ends. When the two hash values are different, the web page information has changed. In this case, the difference extraction unit 11c extracts difference information that is a difference between the web page information acquired last time and the web page information acquired this time, and stores the difference information in the web page information storage unit 12 (step S105).
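The hash comparison of steps S102 to S104 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not name a hash algorithm, so SHA-256 is an assumed choice, and the function names `page_hash` and `has_changed` are hypothetical.

```python
import hashlib


def page_hash(page_text: str) -> str:
    """Return a hex digest of the page text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def has_changed(previous_text: str, current_text: str) -> bool:
    """Steps S103-S104: the page changed if the two hash values differ."""
    return page_hash(previous_text) != page_hash(current_text)
```

Storing the previous digest instead of the previous text, as the description permits, would avoid rehashing the old page on every cycle.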

差分情報の抽出は、diffコマンドの実行によって行われる。diffコマンドは、２つのファイルのテキストを比較して、異なるテキストの部分を抽出する処理を実行するコマンドである。diffコマンドにより、変化前と変化後の両方のテキストの部分が抽出されるが、本実施形態では、変化後のテキストの部分が差分情報としてウェブページ情報記憶部１２に格納される。 The extraction of difference information is performed by executing a diff command. The diff command is a command for executing processing for comparing texts of two files and extracting portions of different texts. The text portion before and after the change is extracted by the diff command. In this embodiment, the text portion after the change is stored in the web page information storage unit 12 as difference information.
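As a rough sketch of step S105, the following uses Python's standard `difflib` module in place of the diff command named in the text and keeps only the text present after the change. The function name `added_lines` is hypothetical, and line-level comparison is an assumption about the granularity.

```python
import difflib


def added_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines that appear only in the new version of the page."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks lines unique to the second sequence with the prefix "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]
```

Discarding the "- " lines mirrors the embodiment's choice to store only the post-change text as difference information.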

ステップＳ１０５に続いて、リンク情報抽出部１１ｄはウェブページ情報記憶部１２から差分情報を読み出し、差分情報からリンク情報を抽出する。具体的には、リンク情報抽出部１１ｄは、差分情報に含まれるウェブページ記述言語（ＨＴＭＬ等）のタグの中からトラックバック用の所定のタグを抽出し、さらにそのタグに含まれるＵＲＬを抽出し、そのＵＲＬをリンク情報とする（ステップＳ１０６）。 Following step S105, the link information extraction unit 11d reads the difference information from the web page information storage unit 12, and extracts the link information from the difference information. Specifically, the link information extraction unit 11d extracts a predetermined tag for trackback from tags of a web page description language (HTML or the like) included in the difference information, and further extracts a URL included in the tag. The URL is used as link information (step S106).
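The link extraction of step S106 might look like the sketch below. The patent only says that a predetermined trackback tag is located and its URL extracted; the markup assumed here (an anchor element carrying a `trackback` class) and the class name `TrackbackLinkParser` are hypothetical, since real blog platforms mark trackbacks in different ways.

```python
from html.parser import HTMLParser


class TrackbackLinkParser(HTMLParser):
    """Collect href values of anchors marked with an assumed trackback class."""

    def __init__(self, marker_class: str = "trackback"):
        super().__init__()
        self.marker_class = marker_class  # hypothetical marker, not from the patent
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href")
        if tag == "a" and self.marker_class in classes and href:
            self.urls.append(href)


def extract_trackback_urls(diff_html: str) -> list[str]:
    """Return the trackback source URLs found in the difference information."""
    parser = TrackbackLinkParser()
    parser.feed(diff_html)
    return parser.urls
```

An empty result corresponds to the "no new link information" branch of step S107.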

続いて、リンク情報抽出部１１ｄは、ウェブページ情報に新しいリンク情報が追加されたか否かを判定する（ステップＳ１０７）。ステップＳ１０６において、差分情報からリンク情報を抽出できた場合には、ウェブページ情報に新しいリンク情報が追加されたことになる。この場合には、処理がステップＳ１０８に進む。また、ステップＳ１０６において、差分情報からリンク情報を抽出できなかった場合には、ウェブページ情報に新しいリンク情報が追加されていないことになる。この場合には、処理がステップＳ１０９に進む。 Subsequently, the link information extraction unit 11d determines whether new link information has been added to the web page information (step S107). In step S106, if link information can be extracted from the difference information, new link information is added to the web page information. In this case, the process proceeds to step S108. In step S106, if link information cannot be extracted from the difference information, new link information is not added to the web page information. In this case, the process proceeds to step S109.

If it is determined that new link information has been added to the web page information, the trackback spam determination unit 11f determines whether the trackback added to the web page is trackback spam (step S108). Details of step S108 are described later. Subsequently, the comment extraction unit 11e reads the difference information from the web page information storage unit 12 and extracts a comment from it. Specifically, the comment extraction unit 11e extracts a predetermined comment tag from the tags contained in the difference information, extracts the text contained in that tag, and treats that text as the comment (step S109).

続いて、コメント抽出部１１ｅは、ウェブページ情報に新しいコメントが追加されたか否かを判定する（ステップＳ１１０）。ステップＳ１０９において、差分情報からコメントを抽出できた場合には、ウェブページ情報に新しいコメントが追加されたことになる。この場合には、処理がステップＳ１１１に進む。また、ステップＳ１０９において、差分情報からコメントを抽出できなかった場合には、ウェブページ情報に新しいコメントが追加されていないことになる。この場合には、処理が終了する。 Subsequently, the comment extraction unit 11e determines whether a new comment has been added to the web page information (step S110). In step S109, if a comment can be extracted from the difference information, a new comment is added to the web page information. In this case, the process proceeds to step S111. In step S109, when a comment cannot be extracted from the difference information, a new comment is not added to the web page information. In this case, the process ends.

ウェブページ情報に新しいコメントが追加されたと判定された場合、コメントスパム判定部１１ｇは、ウェブページに追記されたコメントがコメントスパムによるものであるか否かを判定する（ステップＳ１１１）。ステップＳ１１１の詳細は後述する。続いて、アラーム処理部１１ｈはアラーム情報を生成し、通信部１０へ出力する。通信部１０は、ネットワーク３を介してアラーム情報をウェブサーバ２へ送信する（ステップＳ１１２）。 If it is determined that a new comment has been added to the web page information, the comment spam determination unit 11g determines whether or not the comment added to the web page is due to comment spam (step S111). Details of step S111 will be described later. Subsequently, the alarm processing unit 11 h generates alarm information and outputs it to the communication unit 10. The communication unit 10 transmits alarm information to the web server 2 via the network 3 (step S112).

The web pages of the blog are monitored by repeating the above process periodically. By repeatedly extracting links to other web pages on the same blog from the web page information and acquiring the web page information of each link destination, every web page of a blog made up of many pages can be inspected, which prevents abnormalities from going undetected. Further, by repeating the above processing automatically, monitoring can run 24 hours a day, 365 days a year.
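One way to find the other pages of the same blog is sketched below. The patent does not define how "the same blog" is recognized; treating links on the same host as belonging to the same blog is an assumption made here, and `same_blog_links` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor element."""

    def __init__(self):
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_blog_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that stay on the page's host."""
    collector = _HrefCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links
```

Feeding each returned URL back into the acquisition step, while remembering the URLs already visited, would cover a multi-page blog.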

Next, details of the trackback spam determination in step S108 will be described. First, a first operation example is described. In an attack launched via a server infected with a computer virus called a bot, the attacker can typically use the bot for only a short time and therefore often writes the same content, at about the same time, to the web pages of multiple blogs managed by a web server. The first operation example exploits this property: when the same trackback is added to multiple web pages in the same period, the trackback is determined to be trackback spam.

トラックバックスパム判定部１１ｆは、ステップＳ１００で取得されたウェブページ情報から時刻情報を抽出し、ステップＳ１０６で抽出されたリンク情報と関連付けてリンク情報記憶部１３に格納する。本実施形態では、ウェブページが更新された時刻（最終更新時刻）を、トラックバックがウェブページに追加された時刻とみなす。 The trackback spam determination unit 11f extracts time information from the web page information acquired in step S100, and stores it in the link information storage unit 13 in association with the link information extracted in step S106. In the present embodiment, the time when the web page is updated (last update time) is regarded as the time when the trackback is added to the web page.

図３（ａ）は、リンク情報記憶部１３に格納されるリンク情報および時刻情報の内容を示している。図３（ａ）に示すように、リンク情報が示すＵＲＬ毎に時刻情報が関連付けられている。複数のブログを対象として、図２に示した処理をブログ毎に実行することにより、複数のブログから同じトラックバック元のＵＲＬが検出されることがある。このため、図２に示した処理を繰り返し実行すると、複数の時刻情報と関連付けられるＵＲＬが出現することになる。 FIG. 3A shows the contents of link information and time information stored in the link information storage unit 13. As shown in FIG. 3A, time information is associated with each URL indicated by the link information. By executing the process shown in FIG. 2 for each blog for a plurality of blogs, the same trackback source URL may be detected from the plurality of blogs. For this reason, when the process shown in FIG. 2 is repeatedly executed, URLs associated with a plurality of pieces of time information appear.
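The association between each trackback source URL and its time information can be modeled with a small in-memory structure. This is only an illustrative sketch: the patent does not specify a storage format for the link information storage unit 13, and the class name `LinkInfoStore` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


class LinkInfoStore:
    """Map each trackback source URL to the update times at which it was seen."""

    def __init__(self):
        self._times: dict[str, list[datetime]] = defaultdict(list)

    def record(self, url: str, updated_at: datetime) -> None:
        """Associate one more update time with the URL."""
        self._times[url].append(updated_at)

    def times_for(self, url: str) -> list[datetime]:
        """Return the recorded times for the URL in chronological order."""
        return sorted(self._times.get(url, []))
```

A URL that accumulates many entries here is one that was detected on many blogs, which is what the histogram step inspects.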

続いて、トラックバックスパム判定部１１ｆは、リンク情報記憶部１３からリンク情報および時刻情報を読み出し、時刻毎の同一リンクの出現頻度を示すヒストグラムを生成する。図３（ｂ）はヒストグラムの一例を示している。このヒストグラムから、同じＵＲＬをトラックバック元のＵＲＬとするトラックバックが、いつ、どれだけ検出されたのかが分かる。 Subsequently, the trackback spam determination unit 11f reads the link information and time information from the link information storage unit 13, and generates a histogram indicating the appearance frequency of the same link for each time. FIG. 3B shows an example of a histogram. This histogram shows when and how many trackbacks with the same URL as the trackback source URL are detected.

前述したように、トラックバックスパムによってトラックバックがブログに追記される場合、同時期に同一のトラックバックが複数のブログに追記されるため、ヒストグラムの頻度が高くなる。トラックバックスパム判定部１１ｆは、所定の区間３００を設定し、区間３００内のヒストグラムの頻度を合計した値と所定の閾値とを比較する。頻度の合計値が閾値以上であった場合には、トラックバックスパム判定部１１ｆは、トラックバックスパムによるトラックバックの追記が発生したと判定する。また、頻度の合計値が閾値未満であった場合には、トラックバックスパム判定部１１ｆは、トラックバックスパムによるトラックバックの追記は発生していないと判定する。 As described above, when a trackback is added to a blog due to trackback spam, since the same trackback is added to a plurality of blogs at the same time, the frequency of the histogram increases. The trackback spam determination unit 11f sets a predetermined section 300, and compares a value obtained by summing the frequencies of histograms in the section 300 with a predetermined threshold. If the total value of the frequencies is equal to or greater than the threshold, the trackback spam determination unit 11f determines that additional trackback due to trackback spam has occurred. If the total frequency is less than the threshold value, the trackback spam determination unit 11f determines that no additional trackback has occurred due to trackback spam.

トラックバックスパム判定部１１ｆは、区間３００を時間方向にずらしながら上記の処理を繰り返し実行する。その結果、トラックバックスパムによるトラックバックの追記が発生したと判定された区間３００が少なくとも１つ存在した場合には、トラックバックスパムによるトラックバック先のウェブページの異常が検知されたことになる。また、トラックバックスパムによるトラックバックの追記が発生したと判定された区間３００が１つも存在しなかった場合には、正規のトラックバックの追記が行われていることになる。 The trackback spam determination unit 11f repeatedly executes the above processing while shifting the section 300 in the time direction. As a result, if there is at least one section 300 in which it is determined that additional trackback due to trackback spam has occurred, an abnormality in the web page of the trackback destination due to trackback spam has been detected. In addition, when there is no section 300 that is determined to have added trackback due to trackback spam, the regular trackback is added.
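The sliding-window test of the first operation example can be sketched as follows. The one-hour bin width, the window length, and the threshold are all assumed parameters (the patent leaves them as "predetermined"), and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_histogram(timestamps: list[datetime]) -> Counter:
    """Count occurrences of one URL per hour (bin width is an assumption)."""
    return Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)


def detect_burst(timestamps: list[datetime], window_hours: int, threshold: int) -> bool:
    """Slide a window over the histogram; flag spam if any window total >= threshold."""
    hist = hourly_histogram(timestamps)
    if not hist:
        return False
    start, end = min(hist), max(hist)
    t = start
    while t <= end:
        total = sum(hist.get(t + timedelta(hours=h), 0) for h in range(window_hours))
        if total >= threshold:
            return True  # at least one section exceeded the threshold
        t += timedelta(hours=1)  # shift the section in the time direction
    return False
```

Here `timestamps` holds the update times recorded for a single trackback source URL; a `False` result corresponds to the case in which no section reaches the threshold.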

Next, a second operation example of the trackback spam determination will be described. When a legitimate trackback is added, the trackback source web page is usually a blog web page. When a trackback is added by trackback spam, however, the trackback source web page is in most cases a commercial web page unrelated to blogs, so the page structure differs between the trackback destination web page and the trackback source web page. The second operation example exploits this property: features of the trackback source web page are detected, and if features differing from those of a blog web page are found, the trackback is determined to be trackback spam.

The features of a blog web page are as follows.
(a) Few images are displayed on the web page (there are few links to image files).
(b) On Japanese blog web pages, the language encoding is usually Japanese. In addition, the trackback destination web page and the trackback source web page usually have the same language encoding.
(c) A date and time is usually displayed on the web page.
(d) Keywords such as "日記 (blog)", "トラックバック (Trackback)", and "コメント (Comment)" are usually displayed on the web page.

From these features, the following can be cited as features of the trackback source web page of a malicious trackback.
(A) N or more images (N: an integer of 1 or more) are displayed on the web page.
(B) The language encoding is for a language other than Japanese.
(C) No date and time is displayed on the web page.
(D) Keywords such as "日記 (blog)", "トラックバック (Trackback)", and "コメント (Comment)" are not displayed on the web page.

In the second operation example, the web page information of the trackback source is newly acquired. Specifically, the web page information acquisition unit 11a accesses the web server indicated by the link information extracted in step S106, via communication processing by the communication unit 10, and acquires a file containing the web page information of the trackback source. The web page information acquisition unit 11a stores the acquired web page information in the web page information storage unit 12.

The trackback spam determination unit 11f reads the web page information of the trackback source from the web page information storage unit 12 and, based on conditions (A) to (D) above, calculates a feature amount representing the features of the web page according to equation (1). In equation (1), the subscript i corresponds to conditions (A) to (D): i = 0 corresponds to condition (A), i = 1 to condition (B), i = 2 to condition (C), and i = 3 to condition (D). C_i is a value indicating whether the feature of each condition is present: C_i = 1 if the web page satisfies the condition, and C_i = 0 if it does not. k_i is a coefficient indicating the degree of weighting applied to C_i. The values of k_i are stored in the coefficient storage unit 15.

For condition (A), the trackback spam determination unit 11f extracts, from the web page information, the information indicating links to image files and determines the value of C_i from the number of such links. If the number of links is N or more, C_i = 1; if it is less than N, C_i = 0. For condition (B), the trackback spam determination unit 11f determines the value of C_i according to whether the web page information contains the tag "charset=euc-jp". If the tag is not present, C_i = 1; if it is present, C_i = 0.

For condition (C), the trackback spam determination unit 11f determines the value of C_i according to whether the web page information contains a tag related to the display of a date and time. If such a tag is not present, C_i = 1; if it is present, C_i = 0. For condition (D), the trackback spam determination unit 11f determines the value of C_i according to whether the web page information contains text representing the specific keywords. If the text is not present, C_i = 1; if it is present, C_i = 0.
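As a rough illustration of how the four values C_0 to C_3 might be derived from the HTML of a trackback source page, consider the sketch below. Everything beyond the "charset=euc-jp" check is an assumption of this example: counting `<img>` tags stands in for "links to image files", a simple date pattern stands in for "a tag related to the display of a date and time" (the text does not say which tag is meant), and the keyword list is illustrative.

```python
import re

# Assumed stand-ins for the checks described in the text.
IMG_RE = re.compile(r"<img\b", re.IGNORECASE)
DATE_RE = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}")
KEYWORDS = ("blog", "trackback", "comment", "日記", "トラックバック", "コメント")


def condition_flags(html, n_images):
    """Return [C_0, C_1, C_2, C_3] for conditions (A) to (D)."""
    lower = html.lower()
    # (A) N or more images on the page.
    c0 = 1 if len(IMG_RE.findall(html)) >= n_images else 0
    # (B) the "charset=euc-jp" tag is absent.
    c1 = 0 if "charset=euc-jp" in lower else 1
    # (C) no date is displayed (hypothetical date pattern).
    c2 = 0 if DATE_RE.search(html) else 1
    # (D) none of the blog keywords is displayed.
    c3 = 0 if any(k in lower for k in KEYWORDS) else 1
    return [c0, c1, c2, c3]
```

A production version would parse the HTML properly and would likely accept other Japanese encodings as well; the substring checks here only mirror the level of detail given in the description.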

The values of the coefficients k_i are calculated in advance as follows. For condition (A), among the trackback source web pages of trackbacks, those containing links to N or more image files are examined to determine whether each is a legitimate web page or a malicious web page, and the numbers of legitimate and malicious web pages are counted. Based on the result of this examination, the value of the coefficient k_i is calculated according to equation (2).

For condition (B), among the trackback source web pages of trackbacks, those that do not contain the tag "charset=euc-jp" are examined to determine whether each is a legitimate web page or a malicious web page, and the numbers of legitimate and malicious web pages are counted. Based on the result of this examination, the value of the coefficient k_i is calculated according to equation (2).

For condition (C), among the trackback source web pages of trackbacks, those on which no date and time is displayed are examined to determine whether each is a legitimate web page or a malicious web page, and the numbers of legitimate and malicious web pages are counted. Based on the result of this examination, the value of the coefficient k_i is calculated according to equation (2).

For condition (D), among the trackback source web pages of trackbacks, those on which the specific keywords are not displayed are examined to determine whether each is a legitimate web page or a malicious web page, and the numbers of legitimate and malicious web pages are counted. Based on the result of this examination, the value of the coefficient k_i is calculated according to equation (2).
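Equation (2) is not reproduced in this text, so its exact form is unknown. Purely as an assumption for illustration, the sketch below takes k_i to be the fraction of malicious pages among the examined pages that satisfy condition i; the function name and this ratio are hypothetical and should not be read as the formula of the patent.

```python
def weight_from_counts(legitimate, malicious):
    """Hypothetical weight: share of malicious pages among the pages
    that satisfy a condition (assumed form, not the source's equation (2))."""
    total = legitimate + malicious
    if total == 0:
        # No examined pages: the condition carries no evidence.
        return 0.0
    return malicious / total
```

Under this assumption, a condition met by 3 malicious pages and 1 legitimate page receives a weight of 0.75, so conditions that separate spam from legitimate pages more cleanly contribute more to the feature amount.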

Conditions other than (A) to (D) above may also be used. For example, as described above, an attack by trackback spam has the property that the same trackback is added to multiple web pages at around the same time. Using this property, the condition may be that the same URL has been added to N or more web pages by trackback.

When this condition is used, the trackback spam determination unit 11f generates the histogram described above and determines the value of C_i by comparing the sum of the histogram frequencies within a predetermined section with a predetermined threshold. If the sum is equal to or greater than the threshold, C_i = 1; if it is less than the threshold, C_i = 0.

For the value of the coefficient k_i, among the trackback source web pages of trackbacks, those whose trackbacks were added to N or more web pages at around the same time are examined to determine whether each is a legitimate web page or a malicious web page, and the numbers of legitimate and malicious web pages are counted. Based on the result of this examination, the value of the coefficient k_i is calculated according to equation (2).

The feature amount of equation (1) is calculated as described above. The trackback spam determination unit 11f compares this feature amount with a predetermined threshold. If the feature amount is equal to or greater than the threshold, the trackback spam determination unit 11f determines that a trackback has been added by trackback spam. If the feature amount is less than the threshold, it determines that no trackback has been added by trackback spam.
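The threshold decision can be sketched as below. Equation (1) is not reproduced in this text; the weighted sum of k_i and C_i used here is an assumption that is consistent with the definitions of C_i and k_i and with the weighting mentioned in claim 3. The function names and the example weights are hypothetical.

```python
def feature_amount(flags, weights):
    """Assumed form of equation (1): sum of k_i * C_i over all conditions."""
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    return sum(k * c for k, c in zip(weights, flags))


def is_spam(flags, weights, threshold):
    """Spam is reported when the feature amount reaches the threshold."""
    return feature_amount(flags, weights) >= threshold
```

For example, with weights `[0.5, 0.25, 0.125, 0.125]` a page satisfying conditions (A), (C), and (D) has a feature amount of 0.75 and is reported as spam for a threshold of 0.7, while a page satisfying none of the conditions is not.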

The determination result may be reflected in an existing prohibited URL list. That is, a URL added by a trackback that was determined to be trackback spam may be added to the prohibited URL list. This keeps the prohibited URL list up to date.

Next, the comment spam determination in step S111 is described in detail, beginning with a first operation example. As described above, an attack that uses a bot-infected server as a stepping stone assumes that the bot can be used only for a short time, and therefore often writes the same content at around the same time to the web pages of multiple blogs managed by a web server. The first operation example uses this property: when the same comment is added to multiple web pages at around the same time, the comment is determined to be comment spam.

The processing flow of the first operation example is the same as that of the first operation example of the trackback spam determination described above. The trackback spam determination unit 11f extracts time information from the web page information acquired in step S100 and stores it in the comment storage unit 14 in association with the comment extracted in step S109. The trackback spam determination unit 11f then reads the comment information and the time information from the comment storage unit 14 and generates a histogram indicating the appearance frequency of the same comment at each time. The subsequent processing is as described above.

Next, a second operation example of the comment spam determination is described. When a legitimate comment is added, the gist of the comment is related to the gist of the blog content. When a comment is added by comment spam, however, it is often unrelated to the blog content. In the second operation example, therefore, whether a comment is comment spam is determined by comparing the words contained in the comment with the words contained in the parts of the blog other than the comments.

Specifically, the comment spam determination unit 11g reads the web page information acquired in step S100 from the web page information storage unit 12 and extracts, from the text displayed on the web page, the text other than comments. The comment spam determination unit 11g then compares the words contained in the comment extracted in step S109 with the words contained in the text other than comments. In this comparison, only keywords registered in a word dictionary prepared in advance may be compared.

If, as a result of the comparison, no word in the comment matches any word in the text other than comments, the comment spam determination unit 11g determines that a comment has been added by comment spam. If a word in the comment matches any word in the text other than comments, the comment spam determination unit 11g determines that no comment has been added by comment spam.

Alternatively, it may be determined that a comment has been added by comment spam when the number of matches between the words in the comment and the words in the text other than comments is less than a predetermined threshold, and that no comment has been added by comment spam when the number of matches is equal to or greater than the threshold.
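The word comparison of the second operation example, including the threshold variant and the optional keyword dictionary, can be sketched as follows. The tokenizer is an assumption: splitting on `\w+` works for space-delimited text only, and Japanese text would need a morphological analyzer, which the description does not specify. All names here are hypothetical.

```python
import re

WORD_RE = re.compile(r"\w+")


def words(text, dictionary=None):
    """Lower-cased word set of the text, optionally restricted to a
    prepared keyword dictionary (a set of lower-case words)."""
    found = {w.lower() for w in WORD_RE.findall(text)}
    return found & dictionary if dictionary is not None else found


def is_comment_spam(comment, body, min_matches=1, dictionary=None):
    """Report spam when fewer than min_matches words are shared between
    the comment and the non-comment text of the page."""
    matches = words(comment, dictionary) & words(body, dictionary)
    return len(matches) < min_matches
```

With the default `min_matches=1` this reproduces the basic rule (no shared word means spam); a larger value gives the threshold variant. Restricting the comparison to a dictionary keeps common function words from producing accidental matches.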

In addition to the two operation examples above, the tendency of comment spam to include a URL in the comment may be used: whether a comment is comment spam may be determined by checking whether the comment contains a URL.
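A URL check of this kind could be as simple as the sketch below. The pattern, which matches only `http://` and `https://` URLs, is an assumption of this example; the description does not define what counts as a URL.

```python
import re

# Assumed pattern: only explicit http(s) URLs are recognised.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def contains_url(comment):
    """True if the comment text contains an http(s) URL."""
    return URL_RE.search(comment) is not None
```

Since legitimate comments also contain URLs at times, such a check is probably better used as one more condition in a weighted score than as a verdict on its own.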

As described above, according to the present embodiment, abnormalities of web pages are detected automatically, so web pages no longer need to be monitored by eye, and the burden of monitoring web pages is reduced. A company that manages web servers for blogs can therefore reduce the personnel cost of the abnormality detection it previously performed manually.

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like that do not depart from the gist of the present invention. For example, a program for realizing the operations and functions of the web monitoring device described above may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read and executed by a computer.

Here, the "computer" also includes a homepage providing environment (or display environment) if a WWW system is used. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer. The "computer-readable recording medium" further includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program described above may be transmitted from a computer that stores it in a storage device or the like to another computer via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer.

A block diagram showing the configuration of a web monitoring device according to an embodiment of the present invention.
A flowchart showing the operation procedure of the web monitoring device according to an embodiment of the present invention.
A reference diagram for explaining the trackback determination method in an embodiment of the present invention.

Explanation of symbols

1 ... web monitoring device, 2 ... web server, 3 ... network, 10 ... communication unit, 11 ... monitoring processing unit, 11a ... web page information acquisition unit, 11b ... page change detection unit, 11c ... difference extraction unit, 11d ... link information extraction unit, 11e ... comment extraction unit, 11f ... trackback spam determination unit, 11g ... comment spam determination unit, 11h ... alarm processing unit, 12 ... web page information storage unit, 13 ... link information storage unit, 14 ... comment storage unit, 15 ... coefficient storage unit

Claims

1. A web page abnormality detection device comprising:
information storage means for storing web page information;
link information extraction means for extracting link information indicating a link to another web page from the web page information stored in the information storage means;
information acquisition means for connecting to a web server indicated by the link information and acquiring web page information;
feature amount calculation means for calculating a feature amount of a web page based on the web page information acquired by the information acquisition means; and
abnormality detection means for detecting the presence or absence of an abnormality of the web page based on the feature amount calculated by the feature amount calculation means.

2. The web page abnormality detection device according to claim 1, wherein the feature amount calculation means calculates the feature amount of the web page from the web page information acquired by the information acquisition means, based on a plurality of conditions indicating features of the web page.

3. The web page abnormality detection device according to claim 2, wherein the feature amount calculation means calculates the feature amount by weighting the plurality of conditions indicating features of the web page.

4. A web page abnormality detection device comprising:
first information storage means for storing web page information;
information extraction means for extracting additional information added in the past from the web page information stored in the first information storage means;
second information storage means for storing the additional information in association with time information indicating the update time of the web page information;
histogram generation means for generating a histogram indicating the appearance frequency of the additional information at each time, based on the additional information and the time information stored in the second information storage means; and
abnormality detection means for detecting the presence or absence of an abnormality of the web page based on the histogram.

5. The web page abnormality detection device according to claim 4, wherein the information extraction means extracts, as the additional information, link information indicating a link to another web page from the web page information stored in the first information storage means.

6. The web page abnormality detection device according to claim 4, wherein the information extraction means extracts, as the additional information, a comment added to the web page from the web page information stored in the first information storage means.

7. A web page abnormality detection device comprising:
information storage means for storing web page information;
comment extraction means for extracting a comment added to a web page from the web page information stored in the information storage means;
comparison means for comparing the words contained in the comment extracted by the comment extraction means with the words contained in the parts of the web page other than the comment; and
abnormality detection means for detecting the presence or absence of an abnormality of the web page based on the result of the comparison by the comparison means.

8. A program for causing a computer to function as the web page abnormality detection device according to any one of claims 1 to 7.

9. A computer-readable recording medium on which the program according to claim 8 is recorded.