JP5113206B2

JP5113206B2 - Spam blog determination apparatus and method

Info

Publication number: JP5113206B2
Application number: JP2010064447A
Authority: JP
Inventors: 千鶴 ▲高▼澤
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-03-19
Filing date: 2010-03-19
Publication date: 2013-01-09
Anticipated expiration: 2030-03-19
Also published as: JP2011198065A

Description

The present invention relates to a spam blog determination apparatus and method.

Blogs known as "spam blogs," which are automatically generated and posted in order to lead readers to a particular site, have long existed. Spam blogs are also called "splogs." Spam blogs can cause a variety of problems on the Internet. For a blog service provider, for example, spam blogs put a heavy load on servers and network lines and may disrupt the service. Companies also need to know how their products and information rank in search results, but spam blogs may distort those rankings. Spam blogs may also prevent users from reaching the information they want through a search service.

As a countermeasure against such spam blogs, methods have been proposed that identify spam blogs based on information such as terms that frequently appear in spam blogs and patterns characteristic of spam blogs (see, for example, paragraph [0005] of Patent Document 1). What that paragraph describes is a so-called spam filter, which uses machine learning to exclude items having particular characteristics.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2006-331297 (JP 2006-331297 A)

Some spam blogs contain currently popular terms. Popular terms rise and fall quickly and replace one another in rapid succession, and it is difficult to predict how long a term will remain popular once it has caught on. The technique disclosed in Patent Document 1 cannot cope adequately with this situation: an administrator has to change the program logic each time in order to identify spam blogs, which is burdensome work for the administrator of a blog service provider.

An object of the present invention is to provide a spam blog determination apparatus and method that determine spam blogs while making the administrator's work easier.

The present inventors found that it is possible to provide an environment in which popular terms are registered so as to become determination targets and spam blogs are determined in step with how long each term remains popular, and thereby completed the present invention. Specifically, the present invention provides the following.

(1) A spam blog determination apparatus comprising: predetermined keyword storage control means for storing a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation for the predetermined keyword; machine learning means for determining by machine learning, in response to receiving a blog article to be determined, whether the blog article is a spam blog, using the predetermined keyword stored in the predetermined keyword storage means as a feature; spam determination result output means for outputting, among the blog articles determined by the machine learning means, each blog article that contains the predetermined keyword stored in the predetermined keyword storage means in association with the result of the machine learning determination of whether it is a spam blog; adjustment return means for deleting the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation for the predetermined keyword; correctness determination information storage means for storing, in response to receiving correctness determination information, which is information indicating whether the result of the machine learning determination is correct, for a blog article for which the spam determination result output means has output the result of the machine learning determination, the correctness determination information in association with the blog article; and first trend end determination means for calculating, for each predetermined keyword, an error rate of the machine learning determination based on the correctness determination information stored by the correctness determination information storage means during a predetermined period, and determining that the trend has ended for any predetermined keyword whose error rate is equal to or greater than a predetermined threshold.

With this configuration, the spam blog determination apparatus, upon receiving a registration designation that makes a predetermined keyword (for example, one that goes in and out of fashion) a processing target, determines spam blogs containing that keyword and outputs the results. The apparatus can therefore determine spam blogs containing a predetermined keyword merely by having the administrator register the keyword. Furthermore, upon receiving, from an administrator who has reviewed the results output by the spam determination result output means, a deletion designation that removes a predetermined keyword from processing, the apparatus excludes that keyword from processing. The apparatus can therefore appropriately determine spam blogs containing predetermined keywords that go in and out of fashion. In addition, the apparatus receives correctness determination information, which indicates whether the machine learning determination for a blog article whose result was output is correct, and stores it in association with the blog article. Based on the correctness determination information stored during a predetermined period, it calculates the error rate of the machine learning determination for each predetermined keyword and, when the error rate is equal to or greater than a predetermined threshold, determines that the trend for that keyword has ended. Thus, when an administrator who has reviewed the output results judges whether each machine learning determination is correct and enters the correctness determination information, the apparatus calculates the per-keyword error rate, compares it with a preset threshold, and can automatically determine that the "temporary" popularity of a predetermined keyword has ended.
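The first trend end determination described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Feedback` record, the function name, and the assumption that the caller passes only feedback falling within the predetermined period are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    """One administrator verdict on a machine learning result (hypothetical record)."""
    keyword: str      # registered keyword contained in the article
    article_id: str   # identifier of the judged blog article
    correct: bool     # True if the administrator agreed with the classifier


def ended_keywords_by_error_rate(feedback, threshold):
    """Return the keywords whose error rate is equal to or greater than threshold.

    `feedback` is assumed to contain only records from the predetermined period.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for fb in feedback:
        totals[fb.keyword] += 1
        if not fb.correct:
            errors[fb.keyword] += 1
    # Error rate = misjudged articles / judged articles, computed per keyword.
    return {kw for kw, n in totals.items() if errors[kw] / n >= threshold}
```

Keywords with no feedback in the period never appear in the result, so they are left registered; whether that matches the intended behavior is not stated in the source.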

(2) A spam blog determination apparatus comprising: predetermined keyword storage control means for storing a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation for the predetermined keyword; machine learning means for determining by machine learning, in response to receiving a blog article to be determined, whether the blog article is a spam blog, using the predetermined keyword stored in the predetermined keyword storage means as a feature; spam determination result output means for outputting, among the blog articles determined by the machine learning means, each blog article that contains the predetermined keyword stored in the predetermined keyword storage means in association with the result of the machine learning determination of whether it is a spam blog; adjustment return means for deleting the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation for the predetermined keyword; and second trend end determination means for determining that the trend has ended for the predetermined keyword when, among the blog articles to be determined that contain the predetermined keyword during a predetermined period, the proportion of blog articles that the machine learning means has determined to be spam blogs is equal to or less than a predetermined threshold.

With this configuration, the spam blog determination apparatus determines that the trend has ended for a predetermined keyword when, among the blog articles to be determined that contain the keyword during a predetermined period, the proportion that the machine learning means determined to be spam blogs is equal to or less than a predetermined threshold. The apparatus can therefore determine, according to a predetermined criterion, that the "temporary" popularity of a predetermined keyword has ended.
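The second trend end determination can be sketched in a few lines. The function name and the `(keyword, is_spam)` pair format are assumptions made for illustration; the source specifies only the criterion (spam proportion equal to or less than a threshold within a predetermined period).

```python
def ended_keywords_by_spam_ratio(results, threshold):
    """Return the keywords whose spam proportion is equal to or less than threshold.

    `results` is an iterable of (keyword, is_spam) pairs, one per blog article
    containing the keyword, assumed to be limited to the predetermined period.
    """
    counts = {}
    for keyword, is_spam in results:
        total, spam = counts.get(keyword, (0, 0))
        counts[keyword] = (total + 1, spam + (1 if is_spam else 0))
    # Proportion = articles judged spam / articles containing the keyword.
    return {kw for kw, (total, spam) in counts.items() if spam / total <= threshold}
```

Unlike the error-rate criterion, this one needs no administrator feedback: it relies only on the classifier's own verdicts.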

(3) The spam blog determination apparatus according to (1) or (2), wherein the adjustment return means deletes, from the predetermined keyword storage means, the predetermined keyword for which the first trend end determination means or the second trend end determination means has determined that the trend has ended.

With this configuration, the spam blog determination apparatus deletes from the predetermined keyword storage means any predetermined keyword for which the trend has been determined to have ended, so the predetermined keyword storage means can be maintained automatically.

(4) A spam blog determination method comprising: a predetermined keyword storage step in which a computer stores a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation for the predetermined keyword; a machine learning step in which the computer, in response to receiving a blog article to be determined, determines by machine learning whether the blog article is a spam blog, using the predetermined keyword stored in the predetermined keyword storage means as a feature; a spam determination result output step in which the computer outputs, among the blog articles subject to the spam blog determination, each blog article that contains the predetermined keyword stored in the predetermined keyword storage means in association with the result of the machine learning determination of whether it is a spam blog; an adjustment return step in which the computer deletes the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation for the predetermined keyword; a correctness determination information storage step in which the computer, in response to receiving correctness determination information, which is information indicating whether the result of the machine learning determination is correct, for a blog article for which the result of the machine learning determination was output in the spam determination result output step, stores the correctness determination information in association with the blog article; and a first trend end determination step in which the computer calculates, for each predetermined keyword, an error rate of the machine learning determination based on the correctness determination information stored in the correctness determination information storage step during a predetermined period, and determines that the trend has ended for any predetermined keyword whose error rate is equal to or greater than a predetermined threshold.

(5) A spam blog determination method comprising: a predetermined keyword storage step in which a computer stores a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation for the predetermined keyword; a machine learning step in which the computer, in response to receiving a blog article to be determined, determines by machine learning whether the blog article is a spam blog, using the predetermined keyword stored in the predetermined keyword storage means as a feature; a spam determination result output step in which the computer outputs, among the blog articles subject to the spam blog determination, each blog article that contains the predetermined keyword stored in the predetermined keyword storage means in association with the result of the machine learning determination of whether it is a spam blog; an adjustment return step in which the computer deletes the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation for the predetermined keyword; and a second trend end determination step in which the computer determines that the trend has ended for the predetermined keyword when, among the blog articles to be determined that contain the predetermined keyword during a predetermined period, the proportion of blog articles determined to be spam blogs in the machine learning step is equal to or less than a predetermined threshold.

According to the present invention, it is possible to provide a spam blog determination apparatus and method that determine spam blogs while making the administrator's work easier.

FIG. 1 is a diagram showing the overall configuration of the spam blog determination system and the functional configuration of the spam blog determination apparatus according to the first embodiment.
FIG. 2 is a diagram showing examples of the various data stored in the storage unit of the spam blog determination apparatus according to the first embodiment.
FIG. 3 is a diagram for explaining the result of determination by machine learning in the spam blog determination apparatus according to the first embodiment.
FIG. 4 is a flowchart of the predetermined keyword reflection process of the spam blog determination apparatus according to the first embodiment.
FIG. 5 is a flowchart of the spam determination process of the spam blog determination apparatus according to the first embodiment.
FIG. 6 is a diagram showing the overall configuration of the spam blog determination system and the functional configuration of the spam blog determination apparatus according to the second embodiment.
FIG. 7 is a flowchart of the predetermined keyword deletion process of the spam blog determination apparatus according to the second embodiment.
FIG. 8 is a diagram showing the overall configuration of the spam blog determination system and the functional configuration of the spam blog determination apparatus according to the third embodiment.
FIG. 9 is a flowchart of the predetermined keyword deletion process of the spam blog determination apparatus according to the third embodiment.

Embodiments of the present invention are described below with reference to the drawings. These are merely examples, and the technical scope of the present invention is not limited to them.

(First Embodiment)
[Overall Configuration of Spam Blog Determination System 100 and Functional Configuration of Spam Blog Determination Apparatus 1]
FIG. 1 is a diagram showing the overall configuration of the spam blog determination system 100 and the functional configuration of the spam blog determination apparatus 1 according to the first embodiment. FIG. 2 is a diagram showing examples of the various data stored in the storage unit 20 of the spam blog determination apparatus 1 according to the first embodiment. FIG. 3 is a diagram for explaining the result of determination by machine learning in the spam blog determination apparatus 1 according to the first embodiment.

As shown in FIG. 1, the spam blog determination system 100 comprises a spam blog determination apparatus 1, a blog server 3, a management terminal 5, and a communication network 9.

The spam blog determination apparatus 1 stores predetermined keywords received from the management terminal 5, uses the stored predetermined keywords as features to determine by machine learning whether a blog article received from the blog server 3 is a spam blog, and outputs the determination result to the management terminal 5. The spam blog determination apparatus 1 also deletes a predetermined keyword specified by the management terminal 5 from the stored keywords, thereby removing it from the features used in the machine learning determination of whether an article is a spam blog. The spam blog determination apparatus 1 includes a control unit 10 and a storage unit 20.

The control unit 10 includes predetermined keyword receiving means 11, predetermined keyword storage control means 12, blog article receiving means 13, machine learning means 14, spam determination result output means 15, and adjustment return means 17.

The predetermined keyword receiving means 11 is a control unit that receives a registration request for a predetermined keyword sent from the management terminal 5. A predetermined keyword is a keyword that goes in and out of fashion and is likely to be targeted by spam blogs. Unlike a "permanent" keyword, such a keyword is "temporary," and how long it will remain popular cannot be predicted. A spam blog is, for example, a meaningless blog article that uses words currently being widely talked about and links to a certain Web page in order to raise that page's ranking. The management terminal 5 therefore sends, as predetermined keywords, faddish words that are widely talked about and frequently used in spam blogs, and the predetermined keyword receiving means 11 receives them.

The predetermined keyword storage control means 12 is a control unit that stores the predetermined keyword received from the management terminal 5 in the predetermined keyword DB 21 (DB: database), which serves as the predetermined keyword storage means.

The predetermined keyword DB 21, an example of which is shown in FIG. 2(a), is a DB that stores the predetermined keywords received from the management terminal 5. It consists of the following fields: serial number 21a, predetermined keyword 21b, registration date 21c, and type 21d.

The serial number 21a holds a sequential number, starting from 1, assigned by the control unit 10 in the order in which keywords are received from the management terminal 5. The predetermined keyword 21b holds the predetermined keyword received from the management terminal 5. The registration date 21c holds the date on which the predetermined keyword was received from the management terminal 5. The type 21d holds the category of the predetermined keyword. The type stored in the type 21d may be received from the management terminal 5 together with the predetermined keyword.
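As a rough illustration of the record layout just described, the predetermined keyword DB 21 could be modeled as below. The class and method names are hypothetical, the in-memory list stands in for whatever database the apparatus actually uses, and the choice not to reuse serial numbers after a deletion is an assumption the source does not address.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KeywordRecord:
    serial_no: int       # field 21a: sequential number starting from 1
    keyword: str         # field 21b: the predetermined keyword
    registered_on: date  # field 21c: date the keyword was received
    type: str            # field 21d: category of the keyword


class KeywordDB:
    """In-memory stand-in for the predetermined keyword DB 21 (illustrative only)."""

    def __init__(self):
        self._records = []
        self._next_serial = 1

    def register(self, keyword, type_, today):
        """Store a keyword in arrival order and return the new record."""
        record = KeywordRecord(self._next_serial, keyword, today, type_)
        self._records.append(record)
        self._next_serial += 1  # assumption: serial numbers are never reused
        return record

    def delete(self, keyword):
        """Remove a keyword, as the adjustment return means 17 would."""
        self._records = [r for r in self._records if r.keyword != keyword]

    def keywords(self):
        """Return the currently registered keywords in registration order."""
        return [r.keyword for r in self._records]
```

Registering and deleting keywords are the only operations the administrator needs, which is the point of the design: the set of features changes without any change to program logic.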

Returning to FIG. 1, the blog article receiving means 13 is a control unit that receives blog articles from the blog server 3. Here, a blog article is an individual entry making up a blog. The blog article receiving means 13 may receive a blog article sent from the blog server 3 each time a blog article is updated on the blog server 3. Alternatively, the spam blog determination apparatus 1 may request the blog server 3 to send blog articles, for example at a fixed time every day, and thereby receive newly updated blog articles from the blog server 3.

The machine learning means 14 is a control unit that determines whether a blog article received by the blog article receiving means 13 is a spam blog, using, for example, a learning model based on an SVM (Support Vector Machine) engine 22. To this end, the machine learning means 14 learns in advance from blog articles of spam blogs and blog articles that are not spam blogs (normal blog articles), performs statistical processing, and generates reference data representing the criterion for distinguishing spam blog articles from normal blog articles. Determination with the SVM engine 22 decides, from training examples each belonging to one of two classes (the set of spam blog articles and the set of normal blog articles), which class a blog article that is an unknown example belongs to. Here, by using the predetermined keywords stored in the predetermined keyword DB 21 as features, the machine learning means 14 determines, for a blog article that contains a predetermined keyword, whether that article is a spam blog.

The learning result 30 of the learning model using the SVM engine 22 is now described with reference to FIG. 3. The machine learning means 14 using the SVM engine 22 is a classifier that estimates the label of a blog article whose label (spam blog or not) is unknown. After converting spam blogs and other articles into examples, it calculates two discriminating surfaces 31 and 32, one for the examples generated from spam blog articles and one for the examples generated from normal blog articles, such that the distance (margin) between them is maximized in the feature space. In this way, the machine learning means 14 learns after converting spam blogs and the like into the form of examples.

The machine learning means 14 uses the examples generated from spam blogs and the examples generated from normal blog articles that lie closest to the discriminating surfaces 31 and 32 as support vectors 33 and 34, respectively, for classifying examples whose labels are unknown. By converting spam blogs and normal blog articles into examples and applying statistical processing, the machine learning means 14 generates a discriminating surface, which is identification data for distinguishing the group of examples generated from spam blogs from the group of examples generated from normal blog articles.
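For reference, the margin maximization described here corresponds to the textbook hard-margin SVM formulation. The symbols below are not from the source: x_i is the feature vector of training example i, y_i is its label (+1 for spam, -1 for normal), and w and b define the separating hyperplane. The source does not say whether a hard or soft margin is used; the hard-margin form is shown only because it is the simplest.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Under this reading, the two discriminating surfaces 31 and 32 are the hyperplanes w^T x + b = +1 and w^T x + b = -1, the margin between them is 2/||w||, and the support vectors 33 and 34 are the training examples that satisfy the constraint with equality.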

Using the learning result 30, the machine learning means 14 then classifies a blog article whose label (spam blog or not) is unknown according to where it falls, using the predetermined keywords as features. A feature is something that characterizes the input data. In this way, the spam blog determination apparatus 1 can make a machine learning determination using the SVM engine 22, which is a known model. By using the machine learning means 14, the spam blog determination apparatus 1 can adapt simply by maintaining the features, without redoing the machine learning itself. The spam blog determination apparatus 1 can therefore be used to determine whether a blog is a spam blog.
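The idea of using registered keywords as features can be sketched as follows. Everything here is an assumption made for illustration: the source does not say how features are encoded, so a binary keyword-presence vector is used, and a linear decision function with caller-supplied weights stands in for the trained SVM engine 22.

```python
def keyword_features(text, keywords):
    """Binary feature vector: 1.0 where the keyword occurs in the text, else 0.0."""
    return [1.0 if kw in text else 0.0 for kw in keywords]


def linear_decision(features, weights, bias):
    """Signed score of a linear classifier; positive means 'spam' in this sketch."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(text, keywords, weights, bias):
    """Return None if no registered keyword occurs, else True (spam) or False."""
    features = keyword_features(text, keywords)
    if not any(features):
        # Only articles containing a registered keyword are reported.
        return None
    return linear_decision(features, weights, bias) > 0
```

A real classifier would presumably combine these keyword features with other signals; the sketch shows only how the contents of the keyword DB determine the shape of the feature vector.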

Returning to FIG. 1, the spam determination result output means 15 is a control unit that outputs to the management terminal 5, as the result of the determination by the machine learning means 14, each blog article containing a predetermined keyword together with its determination result.

The adjustment return means 17 is a control unit that, upon receiving a deletion request for a predetermined keyword sent from the management terminal 5, deletes that predetermined keyword from the predetermined keyword DB 21.

The storage unit 20 includes the predetermined keyword DB 21, the SVM engine 22, and a determination result table 23.

The predetermined keyword DB 21 and the SVM engine 22 are as described above. The determination result table 23 is a data table that stores the results of the determination by the machine learning means 14, namely the blog articles containing predetermined keywords and their determination results, which the spam determination result output means 15 sends to the management terminal 5.

The determination result table 23, an example of which is shown in FIG. 2(b), has the following fields: predetermined keyword 23a, blog article 23b, and spam blog determination 23c. The predetermined keyword 23a holds the predetermined keyword contained in the blog article. The blog article 23b holds a blog ID (ID: identifier) that identifies the blog article; it may instead hold the blog article itself. The spam blog determination 23c holds a code for the result of the determination by the machine learning means 14 of whether the article is a spam blog. A value of "1" means that the machine learning means 14 determined the article to be a spam blog, and a value of "0" means that it determined the article not to be a spam blog.
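A minimal sketch of how rows of the determination result table 23 might be assembled is given below. The `ResultRow` class, the `build_result_rows` function, and the `is_spam` callable (a stand-in for the machine learning means 14) are hypothetical; only the three fields and the 1/0 coding come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ResultRow:
    keyword: str     # field 23a: registered keyword found in the article
    article_id: str  # field 23b: blog ID identifying the article
    spam_code: int   # field 23c: 1 = judged spam, 0 = judged not spam


def build_result_rows(articles, keywords, is_spam):
    """Pair each article containing a registered keyword with its verdict.

    `articles` maps a blog ID to the article text; `is_spam` is any callable
    that takes the text and returns True or False.
    """
    rows = []
    for article_id, text in articles.items():
        for kw in keywords:
            if kw in text:
                rows.append(ResultRow(kw, article_id, 1 if is_spam(text) else 0))
    return rows
```

An article containing several registered keywords yields one row per keyword in this sketch, which keeps per-keyword statistics simple to compute; the source does not state how such articles are recorded.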

第１実施形態のスパムブログ判定装置１のハードウェアは、一般的なコンピュータによって構成してもよい。一般的なコンピュータは、例えば、制御部１０として、中央処理装置（ＣＰＵ）を備える他、記憶部２０として、メモリ（ＲＡＭ、ＲＯＭ）、ハードディスク（ＨＤＤ）及び光ディスク（ＣＤ、ＤＶＤ等）を、ネットワーク通信装置として、各種有線及び無線ＬＡＮ装置を、表示装置として、例えば、液晶ディスプレイ、プラズマディスプレイ等の各種ディスプレイを、入力装置として、例えば、キーボード及びポインティング・デバイス（マウス、トラッキングボール等）を適宜備え、これらはバスラインにより接続されている。このような一般的なコンピュータにおいて、ＣＰＵは、スパムブログ判定装置１を統括的に制御し、各種プログラムを適宜読み出して実行することにより、上述したハードウェアと協働し、本発明に係る各種機能を実現している。 The hardware of the spam blog determination apparatus 1 of the first embodiment may be a general-purpose computer. Such a computer includes, for example, a central processing unit (CPU) as the control unit 10; memory (RAM, ROM), a hard disk (HDD), and an optical disk (CD, DVD, etc.) as the storage unit 20; various wired and wireless LAN devices as network communication devices; various displays such as a liquid crystal display or a plasma display as display devices; and a keyboard and a pointing device (mouse, trackball, etc.) as input devices, as appropriate, all connected by a bus line. In such a computer, the CPU performs overall control of the spam blog determination apparatus 1 and, by reading and executing various programs as appropriate, cooperates with the above hardware to realize the various functions according to the present invention.

ブログサーバ３は、ブログ記事を記憶するサーバであり、ブログ記事を記憶する記憶部と、ブログサーバ３の全体を制御する制御部とを備える。ブログサーバ３のハードウェアは、一般的なコンピュータによって構成してよい。 The blog server 3 is a server that stores blog articles, and includes a storage unit that stores blog articles and a control unit that controls the entire blog server 3. The hardware of the blog server 3 may be configured by a general computer.

管理端末５は、例えば、パーソナルコンピュータ（ＰＣ）や、携帯電話機等の携帯端末である。管理端末５は、通信機能を有し、スパムブログ判定装置１に対してデータの送受信が可能な端末であれば、どのような装置でもよい。 The management terminal 5 is, for example, a personal computer (PC) or a mobile terminal such as a mobile phone. The management terminal 5 may be any device as long as it has a communication function and can transmit / receive data to / from the spam blog determination device 1.

なお、第１実施形態では、スパムブログ判定装置１と、ブログサーバ３とを別々の装置として説明しているが、スパムブログ判定装置１がブログサーバ３の機能をも有して、１台のコンピュータによって実現してもよい。 In the first embodiment, the spam blog determination apparatus 1 and the blog server 3 are described as separate devices. However, the spam blog determination apparatus 1 may also have the functions of the blog server 3, so that both are realized by a single computer.

通信ネットワーク９は、スパムブログ判定装置１と、ブログサーバ３と、管理端末５との間で通信を行うための、例えば、インターネット等の通信回線である。通信ネットワーク９は、有線であってもよいし、その一部又は全部が無線であってもよい。 The communication network 9 is a communication line such as the Internet for performing communication among the spam blog determination apparatus 1, the blog server 3, and the management terminal 5. The communication network 9 may be wired or part or all of it may be wireless.

［スパムブログ判定装置１の処理］ [Processing of Spam Blog Determination Apparatus 1]
次に、スパムブログ判定装置１での処理について説明する。最初に、所定キーワードの反映について説明する。図４は、第１実施形態に係るスパムブログ判定装置１の所定キーワード反映処理のフローチャートである。この処理は、管理端末５から所定キーワードの指定を受け付ける都度実行される。 Next, processing in the spam blog determination apparatus 1 will be described. First, reflection of a predetermined keyword will be described. FIG. 4 is a flowchart of the predetermined keyword reflection process of the spam blog determination apparatus 1 according to the first embodiment. This process is executed each time a designation of a predetermined keyword is received from the management terminal 5.

Ｓ１：制御部１０（所定キーワード受付手段１１）は、管理端末５から所定キーワードの登録の指定を受け付けたか否かを判断する。所定キーワードの登録の指定を受け付けた場合（Ｓ１：ＹＥＳ）には、制御部１０は、処理をＳ２に移す。他方、所定キーワードの登録の指定を受け付けていない場合（Ｓ１：ＮＯ）には、制御部１０は、処理をＳ３に移す。 S1: The control unit 10 (predetermined keyword accepting means 11) determines whether or not designation of registration of a predetermined keyword has been accepted from the management terminal 5. When the designation of registration of the predetermined keyword is received (S1: YES), the control unit 10 moves the process to S2. On the other hand, when the designation of the registration of the predetermined keyword is not accepted (S1: NO), the control unit 10 moves the process to S3.

Ｓ２：制御部１０（所定キーワード記憶制御手段１２）は、Ｓ１で受け付けた所定キーワードを、所定キーワードＤＢ２１に記憶させる。 S2: The control unit 10 (predetermined keyword storage control means 12) stores the predetermined keyword received in S1 in the predetermined keyword DB 21.

Ｓ３：制御部１０（調整戻し手段１７）は、管理端末５から所定キーワードの削除の指定を受け付けたか否かを判断する。所定キーワードの削除の指定を受け付けた場合（Ｓ３：ＹＥＳ）には、制御部１０は、処理をＳ４に移す。他方、所定キーワードの削除の指定を受け付けていない場合（Ｓ３：ＮＯ）には、制御部１０は、本処理を終了する。 S3: The control unit 10 (adjustment return means 17) determines whether or not the designation of deletion of the predetermined keyword has been received from the management terminal 5. When designation of deletion of the predetermined keyword is received (S3: YES), the control unit 10 moves the process to S4. On the other hand, when the designation of deletion of the predetermined keyword is not accepted (S3: NO), the control unit 10 ends this process.

Ｓ４：制御部１０（調整戻し手段１７）は、Ｓ３で受け付けた所定キーワードを、所定キーワードＤＢ２１から削除する。その後、制御部１０は、本処理を終了する。 S4: The control unit 10 (adjustment return means 17) deletes the predetermined keyword received in S3 from the predetermined keyword DB 21. The control unit 10 then ends this process.

このように、スパムブログ判定装置１は、管理端末５から流行り廃りのある所定キーワードの登録指定を受け付けたことで、所定キーワードを含むスパムブログの機械学習による判定処理で用いる素性として、所定キーワードＤＢ２１に所定キーワードを登録できる。また、スパムブログ判定装置１は、管理端末５から所定キーワードの削除指定を受け付けたことで、所定キーワードを機械学習による判定処理で用いる素性から外すことができる。なお、管理端末５のユーザである管理者は、後述のスパム判定結果出力手段１５の出力する判定結果を見て、スパムブログではない、と判定されたブログに含まれる所定キーワードを素性から外すタイミングを検討することができる。 As described above, when the spam blog determination apparatus 1 receives from the management terminal 5 a registration designation for a predetermined keyword whose popularity rises and falls, it can register that keyword in the predetermined keyword DB 21 as a feature to be used in the machine-learning determination of spam blogs containing the keyword. Likewise, when the spam blog determination apparatus 1 receives a deletion designation for a predetermined keyword from the management terminal 5, it can remove that keyword from the features used in the machine-learning determination. The administrator, who is the user of the management terminal 5, can look at the determination results output by the spam determination result output means 15 described later and consider when to remove from the features a predetermined keyword contained in blogs that were determined not to be spam blogs.
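The keyword reflection process of S1 to S4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the predetermined keyword DB 21 is modeled as an in-memory set, and the function name, the `action` strings, and the sample keyword are hypothetical.

```python
from typing import Set


def handle_keyword_request(keyword_db: Set[str], action: str, keyword: str) -> None:
    """Apply one request from the management terminal to the keyword DB.

    `action` is a hypothetical encoding of the request type:
    "register" corresponds to S1: YES -> S2, "delete" to S3: YES -> S4.
    Any other value corresponds to S1: NO and S3: NO, so nothing changes.
    """
    if action == "register":
        keyword_db.add(keyword)       # S2: store the keyword in the DB
    elif action == "delete":
        keyword_db.discard(keyword)   # S4: remove the keyword from the DB


# Usage with a hypothetical keyword.
keyword_db: Set[str] = set()
handle_keyword_request(keyword_db, "register", "trending-term")
registered = set(keyword_db)
handle_keyword_request(keyword_db, "delete", "trending-term")
```

Using `discard` rather than `remove` means that a deletion request for a keyword that is not registered is silently ignored; the specification does not say how that case is handled, so this is a design assumption.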

次に、ブログ記事のスパム判定について説明する。この処理は、ブログサーバ３からブログ記事を受け付ける都度実行される。図５は、第１実施形態に係るスパムブログ判定装置１のスパム判定処理のフローチャートである。 Next, spam determination for blog articles will be described. This process is executed every time a blog article is received from the blog server 3. FIG. 5 is a flowchart of the spam determination process of the spam blog determination apparatus 1 according to the first embodiment.

Ｓ１１：制御部１０（ブログ記事受付手段１３）は、ブログサーバ３からブログ記事を受け付ける。 S11: The control unit 10 (blog article receiving means 13) receives a blog article from the blog server 3.

Ｓ１２：制御部１０（機械学習手段１４）は、Ｓ１１において受け付けたブログ記事に対して機械学習処理を行う。機械学習処理とは、例えば、ＳＶＭエンジン２２を用いた学習モデルにより、所定キーワードＤＢ２１に記憶された所定キーワードを素性として使用して、Ｓ１１で受け付けたブログ記事がスパムブログであるか否かを判定する処理をいう。 S12: The control unit 10 (machine learning means 14) performs machine learning processing on the blog article received in S11. Here, machine learning processing means, for example, processing that determines whether or not the blog article received in S11 is a spam blog by means of a learning model based on the SVM engine 22, using the predetermined keywords stored in the predetermined keyword DB 21 as features.

Ｓ１３：制御部１０（スパム判定結果出力手段１５）は、Ｓ１２において実行した機械学習処理の結果（スパム判定結果）を管理端末５に対して出力する。また、制御部１０は、スパム判定結果を判定結果テーブル２３に記憶する。その後、制御部１０は、本処理を終了する。 S13: The control unit 10 (spam determination result output means 15) outputs the result of the machine learning processing executed in S12 (the spam determination result) to the management terminal 5. The control unit 10 also stores the spam determination result in the determination result table 23. The control unit 10 then ends this process.

なお、制御部１０は、図４で説明した所定キーワード反映処理と、図５で説明したスパム判定処理とを並行して行ってもよい。 Note that the control unit 10 may perform the predetermined keyword reflection process described in FIG. 4 and the spam determination process described in FIG. 5 in parallel.

このように、スパムブログ判定装置１は、ブログサーバ３からブログ記事を受け付けたことで、所定キーワードを含むスパムブログの判定を行い、結果を出力する。よって、スパムブログ判定装置１は、管理者に予め所定キーワードの指定を行わせるだけで、所定キーワードを含むスパムブログを判定することができる。また、スパムブログ判定装置１は、管理端末５からの指示によって所定キーワードを登録及び削除した所定キーワードＤＢ２１を用いることで、流行り廃りのある所定キーワードを含むスパムブログの判定を適切に行うことができる。 As described above, when the spam blog determination apparatus 1 receives a blog article from the blog server 3, it determines whether the article is a spam blog containing a predetermined keyword and outputs the result. The spam blog determination apparatus 1 can therefore determine spam blogs containing a predetermined keyword simply by having the administrator designate the keyword in advance. In addition, by using the predetermined keyword DB 21, in which predetermined keywords are registered and deleted according to instructions from the management terminal 5, the spam blog determination apparatus 1 can appropriately determine spam blogs containing predetermined keywords whose popularity rises and falls.
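The flow of S11 to S13 can be sketched as below. This Python sketch makes several assumptions that are not in the specification: the trained model of the SVM engine 22 is replaced by a hand-written linear decision function of the form sign(w·x + b), the weights and bias are made-up values rather than learned ones, and only the presence of each predetermined keyword is used as a feature, whereas an actual model would presumably combine the keywords with other features. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, int]  # (keyword 23a, blog ID 23b, spam code 23c)


def extract_features(article_text: str, keywords: List[str]) -> List[int]:
    """Binary features: 1 if the keyword occurs in the article, else 0."""
    return [1 if kw in article_text else 0 for kw in keywords]


def linear_decision(features: List[int], weights: List[float], bias: float) -> int:
    """Stand-in for the trained classifier: 1 (spam) if w.x + b > 0, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def determine_spam(blog_id: str, article_text: str, keyword_db: Set[str],
                   weights: Dict[str, float], bias: float,
                   result_table: List[Row]) -> int:
    keywords = sorted(keyword_db)                          # keywords currently in the DB
    features = extract_features(article_text, keywords)    # S12: build features
    code = linear_decision(features, [weights.get(kw, 0.0) for kw in keywords], bias)
    # S13: record the result for every predetermined keyword the article contains.
    for kw, present in zip(keywords, features):
        if present:
            result_table.append((kw, blog_id, code))
    return code


# Usage with made-up weights and articles.
table: List[Row] = []
db = {"cheap tickets"}
first = determine_spam("b1", "buy cheap tickets now", db, {"cheap tickets": 1.5}, -1.0, table)
second = determine_spam("b2", "my garden diary", db, {"cheap tickets": 1.5}, -1.0, table)
```

Because the features are rebuilt from the keyword DB on every call, a keyword that has been deleted from the DB stops contributing to the decision immediately, which is the effect the description attributes to removing a keyword from the features.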

（第２実施形態） (Second Embodiment)
第１実施形態では、管理端末から受け付けた所定キーワードを、記憶されたキーワードから削除することで、スパムブログであるか否かの機械学習による判定処理で用いる素性から外すものであった。第２実施形態では、所定の条件を満たす所定キーワードを自動的に削除することで、スパムブログであるか否かの機械学習による判定処理で用いる素性から外すものを説明する。なお、以降の説明において、上述した第１実施形態と同様の機能を果たす部分には、同一の符号又は末尾に同一の符号を付して、重複する説明を適宜省略する。 In the first embodiment, a predetermined keyword received from the management terminal is deleted from the stored keywords and thereby removed from the features used in the machine-learning determination of whether an article is a spam blog. The second embodiment describes a configuration in which a predetermined keyword that satisfies a predetermined condition is deleted automatically and thereby removed from those features. In the following description, parts that perform the same functions as in the first embodiment are given the same reference numerals, or reference numerals with the same trailing digits, and redundant descriptions are omitted as appropriate.

［スパムブログ判定システム２００の全体構成及びスパムブログ判定装置２０１の機能構成］ [Overall Configuration of Spam Blog Determination System 200 and Functional Configuration of Spam Blog Determination Apparatus 201]
図６は、第２実施形態に係るスパムブログ判定システム２００の全体構成及びスパムブログ判定装置２０１の機能構成を示す図である。 FIG. 6 is a diagram showing the overall configuration of the spam blog determination system 200 and the functional configuration of the spam blog determination apparatus 201 according to the second embodiment.

スパムブログ判定システム２００は、スパムブログ判定装置２０１と、ブログサーバ３と、管理端末５と、通信ネットワーク９とにより構成される。 The spam blog determination system 200 includes a spam blog determination apparatus 201, a blog server 3, a management terminal 5, and a communication network 9.

スパムブログ判定装置２０１は、制御部２１０と、記憶部２０とを備える。制御部２１０は、所定キーワード受付手段１１と、所定キーワード記憶制御手段１２と、ブログ記事受付手段１３と、機械学習手段１４と、スパム判定結果出力手段１５との他に、第２流行終了判定手段２１６と、調整戻し手段２１７とを備える。 The spam blog determination apparatus 201 includes a control unit 210 and a storage unit 20. In addition to the predetermined keyword receiving means 11, the predetermined keyword storage control means 12, the blog article receiving means 13, the machine learning means 14, and the spam determination result output means 15, the control unit 210 includes second trend end determination means 216 and adjustment return means 217.

第２流行終了判定手段２１６は、所定期間（例えば、１週間）において、所定キーワードを含む判定対象のブログ記事のうち、機械学習手段１４によりスパムブログであると判定されたブログ記事の割合が所定の閾値（例えば、２０％）以下になった場合に、所定キーワードの流行が終了したと判定する制御部である。 The second trend end determination means 216 is a control unit that determines that the trend of a predetermined keyword has ended when, in a predetermined period (for example, one week), the proportion of blog articles determined to be spam blogs by the machine learning means 14 among the determination-target blog articles containing the keyword falls to or below a predetermined threshold (for example, 20%).

調整戻し手段２１７は、第２流行終了判定手段２１６により流行が終了したと判定された所定キーワードを、所定キーワードＤＢ２１から削除する制御部である。 The adjustment return unit 217 is a control unit that deletes, from the predetermined keyword DB 21, the predetermined keyword determined to have ended by the second trend end determination unit 216.

［スパムブログ判定装置２０１の処理］ [Processing of Spam Blog Determination Apparatus 201]
次に、スパムブログ判定装置２０１での処理について説明する。図７は、第２実施形態に係るスパムブログ判定装置２０１の所定キーワード削除処理のフローチャートである。なお、所定キーワードの登録については、第１実施形態のＳ１〜Ｓ２（図４）の処理と同様であり、スパム判定については、第１実施形態のＳ１１〜Ｓ１３（図５）の処理と同様である。 Next, processing in the spam blog determination apparatus 201 will be described. FIG. 7 is a flowchart of the predetermined keyword deletion process of the spam blog determination apparatus 201 according to the second embodiment. Registration of a predetermined keyword is the same as the processing of S1 to S2 (FIG. 4) of the first embodiment, and spam determination is the same as the processing of S11 to S13 (FIG. 5) of the first embodiment.

Ｓ２１：制御部２１０（第２流行終了判定手段２１６）は、所定期間において、所定キーワードを含む判定対象のブログ記事のうち、機械学習手段１４によりスパムブログであると判定されたブログ記事の割合が所定の閾値以下になっているか否かを判断する。所定の閾値以下になっている場合（Ｓ２１：ＹＥＳ）には、制御部２１０は、処理をＳ２２に移す。他方、所定の閾値以下になっていない場合（Ｓ２１：ＮＯ）には、制御部２１０は、本処理を終了する。 S21: The control unit 210 (second trend end determination means 216) determines whether, in the predetermined period, the proportion of blog articles determined to be spam blogs by the machine learning means 14 among the determination-target blog articles containing the predetermined keyword is equal to or less than a predetermined threshold. If it is equal to or less than the threshold (S21: YES), the control unit 210 moves the process to S22. Otherwise (S21: NO), the control unit 210 ends this process.

Ｓ２２：制御部２１０（調整戻し手段２１７）は、Ｓ２１で所定の閾値以下になっていると判定された所定キーワードを、所定キーワードＤＢ２１から削除する。その後、制御部２１０は、本処理を終了する。 S22: The control unit 210 (adjustment return means 217) deletes the predetermined keyword determined to be equal to or less than the predetermined threshold in S21 from the predetermined keyword DB 21. Thereafter, the control unit 210 ends this process.

このように、スパムブログ判定装置２０１は、所定期間において所定キーワードを含む判定対象のブログ記事のうち、機械学習手段１４がスパムブログであると判定したブログ記事の割合が所定の閾値以下となった場合に、所定キーワードについて流行が終了したと判定する。よって、スパムブログ判定装置２０１は、所定の基準によって所定キーワードの「一時的」な流行の終了状態を判断できる。 As described above, the spam blog determination apparatus 201 determines that the trend for a predetermined keyword has ended when, in a predetermined period, the proportion of blog articles that the machine learning means 14 determined to be spam blogs among the determination-target blog articles containing the keyword is equal to or less than a predetermined threshold. The spam blog determination apparatus 201 can therefore judge, by a predetermined criterion, that the "temporary" trend of a predetermined keyword has ended.

そして、スパムブログ判定装置２０１は、流行が終了したと判定された所定キーワードを、所定キーワードＤＢ２１から削除するので、所定キーワードＤＢ２１のメンテナンスを自動的に行うことができる。 Because the spam blog determination apparatus 201 deletes from the predetermined keyword DB 21 any predetermined keyword whose trend has been determined to have ended, maintenance of the predetermined keyword DB 21 can be performed automatically.
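The check in S21 and the deletion in S22 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rows are taken to be entries of the determination result table 23 that have already been restricted to the predetermined period, the 20% default comes from the example value in the description, and the handling of a keyword with no articles in the period (it is kept) is an assumption, since the specification does not address that case. Function names and sample data are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Row = Tuple[str, str, int]  # (keyword 23a, blog ID 23b, spam code 23c)


def spam_ratio(rows: List[Row], keyword: str) -> Optional[float]:
    """Share of articles containing `keyword` that were determined to be spam.

    Returns None when no article in the period contains the keyword.
    """
    codes = [code for kw, _blog_id, code in rows if kw == keyword]
    if not codes:
        return None
    return sum(codes) / len(codes)


def delete_ended_keywords(keyword_db: Set[str], rows: List[Row],
                          threshold: float = 0.20) -> List[str]:
    """S21/S22: delete keywords whose spam ratio is at or below the threshold."""
    removed = []
    for keyword in sorted(keyword_db):
        ratio = spam_ratio(rows, keyword)
        if ratio is not None and ratio <= threshold:   # S21: YES
            keyword_db.discard(keyword)                # S22
            removed.append(keyword)
    return removed


# Usage with made-up rows: keyword "a" has 1 spam article out of 5, "b" has 2 out of 2.
rows = [("a", "1", 1), ("a", "2", 0), ("a", "3", 0), ("a", "4", 0), ("a", "5", 0),
        ("b", "6", 1), ("b", "7", 1)]
db = {"a", "b", "c"}
removed = delete_ended_keywords(db, rows)
```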

（第３実施形態） (Third Embodiment)
第３実施形態では、スパムブログ判定装置が管理端末から機械学習による判定の結果の正誤判断を受け付けて、誤り率に応じて所定の条件を満たす所定キーワードを自動的に削除して、スパムブログであるか否かの機械学習による判定処理で用いる素性から外すものを説明する。 The third embodiment describes a configuration in which the spam blog determination apparatus receives from the management terminal judgments of whether the machine-learning determination results are correct, automatically deletes a predetermined keyword that satisfies a predetermined condition based on the error rate, and thereby removes it from the features used in the machine-learning determination of whether an article is a spam blog.

［スパムブログ判定システム３００の全体構成及びスパムブログ判定装置３０１の機能構成］ [Overall Configuration of Spam Blog Determination System 300 and Functional Configuration of Spam Blog Determination Apparatus 301]
図８は、第３実施形態に係るスパムブログ判定システム３００の全体構成及びスパムブログ判定装置３０１の機能構成を示す図である。 FIG. 8 is a diagram showing the overall configuration of the spam blog determination system 300 and the functional configuration of the spam blog determination apparatus 301 according to the third embodiment.

スパムブログ判定システム３００は、スパムブログ判定装置３０１と、ブログサーバ３と、管理端末５と、通信ネットワーク９とにより構成される。 The spam blog determination system 300 includes a spam blog determination device 301, a blog server 3, a management terminal 5, and a communication network 9.

スパムブログ判定装置３０１は、制御部３１０と、記憶部３２０とを備える。制御部３１０は、所定キーワード受付手段１１と、所定キーワード記憶制御手段１２と、ブログ記事受付手段１３と、機械学習手段１４と、スパム判定結果出力手段１５と、調整戻し手段２１７との他に、正誤判断情報記憶制御手段３１８と、第１流行終了判定手段３１９とを備える。 The spam blog determination apparatus 301 includes a control unit 310 and a storage unit 320. In addition to the predetermined keyword receiving means 11, the predetermined keyword storage control means 12, the blog article receiving means 13, the machine learning means 14, the spam determination result output means 15, and the adjustment return means 217, the control unit 310 includes correctness determination information storage control means 318 and first trend end determination means 319.

正誤判断情報記憶制御手段３１８は、スパムブログである、又はスパムブログではないと判定された機械学習の結果に対しての正誤判断を示す正誤判断情報を管理端末５から受け付けて、正誤判断情報ＤＢ３２４に記憶させる制御部である。受け付ける正誤判断情報は、所定キーワードを含むブログ記事に対応付けて、スパムブログであるとの判定に対して正しいか否か、及びスパムブログではないとの判定に対して正しいか否か、の計４とおりの判断がある。また、管理端末５から受け付ける正誤判断情報は、管理者が入力を行うものである。そして、管理者は、スパム判定結果出力手段１５が出力した全ての判定結果に対して正誤判定情報の入力を行う必要はなく、数件に１件のサンプリングであってもよいし、所定キーワードごとに絞り込んで行ってもよい。 The correctness determination information storage control means 318 is a control unit that receives from the management terminal 5 correctness determination information, which indicates whether a machine-learning result of "spam blog" or "not a spam blog" was correct, and stores it in the correctness determination information DB 324. The received correctness determination information is associated with a blog article containing a predetermined keyword and takes one of four forms in total: a determination of "spam blog" that is correct or incorrect, and a determination of "not a spam blog" that is correct or incorrect. The correctness determination information received from the management terminal 5 is entered by the administrator. The administrator does not need to enter correctness determination information for every determination result output by the spam determination result output means 15; the administrator may sample one result out of every several, or may narrow the input down to particular predetermined keywords.

第１流行終了判定手段３１９は、所定期間（例えば、１週間）において、正誤判断情報ＤＢ３２４に記憶された正誤判断情報に基づいて、所定キーワードごとに誤り率を算出し、その誤り率が所定の閾値（例えば、７０％）以上になった場合に、所定キーワードの流行が終了したと判定する制御部である。 The first trend end determination means 319 is a control unit that calculates an error rate for each predetermined keyword based on the correctness determination information stored in the correctness determination information DB 324 for a predetermined period (for example, one week), and determines that the trend of a predetermined keyword has ended when its error rate is equal to or greater than a predetermined threshold (for example, 70%).

正誤判断情報ＤＢ３２４は、正誤判断情報記憶制御手段３１８が受け付けた機械学習による判定の結果に対する正誤を、所定キーワードを含むブログ記事に対応付けて記憶する。 The correctness / incorrectness determination information DB 324 stores the correctness / incorrectness of the result of determination by machine learning received by the correctness / incorrectness determination information storage control unit 318 in association with a blog article including a predetermined keyword.

［スパムブログ判定装置３０１の処理］ [Processing of Spam Blog Determination Apparatus 301]
次に、スパムブログ判定装置３０１での処理について説明する。図９は、第３実施形態に係るスパムブログ判定装置３０１の所定キーワード削除処理のフローチャートである。なお、所定キーワードの登録については、第１実施形態のＳ１〜Ｓ２（図４）の処理と同様であり、スパム判定については、第１実施形態のＳ１１〜Ｓ１３（図５）の処理と同様である。 Next, processing in the spam blog determination apparatus 301 will be described. FIG. 9 is a flowchart of the predetermined keyword deletion process of the spam blog determination apparatus 301 according to the third embodiment. Registration of a predetermined keyword is the same as the processing of S1 to S2 (FIG. 4) of the first embodiment, and spam determination is the same as the processing of S11 to S13 (FIG. 5) of the first embodiment.

Ｓ３１：制御部３１０（正誤判断情報記憶制御手段３１８）は、管理端末５から正誤判断情報を受け付ける。 S31: The control unit 310 (correctness determination information storage control means 318) receives correctness determination information from the management terminal 5.

Ｓ３２：制御部３１０（正誤判断情報記憶制御手段３１８）は、受け付けた正誤判断情報を正誤判断情報ＤＢ３２４に記憶させる。 S32: The control unit 310 (correctness determination information storage control means 318) stores the received correctness determination information in the correctness determination information DB 324.

Ｓ３３：制御部３１０（第１流行終了判定手段３１９）は、所定期間において、正誤判断情報ＤＢ３２４に記憶された正誤判断情報に基づいて、所定キーワードごとに機械学習による判定の誤り率を算出する。 S33: The control unit 310 (first trend end determination means 319) calculates, for each predetermined keyword, the error rate of the machine-learning determination based on the correctness determination information stored in the correctness determination information DB 324 for the predetermined period.

Ｓ３４：制御部３１０（第１流行終了判定手段３１９）は、誤り率が所定の閾値以上になっているか否かを判断する。所定の閾値以上になっている場合（Ｓ３４：ＹＥＳ）には、制御部３１０は、処理をＳ３５に移す。他方、所定の閾値以上になっていない場合（Ｓ３４：ＮＯ）には、制御部３１０は、本処理を終了する。 S34: The control unit 310 (first trend end determination means 319) determines whether the error rate is equal to or greater than a predetermined threshold. If it is (S34: YES), the control unit 310 moves the process to S35. Otherwise (S34: NO), the control unit 310 ends this process.


Ｓ３５：制御部３１０（調整戻し手段２１７）は、Ｓ３４で所定の閾値以上になっていると判定された所定キーワードを、所定キーワードＤＢ２１から削除する。その後、制御部３１０は、本処理を終了する。 S35: The control unit 310 (adjustment return means 217) deletes from the predetermined keyword DB 21 the predetermined keyword whose error rate was determined in S34 to be equal to or greater than the predetermined threshold. The control unit 310 then ends this process.

このように、スパムブログ判定装置３０１は、所定キーワードを含む判定対象のブログ記事のうち、判定の結果を出力したブログ記事についての機械学習の判定の結果に対する正誤判断情報を受け付けてブログ記事に対応付けて正誤判断情報ＤＢ３２４に記憶し、所定期間において記憶された正誤判断情報に基づいて所定キーワードごとに誤り率を算出し、誤り率が所定の閾値以上となった場合に、所定キーワードについて流行が終了したと判定する。よって、管理者が機械学習による判定の正誤を入力するだけで、スパムブログ判定装置３０１は、所定の基準によって所定キーワードの「一時的」な流行の終了状態を自動的に判断できる。 As described above, for those determination-target blog articles containing a predetermined keyword whose determination results have been output, the spam blog determination apparatus 301 receives correctness determination information on the machine-learning determination results and stores it in the correctness determination information DB 324 in association with the blog articles. It then calculates an error rate for each predetermined keyword based on the correctness determination information stored for a predetermined period, and determines that the trend for a predetermined keyword has ended when the error rate is equal to or greater than a predetermined threshold. Thus, simply by having the administrator enter whether the machine-learning determinations were correct, the spam blog determination apparatus 301 can automatically judge, by a predetermined criterion, that the "temporary" trend of a predetermined keyword has ended.

そして、スパムブログ判定装置３０１は、流行が終了したと判定された所定キーワードを、所定キーワードＤＢ２１から削除するので、所定キーワードＤＢ２１のメンテナンスを自動的に行うことができる。 Because the spam blog determination apparatus 301 deletes from the predetermined keyword DB 21 any predetermined keyword whose trend has been determined to have ended, maintenance of the predetermined keyword DB 21 can be performed automatically.
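The error-rate calculation of S33 and the threshold check and deletion of S34 and S35 can be sketched as follows. This Python sketch rests on assumptions not stated in the specification: each record of the correctness determination information DB 324 is reduced to a tuple of keyword, blog ID, and a boolean that is true when the administrator judged the machine-learning result correct; the records are assumed to be already limited to the predetermined period; and a keyword with no judgments in the period is kept. The 70% default is the example value from the description, and all names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Judgement = Tuple[str, str, bool]  # (keyword, blog ID, True if the ML result was correct)


def error_rates(judgements: List[Judgement]) -> Dict[str, float]:
    """S33: per-keyword error rate = incorrect judgments / all judgments."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for keyword, _blog_id, correct in judgements:
        totals[keyword] += 1
        if not correct:
            errors[keyword] += 1
    return {kw: errors[kw] / totals[kw] for kw in totals}


def delete_by_error_rate(keyword_db: Set[str], judgements: List[Judgement],
                         threshold: float = 0.70) -> List[str]:
    """S34/S35: delete keywords whose error rate is at or above the threshold."""
    rates = error_rates(judgements)
    # Keywords without any judgment default to 0.0 and are therefore kept.
    removed = [kw for kw in sorted(keyword_db) if rates.get(kw, 0.0) >= threshold]
    keyword_db.difference_update(removed)
    return removed


# Usage with made-up judgments: "a" is wrong 3 times out of 4, "b" once out of 4.
judgements = [("a", "1", False), ("a", "2", False), ("a", "3", False), ("a", "4", True),
              ("b", "5", False), ("b", "6", True), ("b", "7", True), ("b", "8", True)]
db = {"a", "b", "c"}
removed = delete_by_error_rate(db, judgements)
```

Because the administrator may sample only some of the output results, the rate is computed over the judgments actually entered rather than over all determined articles.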

（変形形態） (Modifications)
第１実施形態は、所定キーワードの所定キーワードＤＢからの削除を管理端末からの入力に応じて行い、第２及び第３実施形態では、所定キーワードの所定キーワードＤＢからの削除をスパムブログ判定装置が自動的に行うものとして示したが、これに限定されない。いずれの削除も行えるようにしてもよい。そのようにすることで、監視不要の所定キーワードを自動的に削除でき、しかも、管理者の操作によって削除できるので、便利である。 In the first embodiment, a predetermined keyword is deleted from the predetermined keyword DB in response to input from the management terminal, and in the second and third embodiments the spam blog determination apparatus deletes it automatically, but the invention is not limited to this. Both kinds of deletion may be made available. This is convenient because predetermined keywords that no longer need monitoring can be deleted automatically and can also be deleted by an operation of the administrator.

以上、本発明の実施形態について説明したが、本発明は、上述した実施形態に限るものではない。また、本発明の実施形態に記載された効果は、本発明から生じる最も好適な効果を列挙したに過ぎず、本発明による効果は、本発明の実施形態に記載されたものに限定されるものではない。 Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

１，２０１，３０１ スパムブログ判定装置 1, 201, 301 Spam blog determination apparatus
３ ブログサーバ 3 Blog server
５ 管理端末 5 Management terminal
１０，２１０，３１０ 制御部 10, 210, 310 Control unit
１１ 所定キーワード受付手段 11 Predetermined keyword receiving means
１２ 所定キーワード記憶制御手段 12 Predetermined keyword storage control means
１３ ブログ記事受付手段 13 Blog article receiving means
１４ 機械学習手段 14 Machine learning means
１５ スパム判定結果出力手段 15 Spam determination result output means
１７，２１７ 調整戻し手段 17, 217 Adjustment return means
２０，３２０ 記憶部 20, 320 Storage unit
２１ 所定キーワードＤＢ 21 Predetermined keyword DB
２２ ＳＶＭエンジン 22 SVM engine
２３ 判定結果テーブル 23 Determination result table
１００，２００，３００ スパムブログ判定システム 100, 200, 300 Spam blog determination system
２１６ 第２流行終了判定手段 216 Second trend end determination means
３１８ 正誤判断情報記憶制御手段 318 Correctness determination information storage control means
３１９ 第１流行終了判定手段 319 First trend end determination means
３２４ 正誤判断情報ＤＢ 324 Correctness determination information DB

Claims

Predetermined keyword storage control means for storing a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation of the predetermined keyword;
Machine learning means for determining, in response to receiving a blog article to be determined, whether or not the blog article is a spam blog by machine learning using the predetermined keyword stored in the predetermined keyword storage means as a feature;
Spam determination result output means for outputting, among the blog articles determined by the machine learning means, a blog article including the predetermined keyword stored in the predetermined keyword storage means in association with the result of the determination by the machine learning as to whether or not it is a spam blog;
Adjustment return means for deleting the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation of the predetermined keyword;
Correctness determination information storage means for storing, in response to receiving correctness determination information, which is information indicating whether the result of the determination by the machine learning is correct, for a blog article for which the spam determination result output means has output the result of the determination, the correctness determination information in association with the blog article; and
First trend end determination means for calculating, for each predetermined keyword, an error rate of the determination by the machine learning based on the correctness determination information stored in the correctness determination information storage means in a predetermined period, and for determining that the trend has ended for the predetermined keyword when the error rate is equal to or greater than a predetermined threshold;
A spam blog determination apparatus comprising the above means.

Predetermined keyword storage control means for storing a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation of the predetermined keyword;
Machine learning means for determining, in response to receiving a blog article to be determined, whether or not the blog article is a spam blog by machine learning using the predetermined keyword stored in the predetermined keyword storage means as a feature;
Spam determination result output means for outputting, among the blog articles determined by the machine learning means, a blog article including the predetermined keyword stored in the predetermined keyword storage means in association with the result of the determination by the machine learning as to whether or not it is a spam blog;
Adjustment return means for deleting the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation of the predetermined keyword; and
Second trend end determination means for determining that the trend has ended for the predetermined keyword when, among the blog articles to be determined that include the predetermined keyword in a predetermined period, the ratio of blog articles that the machine learning means determined to be spam blogs is equal to or less than a predetermined threshold;
A spam blog determination apparatus comprising the above means.

The adjustment return means deletes, from the predetermined keyword storage means, the predetermined keyword for which the first trend end determination means or the second trend end determination means has determined that the trend has ended;
The spam blog determination apparatus according to claim 1 or 2.

A predetermined keyword storage step in which a computer stores a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation of the predetermined keyword;
A machine learning step in which the computer, in response to receiving a blog article to be determined, determines whether or not the blog article is a spam blog by machine learning using the predetermined keyword stored in the predetermined keyword storage means as a feature;
A spam determination result output step in which the computer outputs, among the blog articles determined as to whether or not they are spam blogs, a blog article including the predetermined keyword stored in the predetermined keyword storage means in association with the result of the determination by the machine learning as to whether or not it is a spam blog;
An adjustment return step in which the computer deletes the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation of the predetermined keyword;
A correctness determination information storing step in which the computer, in response to receiving correctness determination information, which is information indicating whether the result of the determination by the machine learning is correct, for a blog article for which the result of the determination was output in the spam determination result output step, stores the correctness determination information in association with the blog article; and
A first trend end determination step in which the computer calculates, for each predetermined keyword, an error rate of the determination by the machine learning based on the correctness determination information stored in the correctness determination information storing step in a predetermined period, and determines that the trend has ended for the predetermined keyword when the error rate is equal to or greater than a predetermined threshold;
A spam blog determination method including the above steps.

A predetermined keyword storage step in which a computer stores a predetermined keyword in predetermined keyword storage means in response to receiving a registration designation of the predetermined keyword;
A machine learning step in which the computer, in response to receiving a blog article to be determined, determines whether or not the blog article is a spam blog by machine learning using the predetermined keyword stored in the predetermined keyword storage means as a feature;
A spam determination result output step in which the computer outputs, among the blog articles determined as to whether or not they are spam blogs, a blog article including the predetermined keyword stored in the predetermined keyword storage means in association with the result of the determination by the machine learning as to whether or not it is a spam blog;
An adjustment return step in which the computer deletes the predetermined keyword stored in the predetermined keyword storage means in response to receiving a deletion designation of the predetermined keyword; and
A second trend end determination step in which the computer determines that the trend has ended for the predetermined keyword when, among the blog articles to be determined that include the predetermined keyword in a predetermined period, the ratio of blog articles determined to be spam blogs in the machine learning step is equal to or less than a predetermined threshold;
A spam blog determination method including the above steps.