JP2009289246A

JP2009289246A - Message determining apparatus, method and program

Info

Publication number: JP2009289246A
Application number: JP2008163805A
Authority: JP
Inventors: Tetsuya Mizukami; 哲也水上; Hisao Haraguchi; 寿夫原口; Iori Nishida; 衣織西田
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-05-27
Filing date: 2008-05-27
Publication date: 2009-12-10
Anticipated expiration: 2028-05-27
Also published as: JP4926130B2

Abstract

PROBLEM TO BE SOLVED: To improve determining accuracy in determining, using a word dictionary or the like, the harmful effect of a web page of a URL in a message posted to a blog or the like by employing different harmful effect determining means depending on the capacity of a page description file such as HTML.

SOLUTION: A plurality of harmful effect determining means 31, 32, 33 for determining the harmful effect of a web page are used. Based on one or more reference values, a capacity determining means 45 determines the range of the data capacity of a web page description file (HTML file or the like) of the web page read by a page reading means 20 (capacity determining process). Depending on the determined range of the data capacity, a selection means 47 selects one or more of a plurality of harmful effect determining means 31, 32, 33 to determine the harmful effect of the web page (selection process).

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to an improvement in technology for determining the harmfulness of a web page corresponding to a URL included in a message posted to a blog or the like.

In recent years, along with the Internet itself, message services delivered over the Internet have rapidly spread and diversified. A message service lets users exchange messages, mainly text, over a communication network such as the Internet or a mobile phone network; examples include electronic bulletin boards, blogs, and e-mail.

In such message services, so-called spam is rampant: messages containing URLs regarded as harmful, such as adult sites, are posted in the guise of comments or trackbacks on blog articles, or of new posts (parent articles) and responses (child articles) on electronic bulletin boards. Techniques for detecting such spam and taking countermeasures have been proposed.

As one example, in the technique of Patent Document 1 by the present applicant, even if a URL in a message is not registered in a harmful URL list prepared in advance, the message is discarded when the web page that the URL points to contains a word recorded in a harmfulness determination dictionary, that is, a prohibited word list.

Patent Document 1: JP 2007-265368 A

However, wording varies with the purpose of a web page and the type of terminal it targets. For example, an expression such as "ヘア無修正" on a page for personal computers tends to be shortened to "無修正" ("uncensored") on a page for mobile phone terminals, and the same shortening is seen even on personal computer pages that are image-centric and contain little text.

In addition, when a web page is judged harmful or not based on the number of matches with words recorded in the harmfulness determination dictionary, the appropriate number of matches differs if the amount of information on the page differs greatly. For example, on an electronic bulletin board (BBS) that displays a few dozen messages per screen, whether the page contains several suspicious words can serve as a criterion. Applying the same criterion to a page containing close to a thousand messages per screen would flag too many pages as harmful, resulting in over-regulation and possibly reduced practicality.

As described above, it is difficult to judge a large number of web pages with such diverse aspects appropriately using a single harmfulness determination dictionary or a uniform criterion, as has been done conventionally, and a technique for improving determination accuracy has been sought.

The present invention solves the above problems of the prior art. Its object is to improve determination accuracy, when the harmfulness of the web page behind a URL in a message posted to a blog or the like is judged with a word dictionary or the like, by applying different harmfulness determination means depending on the data capacity of the page description file, such as HTML.

According to one aspect of the present invention, a message determination apparatus realizes, by a computer: a plurality of harmfulness determination means for determining the harmfulness of a web page; page reading means for reading, via a communication network, the web page corresponding to a URL included in a message of a message service; capacity determination means for determining, based on one or more reference values, the data capacity range of the web page description file of the read web page; and selection means for selecting one or more of the plurality of harmfulness determination means according to the determined data capacity range and causing the selected means to determine the harmfulness of the web page.

By changing which of the plurality of harmfulness determination means is applied according to the data capacity of the HTML or similar file, highly accurate harmfulness determination becomes possible that suits the use and length of the web page, for example whether it targets personal computers or mobile phone terminals.
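The capacity determination and selection steps can be sketched as follows. This is only an illustration under stated assumptions: the source gives no concrete reference values, so the byte thresholds, the mapping from range to means, and the names `REFERENCE_VALUES`, `SELECTION`, `capacity_range`, and `select_means` are all hypothetical.

```python
from bisect import bisect_right

# Assumed reference values in bytes (the source does not state concrete numbers).
REFERENCE_VALUES = [10_000, 100_000]

# Assumed mapping: data capacity range index -> determination means to apply.
# The labels echo reference numerals 31-33, but the assignment is illustrative.
SELECTION = {
    0: ["means_31"],              # small pages, e.g. mobile-oriented
    1: ["means_32"],              # medium pages
    2: ["means_32", "means_33"],  # large pages, two means combined
}


def capacity_range(size_bytes: int) -> int:
    """Return the index of the range that size_bytes falls into.

    A size equal to a reference value is placed in the upper range.
    """
    return bisect_right(REFERENCE_VALUES, size_bytes)


def select_means(page_source: str) -> list:
    """Pick determination means from the size of the page description file."""
    size = len(page_source.encode("utf-8"))
    return SELECTION[capacity_range(size)]
```

With one reference value there are two ranges, with two there are three, and so on; `bisect_right` keeps that lookup independent of how many reference values are configured.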

According to another aspect of the present invention, in any of the above aspects, at least some of the plurality of harmfulness determination means each have a harmfulness determination dictionary that corresponds to a data capacity range and records a group of harmful words and phrases different from those of the others, and each collates the content of the web page description file corresponding to the URL with that dictionary and determines whether the page is harmful based on matches with the recorded words.

Thus, by adopting the simple technique of collation with the words recorded in a harmfulness determination dictionary, and by switching or combining the dictionaries used, the content of the harmfulness determination can be changed easily and reliably according to the data capacity of the web page being judged.

According to another aspect of the present invention, in any of the above aspects, the selection means is configured to select, according to the determined data capacity range, one or more of a plurality of harmfulness determination criteria, which include the number of words matching the words recorded in the harmfulness determination dictionary, and to apply the selected criteria through a logical AND or OR operation.

Thus, by also switching harmfulness determination criteria such as word counts according to the data capacity of the HTML file, and by applying multiple criteria in an AND or OR relationship, determination accuracy can be improved further to suit the target and circumstances.

According to another aspect of the present invention, in any of the above aspects, the harmfulness determination means having the harmfulness determination dictionary has, as that dictionary, a black word dictionary and a gray word dictionary corresponding to different degrees of harmfulness, and uses, as the predetermined harmfulness determination criterion for each data capacity range, one or both of the number of matching words in the black word dictionary and the number of matching words in the gray word dictionary.

Thus, determination accuracy is improved further by defining the matching word counts used as harmfulness determination criteria separately for the black word dictionary and the gray word dictionary, which reflect degrees of harmfulness, and in combination with the data capacity range of the page.
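A minimal sketch of per-range criteria that combine black word and gray word match counts with AND or OR might look like this. The dictionary contents, the thresholds, the choice to count distinct dictionary words by substring match, and the names `Criterion`, `CRITERIA_BY_RANGE`, `count_matches`, and `is_harmful` are all assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass

# Placeholder dictionaries; real entries are not given in the source.
BLACK_WORDS = {"blackword1", "blackword2"}
GRAY_WORDS = {"grayword1", "grayword2", "grayword3"}


@dataclass(frozen=True)
class Criterion:
    min_black: int  # required matches in the black word dictionary
    min_gray: int   # required matches in the gray word dictionary
    combine: str    # "AND" or "OR"


# Assumed criteria per data capacity range (index 0 = smallest pages).
CRITERIA_BY_RANGE = {
    0: Criterion(min_black=1, min_gray=2, combine="OR"),
    1: Criterion(min_black=2, min_gray=2, combine="AND"),
}


def count_matches(text: str, words: set) -> int:
    """Count how many distinct dictionary words occur in the text."""
    return sum(1 for word in words if word in text)


def is_harmful(text: str, range_index: int) -> bool:
    """Apply the criterion assigned to the page's data capacity range."""
    criterion = CRITERIA_BY_RANGE[range_index]
    black_ok = count_matches(text, BLACK_WORDS) >= criterion.min_black
    gray_ok = count_matches(text, GRAY_WORDS) >= criterion.min_gray
    if criterion.combine == "AND":
        return black_ok and gray_ok
    return black_ok or gray_ok
```

In this sketch a short page is flagged by a single black word, while a longer page needs both dictionaries to agree, which mirrors the over-regulation concern raised for large pages.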

As another preferred additional aspect that can be combined with each of the above aspects, an example is shown in which, when a URL in a post to a blog or the like is unknown because it is not in the harmful URL list, its harmfulness is also checked with the trailing hierarchy levels of the URL removed. This detects, with good accuracy, harmful URLs disguised by appending a meaningless character string to the end, while avoiding over-regulation of short URLs.

That is, according to one aspect of the present invention, a message determination apparatus realizes, by a computer: a harmful URL list recording the URLs of web pages regarded as harmful; URL determination means for collating a URL included in a message input to a message service with the harmful URL list and determining it to be harmful if a match exists; page reading means for treating a URL that the URL determination means did not determine to be harmful as an unknown URL and reading, via a communication network, the web page corresponding to the unknown URL; harmfulness determination means for determining whether the read web page is harmful; tail removal means for removing, from the unknown URL corresponding to a web page that the harmfulness determination means determined to be harmful, one or more hierarchy levels of the trailing character string delimited by a predetermined hierarchy delimiter; and tail determination registration means for causing the page reading and the harmfulness determination to be performed also on the unknown URL with its tail removed and, if it is determined to be harmful, registering that shortened unknown URL in the harmful URL list.

Thus, when a URL in a message is unknown because it is not in the harmful URL list and its page is read and determined to be harmful, the URL obtained by cutting the lower levels at the end, delimited by "/", "?", and the like, is also judged. This makes it possible to detect short URLs disguised by appended lower levels and to register them as harmful URLs, effectively improving determination accuracy.

The present invention can be realized by a dedicated electronic circuit, or by having a predetermined computer program cause the arithmetic control unit of a computer with hardware such as memory and input/output means to execute the processing steps corresponding to each of the above means. A method having such processing (for example, a message determination method) and such a computer program (a message determination program) are also aspects of the present invention, in line with the aspects described above and below.

As another aspect of the present invention, in any of the above aspects, the harmfulness determination means may be provided with a harmfulness determination dictionary that stores harmful words as recorded words, and may collate the content of the web page description file read from the web page corresponding to the URL with that dictionary and determine whether the page is harmful based on matches with the recorded words. With this configuration, harmfulness determination can be carried out easily by the simple technique of collation with dictionary words. Removing the tail for a single level is likely to be effective in many cases, but when it is performed over multiple levels, the following processing procedure is particularly suitable.

According to another aspect of the present invention, in any of the above aspects, for the unknown URL: (1) the web page that the URL points to is read by the page reading means; (2) the read web page is judged harmful or harmless by the harmfulness determination means; (3) if it is judged harmful, the tail removal means temporarily stores the URL in a predetermined storage area, cuts the character string of one hierarchy level from the last predetermined hierarchy delimiter onward to leave the URL of the remaining upper levels, and has the processing from (1) performed again on the shortened URL; and (4) if it is judged harmless and a URL judged harmful is temporarily stored, the tail determination registration means registers the temporarily stored URL in the harmful URL list.

Thus, a simple recursive process, in which a URL judged harmful is rechecked repeatedly with its tail removed and, once the result is no longer harmful, the last harmful URL is registered in the harmful URL list, can identify and register the shortest URL that redirects to a harmful page. Highly accurate determination is therefore achieved no matter how many levels of disguising redundant character strings are appended.

According to another aspect of the present invention, in any of the above aspects, the computer realizes API delivery means (an API interface unit) that accepts calls to an API (Application Program Interface) whose parameters include an input message, from information processing processes or systems that each implement a service and act as callers, and that returns the result of the harmfulness determination to the caller as the return value of the API call.

By making the function callable from various services via the API in this way, more determination cases become available as a basis, so the accumulated information on harmful URLs becomes richer and determination accuracy can be improved further. Highly accurate determination based on the accumulated information can in turn be used widely by more services.

As described above, according to the present invention, when the harmfulness of the web page behind a URL in a message posted to a blog or the like is judged with a word dictionary or the like, determination accuracy can be improved by applying different harmfulness determination means depending on the data capacity of the page description file, such as HTML.

Next, the best mode for carrying out the present invention (hereinafter, the "embodiment") is described with reference to the drawings. Premises shared with what has already been described under the background art, the problems, and so on are omitted as appropriate.

[1. Configuration]
As shown in the functional block diagram of FIG. 1, this embodiment relates to a message determination apparatus 2 (hereinafter, "the present apparatus") that provides functional units or servers (1 and others) implementing message services such as blogs and electronic bulletin boards with a function for determining the harmfulness of messages input by posters A1, A2, and so on. It can also be understood as a method and a computer program corresponding to the present apparatus.

The present apparatus 2 can provide its message harmfulness determination function through two routes; an implementation need not support both at the same time. In the first, a harmfulness determination request is accepted directly from the blog functional unit or server 1 by the URL determination means 15, and the result is output from dedicated harmful notification means 65 to the blog administrator B or the like, for example by e-mail, in the form of a request to confirm a harmful post or a recommendation to delete it.

In the second, determination requests and result responses are exchanged with the processes or systems (functional units, servers, and so on) of multiple services through a predetermined standardized API; this is described later.

Although not illustrated, the present apparatus 2 is realized as a server computer or the like by installing a predetermined computer program in advance on a computer having hardware such as an arithmetic control unit (a CPU or the like), storage devices (main memory and external storage such as an HDD), and input/output means, and it has the elements 10 to 80 shown in FIG. 1.

Among the elements 10 to 80, information groups such as lists and dictionaries are realized by storing the corresponding information on the storage device as files or the like. Each of the other means is realized by the computer program causing the arithmetic control unit to execute the information processing steps corresponding to that means. Some or all of the elements may, however, be realized by electronic circuits such as wired logic, depending on technical conditions and the mode of implementation.

The features of this embodiment configured as above fall into two broad outlines. The first is handling of tail disguise in short URLs; the second is selective use of harmfulness determination means according to the data capacity of the web page.

[2. First outline]
First, the first outline is described. In the present apparatus 2, the harmful URL list 10 records the URLs of web pages already known and regarded as harmful. The URL determination means 15 collates each URL contained in a message (a comment, post, mail, or the like) input to a message service (blog, electronic bulletin board, e-mail, messenger, or the like) with the harmful URL list 10, and determines it to be harmful if there is a match; "input" here covers arrival and reception as well as new posting and sending. From the viewpoint of the method, this processing is called "URL determination processing"; the same naming applies below.

The page reading means 20 treats a URL that the URL determination means 15 did not determine to be harmful as an unknown URL and reads (crawls) the corresponding web page via a communication network such as the Internet or a mobile phone network (page reading processing). The harmfulness determination means 30 then determines whether the web page read in this way is harmful (harmfulness determination processing).

For an unknown URL corresponding to a web page that the harmfulness determination means 30 determined to be harmful, the tail removal means 50 removes one or more hierarchy levels of the trailing character string delimited by a predetermined hierarchy delimiter ("/", "?", and the like) (tail removal processing). For the unknown URL with its tail removed, the tail determination registration means 60 has the page reading means 20 and the harmfulness determination means 30 perform the page reading and harmfulness determination described above and, if the result is harmful, registers the shortened unknown URL in the harmful URL list 10 (tail determination registration processing).

To give a simple example, suppose the unknown URL "short○○.com/aa/xx1" is determined to be harmful. The part "/xx1", from the first hierarchy delimiter "/" counting from the end, is removed, and if the shortened URL "short○○.com/aa" is also determined to be harmful, it is registered as a harmful URL in this shortened form. From then on, a URL that joins another disguising part to it, such as "short○○.com/aa/yy2", can be determined to be harmful because it matches the "short○○.com/aa" part already registered in the harmful URL list.
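The matching against a registered shortened URL can be sketched as a prefix comparison. This is an illustration, not the patent's stated implementation: the function name `matches_harmful_list` is hypothetical, the host `short.example` stands in for the masked host in the text, and requiring the registered URL to end at a "/" or "?" boundary is an added assumption to avoid matching unrelated paths.

```python
def matches_harmful_list(url: str, harmful_urls: set) -> bool:
    """Return True if url equals a registered URL or extends it at a delimiter."""
    for registered in harmful_urls:
        if url == registered:
            return True
        # startswith plus inequality guarantees url is longer than registered,
        # so indexing the next character is safe.
        if url.startswith(registered) and url[len(registered)] in "/?":
            return True
    return False


harmful_urls = {"short.example/aa"}
print(matches_harmful_list("short.example/aa/yy2", harmful_urls))
print(matches_harmful_list("short.example/aab", harmful_urls))
```

The first call matches because the disguising part "/yy2" follows the registered prefix; the second does not, because "aab" is a different path segment rather than an appended level.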

In the present invention and this embodiment, the harmfulness determination means can be realized by machine learning such as an SVM (support vector machine). Alternatively, it may be provided with a harmfulness determination dictionary storing harmful words as recorded words, and configured to collate the content of the web page description file corresponding to the URL with the dictionary and determine whether the page is harmful based on the presence or number of matches with the recorded words; harmfulness determination can then be carried out easily by the simple technique of dictionary collation.

Removing the tail for a single level is likely to be effective in many cases, but when it is performed over multiple levels, the following recursive processing procedure is particularly suitable.

[3. Recursive processing procedure]
In this recursive processing procedure, under the control of control means 80 such as a control routine, the following is performed for an unknown URL:
(1) The page reading means 20 reads the web page that the URL points to.
(2) The harmfulness determination means 30 judges whether the read web page is harmful or harmless.
(3) If it is judged harmful, the tail removal means 50 temporarily stores the URL in a predetermined storage area, cuts the character string of one hierarchy level from the last predetermined hierarchy delimiter onward to leave the URL of the remaining upper levels, and the processing from (1) is performed again on the shortened URL.
(4) If it is judged harmless and a URL judged harmful is temporarily stored, the tail determination registration means 60 registers the temporarily stored URL in the harmful URL list 10.

As an example, as shown in FIG. 2, when the URL in state 1 (short○○.com/aa/bb/cc?=1234) is determined to be harmful, the part "?=1234", from the first hierarchy delimiter "?" counting from the end, is removed to give state 2. If that is also harmful, the part "/cc" from the next hierarchy delimiter "/" is removed in the same way to give state 3, and then state 4 is judged likewise. If state 4 is the first to be determined harmless, the URL of state 3, "short○○.com/aa/bb", which was the last harmful one, is registered as a harmful URL.
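The sequence of states in this example can be reproduced with a small tail-stripping helper. The sketch assumes scheme-less URLs as written in the text (a leading "http://" would need to be set aside first), treats only "/" and "?" as hierarchy delimiters, and uses the stand-in host `short.example`; the names `strip_tail` and `states` are hypothetical.

```python
DELIMITERS = "/?"  # hierarchy delimiters named in the text


def strip_tail(url: str):
    """Cut the last delimiter and everything after it.

    Returns None when no delimiter remains, i.e. only the domain is left.
    """
    cut = max(url.rfind(delimiter) for delimiter in DELIMITERS)
    if cut < 0:
        return None
    return url[:cut]


def states(url: str) -> list:
    """List the URL followed by each successively shortened form."""
    result = [url]
    while (url := strip_tail(url)) is not None:
        result.append(url)
    return result


for number, state in enumerate(states("short.example/aa/bb/cc?=1234"), start=1):
    print(number, state)
```

States 1 to 4 printed here correspond to the four states walked through above; the fifth line is the bare domain, which is only reached if every longer form is still judged harmful.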

[4. Example of overall processing procedure]
The flowchart of FIG. 3 shows the processing procedure when the first outline is realized by the recursive processing procedure described above. In this example, when a comment posted to a blog or the like is checked, response speed is prioritized and only collation with the harmful URL list is performed (FIG. 3(1)). Unknown URLs that did not match the harmful URL list, and were therefore answered as harmless, are checked separately in bulk, for example in a batch (FIG. 3(2)).

The URL determination means 15 answers that a message such as a comment that needs checking is harmless (step S56) if it contains no URL (step S51). If it contains a URL (step S51), the URL is collated with the harmful URL list 10 (step S52), and if it matches, in the sense that it contains any harmful URL in the list (step S53), the means answers that it is harmful (step S54). If no harmful URL in the harmful URL list 10 matches (step S53), the URL is stored as an unknown URL to be checked (step S55).
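The quick check of steps S51 to S56 could be sketched as below. Several details are assumptions rather than part of the source: URLs are extracted with a simple `https?://` pattern, "match" is implemented as substring containment, a message with several URLs is harmful if any one matches, and the names `URL_PATTERN` and `check_message` and the `.example` hosts are hypothetical.

```python
import re

# Simplistic URL extraction, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")


def check_message(message: str, harmful_urls: set, unknown_urls: list) -> str:
    """Answer 'harmful' or 'harmless' and queue unmatched URLs for later checks."""
    urls = URL_PATTERN.findall(message)
    if not urls:                                   # S51: no URL in the message
        return "harmless"                          # S56
    verdict = "harmless"
    for url in urls:
        # S52-S53: match if the URL contains any listed harmful URL
        if any(listed in url for listed in harmful_urls):
            verdict = "harmful"                    # S54
        else:
            unknown_urls.append(url)               # S55: check later in batch
    return verdict


queue = []
listed = {"bad.example/aa"}
print(check_message("see http://bad.example/aa/x1", listed, queue))
print(check_message("see http://new.example/p", listed, queue), queue)
```

The second call is answered as harmless right away, and its URL lands in the queue that the later batch step works through.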

Later, in batch processing at a predetermined interval (anywhere from every few tens of minutes to nightly) (FIG. 3(2)), for each URL stored for checking as above, the page reading means 20 actually accesses and reads the web page it points to (step S61), and the harmfulness determination means 30 determines, using a harmfulness determination dictionary or the like, whether it is a harmful URL (step S62).

If the result is harmful (step S63) and a hierarchy delimiter such as "/" or "?" still remains on the tail side of the top domain part of the current URL (step S64), the tail removal means 50 deletes the part of the current URL from the last delimiter onward (step S66), and the processing is repeated from page reading (step S61). Before this deletion (step S66), the URL as it was immediately before deletion is temporarily stored in a predetermined storage area (step S65).

If a URL determined to be harmful (step S63) has no delimiter left (step S64), the top domain itself is a harmful URL. The tail determination registration means 60 therefore adds the current URL, that is, the top domain, to the harmful URL list 10 (step S67), ends processing for that URL, and moves on to the next unknown URL.

If the harmful determination (step S62) finds the page harmless (step S63) and a URL from immediately before deletion is temporarily stored (step S68), then the preceding URL was harmful (step S63) and removing its tail (step S66) has reached a level that is not harmful, such as the top domain of a short URL service. The tail determination registration means 60 therefore adds the preceding URL, which was harmful, to the harmful URL list 10 (step S69), ends processing for that URL, and moves on to the next unknown URL.

If the URL one level up, obtained by removing the tail, cannot be accessed, that is, reading it ends in an error, it is treated as not being a harmful URL (step S63), and the URL before removal is registered as a harmful URL (step S69).
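The loop of steps S61 to S69 can be sketched roughly as follows. This is only an illustration under stated assumptions: `strip_tail`, `find_registration_target`, and the `is_harmful` callable (standing in for page reading plus harmful determination) are hypothetical names, a read error is modeled as `OSError`, and the sketch drops the delimiter itself so that each pass shortens the URL.

```python
def strip_tail(url):
    """Remove the last '/' or '?' after the host and everything following it.

    Returns None when no delimiter is left, i.e. only the top domain remains.
    Dropping the delimiter itself is an assumption made so the loop always
    makes progress.
    """
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    pos = max(url.rfind("/", start), url.rfind("?", start))
    return None if pos == -1 else url[:pos]


def find_registration_target(url, is_harmful):
    """Return the URL to add to the harmful URL list, or None if none applies.

    `is_harmful(url)` is a hypothetical callable that reads the page and
    judges it (steps S61, S62); it may raise OSError when the page cannot
    be read.
    """
    previous = None                      # temporarily stored URL (step S65)
    current = url
    while True:
        try:
            harmful = is_harmful(current)
        except OSError:
            harmful = False              # unreadable page: treated as not harmful
        if not harmful:                  # step S63, harmless branch
            return previous              # steps S68, S69 (None if nothing was stored)
        parent = strip_tail(current)     # step S64
        if parent is None:
            return current               # step S67: the top domain itself is harmful
        previous = current               # step S65
        current = parent                 # step S66
```

With a short URL service whose top page is harmless, the function returns the deepest harmful level that was reached before the harmless one, which is the URL the text registers in step S69.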

[5. Other examples of processing procedures]
The flowchart of FIG. 4 shows an example of a processing procedure that associates the redirect source and redirect destination of a short URL and limits tail removal to one level, so that harmful registration is performed efficiently.

In this example, whether a URL to be checked is a redirect URL is determined from the return code at read time or the like (step S21). If it is not a redirect URL (step S21), the page reading means 20 reads the page, and if the harmful determination means 30 determines it to be harmful (step S22), the tail determination registration means 60 registers that URL in the harmful URL list 10 (step S23).

If, on the other hand, the URL to be checked is a redirect URL (step S21), the URL determination means 15 checks it against the harmful URL list 10. If the result is that it is not registered (step S24), and the page reading means 20 reads the page and the harmful determination means 30 determines it to be harmful (step S22), the tail determination registration means 60 registers the redirect source URL and the redirect destination URL in the harmful URL list 10 (step S29). If the URL redirects more than once, the very first redirect source URL and the final redirect destination URL are registered.

In addition, if a URL one level above the redirect source URL (called the "upper-level URL") exists (step S30), the tail determination registration means 60 has the page reading means 20 and the harmful determination means 30 read the page for that upper-level URL and perform harmful determination. If it too is a harmful site, the upper-level URL is registered in place of the original URL (step S27), and processing proceeds to the next URL (from step S21).

In this example, the URL "one level up" is obtained by finding the rightmost "/" or "?" symbol in the URL: if it is "/", everything to its right is deleted; if it is "?", that symbol and everything to its right are deleted. The kinds of hierarchy delimiters and how they are handled can be changed as appropriate.
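The "one level up" rule can be written out as a short function. This is a minimal sketch: the name `one_level_up` is hypothetical, and skipping the `//` of the scheme when searching for delimiters is an assumption made so that `http://` itself is never treated as a hierarchy level.

```python
def one_level_up(url):
    """Apply the rule from the text to get the URL one level up.

    Rightmost '/': keep the slash, delete what is to its right.
    Rightmost '?': delete the '?' and what is to its right.
    Returns None when no delimiter is found after the host.
    """
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    pos = max(url.rfind("/", start), url.rfind("?", start))
    if pos == -1:
        return None
    if url[pos] == "/":
        return url[:pos + 1]
    return url[:pos]
```

Under this literal reading, a URL that already ends in "/" comes back unchanged, which is harmless here because the procedure of FIG. 4 only ever goes up one level.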

For a redirect URL (step S21), if the check by the URL determination means 15 against the harmful URL list 10 shows that the redirect destination URL is a harmful site, that is, a harmful URL (step S24), the tail determination registration means 60 registers the redirect source URL in the harmful URL list 10 (step S25) and then performs the same processing as above (step S27) for the URL one level up (step S26).

In this way, depending on the registration status and harmfulness of the redirect destination URL, either the redirect source alone or both the redirect source and redirect destination URLs are registered as harmful in the harmful URL list 10, and tail removal is limited to one level above the redirect source URL. This makes processing more efficient and reduces the processing load.

[6. Second outline]
The second outline is to use, when determining the harmfulness of the web page corresponding to each URL, harmful determination means suited to the data capacity of that web page.

That is, a plurality of harmful determination means 31, 32, 33 (FIG. 1) for determining the harmfulness of a web page are used. The capacity determination means 45 determines, based on one or more reference values, the data capacity range of the web page description file (an HTML file or the like) of the web page read by the page reading means 20 (capacity determination process). According to the data capacity range so determined, the selection means 47 selects one or more of the harmful determination means 31, 32, 33 and has them determine the harmfulness of the web page (selection process).

[7. Diversification of dictionaries and criteria]
The plurality of harmful determination means used as described above may include ones based on machine learning, such as the SVM already mentioned (for example, harmful determination means 33). As shown in FIG. 1, however, at least some of the harmful determination means (for example, 31 and 32) can be provided with harmful determination dictionaries D1 and D2 that correspond to data capacity ranges and record mutually different groups of harmful words and phrases. These means can be configured to check the content of the web page description file corresponding to a URL against harmful determination dictionary D1 or D2 and to determine whether it is harmful based on matches with the recorded words.

Adopting the simple technique of matching against the recorded words of a harmful determination dictionary, and varying the dictionaries used by switching or combining them, is not essential and may be omitted. Adopting it, however, allows the content of harmful determination to be changed easily and reliably according to the data capacity of the web page under determination.

[8. Combination of determination criteria]
The plurality of harmful determination means described above are not limited to the selection of exactly one. How they are combined can be decided freely: for example, using a plurality of harmful determination means or harmful determination criteria, a URL may be deemed harmful only when both determine it to be harmful (AND condition), or deemed harmful when either one determines it to be harmful (OR condition).

Furthermore, what is varied according to data capacity is not limited to the choice among harmful determination means and harmful determination dictionaries. The determination criteria, such as how many words must hit per web page for the page to be determined harmful, may also be applied either by selecting from among several or by combining them with logical operations such as AND and OR.

In this case, the selection means 47 selects, according to the determined data capacity range, one or more of a plurality of harmful determination criteria including the number of words matching the recorded words of the harmful determination dictionary, and applies them by an AND or OR logical operation. By switching harmful determination criteria such as word counts according to the HTML file data capacity, and by applying a plurality of criteria in an AND or OR relationship, harmful determination accuracy can be further improved to suit circumstances such as the target and the situation.

In a particularly desirable aspect, the harmful determination means having a harmful determination dictionary classifies the dictionary or its recorded words, according to degree of harmfulness, into black words (black phrases) and gray words (gray phrases), forming a black word dictionary (black word group) and a gray word dictionary (gray word group). As the predetermined harmful determination criterion, it uses, for each data capacity range, one or both of the number of matching words in the black word dictionary and the number of matching words in the gray word dictionary.

As an example, as shown in FIG. 5 (a conceptual diagram of the data structure), the recorded words of the black word dictionary for general pages might include "素人性感" (amateur erotic), "１８歳未満の閲覧を禁じます" (viewing by persons under 18 is prohibited), "全裸露出" (full nude exposure), and "アダルトビデオ情報" (adult video information). The recorded words of the gray word dictionary for general pages might include "露出系" (exposure-type), "あなたは、１８歳以上ですか" (are you 18 or older?), "極上素人" (premium amateur), and "アダルト動画" (adult video).

The harmful determination dictionary for low-capacity pages records many abbreviated expressions, such as those written in half-width katakana. For example, the recorded words of its black word dictionary might include "露出動画" (exposure video), "１８歳以上？" (18 or older?), "極上素人" (premium amateur), and "アダルト動画" (adult video), and examples of recorded words of the low-capacity gray word dictionary are "露出" (exposure), "１８歳以上" (18 or older), "素人" (amateur), and "アダルト動画" (adult video). Although katakana is written in full-width characters in this application, for low-capacity pages katakana is matched on a half-width basis, or matched without distinguishing full-width from half-width.
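One way to match without distinguishing full-width from half-width characters is to normalize both the page text and the dictionary word before comparing. The sketch below uses Unicode NFKC normalization from the Python standard library; this technique is an assumption chosen for illustration, not something the application specifies, and the function names are hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold half-width katakana and full-width ASCII into one canonical form."""
    return unicodedata.normalize("NFKC", text)


def contains_word(page_text, word):
    """Width-insensitive substring match of a dictionary word in page text."""
    return normalize_width(word) in normalize_width(page_text)


# Half-width spelling of the katakana in "アダルト", written with escapes
# so the example does not depend on how the file is encoded or displayed.
HALF_WIDTH_ADULT = "\uff71\uff80\uff9e\uff99\uff84"
```

With this, a page that writes the katakana of "アダルト動画" in half-width characters still matches the full-width dictionary entry.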

The number of dictionaries is not limited to two, such as black and gray. For example, in addition to a black word dictionary under which one matching word means harmful, there may be a gray word dictionary under which two matching words mean harmful and another gray word dictionary under which three matching words mean harmful. If harmful determination is performed in turn against a plurality of dictionaries with different threshold word counts, and the page is determined to be harmful when the threshold is met for any one of them, finer-grained criteria can achieve better determination accuracy.
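The threshold-per-dictionary scheme, together with the AND and OR combinations described earlier, can be sketched as below. The function names are hypothetical, the dictionary words in the usage are placeholders, and counting each recorded word at most once per page is an assumption, since the text does not say whether repeated occurrences count separately.

```python
def count_hits(text, words):
    """Number of distinct recorded words that occur in the page text."""
    return sum(1 for word in words if word in text)


def is_harmful_page(text, dictionaries, mode="or"):
    """Judge a page against several (words, threshold) dictionaries.

    mode="or":  harmful if any dictionary reaches its threshold.
    mode="and": harmful only if every dictionary reaches its threshold.
    """
    results = [count_hits(text, words) >= threshold
               for words, threshold in dictionaries]
    return all(results) if mode == "and" else any(results)
```

For example, `[(black_words, 1), (gray_words_a, 2), (gray_words_b, 3)]` with the default mode reproduces the one-word, two-word, three-word arrangement described above.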

[9. Example of determination procedure according to capacity]
The flowchart of FIG. 6 illustrates an example of a processing procedure that selects harmful determination means and criteria according to data capacity as described above. This flowchart corresponds to step S62 in FIG. 3(2).

That is, the capacity determination means 45 obtains the size of the HTML file (or a web page description file in another markup language) of the web page subject to harmful determination (step S71). For a low-capacity page of 12 KB (kilobytes) or less, such as one intended for mobile terminals (step S72), harmful determination is performed using the harmful determination dictionary D1 for low-capacity pages and applying the low-capacity page criterion J1, such as one black word or two gray words (step S81).

For a general page of more than 12 KB and less than 30 KB (steps S72, S73), harmful determination is performed using the general-page harmful determination dictionary D2 and applying the general-page criterion J2, such as four gray words (step S82). For a large page of 30 KB or more (step S73), a criterion based on a small word count would lead to over-restriction, so the determination does not rely on word counts but also takes the relationship to similar pages into account, for example by using a machine-learning determiner based on an SVM classifier (step S83).
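The branching of FIG. 6 can be sketched as follows, using the example thresholds and criteria from the text (12 KB, 30 KB, one black or two gray words, four gray words). Everything else is an assumption for illustration: 1 KB is taken as 1024 bytes, the page is decoded as UTF-8, `judge_page` and `hits` are hypothetical names, and `ml_judge` is a caller-supplied stand-in for the machine-learning determiner.

```python
KB = 1024  # assumption: the text does not say whether 1 KB is 1000 or 1024 bytes


def hits(text, words):
    """Number of distinct recorded words that occur in the page text."""
    return sum(1 for word in words if word in text)


def judge_page(html, d1_black, d1_gray, d2_gray, ml_judge):
    """Pick the determination means by file size and return True if harmful.

    html      -- raw bytes of the web page description file
    d1_*      -- word lists of the low-capacity dictionary D1
    d2_gray   -- gray word list of the general-page dictionary D2
    ml_judge  -- hypothetical callable(text) -> bool for large pages
    """
    size = len(html)                                   # step S71
    text = html.decode("utf-8", errors="replace")
    if size <= 12 * KB:                                # step S72 -> S81, criterion J1
        return hits(text, d1_black) >= 1 or hits(text, d1_gray) >= 2
    if size < 30 * KB:                                 # step S73 -> S82, criterion J2
        return hits(text, d2_gray) >= 4
    return ml_judge(text)                              # step S83
```

Keeping the thresholds and criteria in one function makes the size ranges easy to audit; in practice they would more likely live in configuration so that they can be tuned per service.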

[10. Use via API]
This device can also be called from a variety of services through an API (Application Program Interface). That is, the API delivery means 70 (API interface unit) accepts API calls whose parameters include the input message from the processes or systems that implement the respective services and act as callers (such as the functional unit or server of the electronic bulletin board shown in FIG. 1), and returns the result of harmful determination to the caller as the API return value for the call (API delivery process).

In this case, the API return value may, for example, be a combination of parameters that each correspond to a type of harm, such as:
(1) Match with a harmful site URL
(2) Harmful determination: adult expressions, dating sites
(3) Harmful determination: defamation, slander, discrimination
(4) Harmful determination: affirmation or encouragement of suicide, violence, or drug abuse
(5) Harmful determination: business information that stirs up a gambling spirit
For each parameter, a value such as points based on the number of matches with the recorded words of the harmful determination dictionary for that item may be returned. With 100 points as the maximum for each parameter, one conceivable pattern is that a given web page scores 50 points on the adult expression parameter and 34 points on the business information parameter.
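A return value of this shape might look like the following sketch. The field names, the `build_report` function, and the rule of 10 points per dictionary match capped at 100 are all assumptions made for illustration; the text only says that points depend on the number of matches and that 100 is the maximum.

```python
# Hypothetical parameter names for harm types (2) to (5) in the list above.
CATEGORIES = (
    "adult_dating",
    "defamation_discrimination",
    "suicide_violence_drugs",
    "speculative_business",
)


def points(match_count, per_match=10, cap=100):
    """Convert a match count into points (assumed linear scale with a cap)."""
    return min(cap, match_count * per_match)


def build_report(url_listed, match_counts):
    """Assemble the API return value as a plain dictionary.

    url_listed   -- True if the URL matched the harmful URL list, item (1)
    match_counts -- mapping of category name to number of dictionary matches
    """
    report = {"harmful_url_match": url_listed}
    for category in CATEGORIES:
        report[category] = points(match_counts.get(category, 0))
    return report
```

Returning one score per category rather than a single verdict leaves the final decision to each calling service, which can apply its own thresholds.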

In such an implementation, the process or system of each service acts on the return value according to its own functions and policies, for example by rejecting the post or recommending deletion to the administrator.

The configuration that allows a variety of services to call the device via the API (API delivery means 70) is not essential and may be omitted. Adopting it, however, means that more determination cases are available as a basis, so the accumulated information on harmful URLs is enriched and determination accuracy can be further improved. Highly accurate determination based on the accumulated information also becomes widely available to more services.

[11. Other embodiments]
The above embodiment is merely an example, and the present invention is not limited to it; the invention also encompasses examples such as those given below and still others. For example, the message determination device may be realized by a combination of cooperating computers and servers, each responsible for functions such as the means described above.

The timing at which harmful determination is performed is a matter of implementation choice. For example, harmful determination may be skipped at the time a message is input. Instead, among the user blogs hosted by the blog functional unit or server 1, new posts on blogs that have been registered in advance, or that have an element such as a predetermined image (blog part) installed, can be crawled at a predetermined period or time to detect harmful spam and the like.

The targets of harmful determination are not limited to URLs in the message body. They also include URLs contained in input content that is exposed to viewers along with the message, such as link destination URLs displayed for a name or profile.

In the above embodiment, for an unknown URL that did not match the harmful URL list 10, no reply that it is harmful is returned on the spot, and harmful determination using a dictionary or the like is performed collectively afterwards (FIG. 3). An example in which such harmful determination is also performed at posting time is conceivable as well.

FIG. 7 shows the processing procedure in such an example. In this example, even if the check against the harmful URL list finds no match (step S53), harmful determination based on the harmful determination dictionary is performed on the spot (step S62) while removing the tail as in FIG. 3(2) (step S65). If the URL can be determined to be harmful (step S63), it is additionally registered in the harmful URL list (steps S67, S69) and a reply that it is harmful is also returned (step S54).

As for how the harmful determination result is reflected in the service, one conceivable operation is to reflect the posted content in the blog immediately and, for a message determined to be harmful, to notify the administrator of the blog by e-mail or the like and recommend deletion. The invention is not limited to this: if a message can be determined to be harmful, whether before or after the post is reflected, the post may be rejected or the message deleted without waiting for the administrator to act.

For example, in the example shown in the conceptual diagram of FIG. 8, when a message is posted (step S11), the blog functional unit or server 1 sends a check request to a message determination device 2 like the one shown in FIG. 1 (step S13). On receiving a reply that the message is harmful (steps S14, S15), it deletes the post or refrains from publishing it on the blog (step S16).

A diagram showing the configuration of an embodiment of the present invention.
A diagram showing an example of URL registration in the embodiment of the present invention.
A flowchart showing the overall processing procedure in the embodiment of the present invention.
A flowchart showing another example of the overall processing procedure in the embodiment of the present invention.
A diagram showing an example of the harmful determination dictionary in the embodiment of the present invention.
A flowchart showing a processing procedure that selects harmful determination means according to data capacity in the embodiment of the present invention.
A diagram showing another embodiment of the present invention.
A diagram showing another embodiment of the present invention.

Explanation of symbols

A (A1, A2): Contributor
B: Blog administrator
1: Blog functional unit or server
2: Message determination device
10: Harmful URL list
15: URL determination means
20: Page reading means
30 (31, 32, 33): Harmful determination means
D (D1, D2): Harmful determination dictionary
45: Capacity determination means
47: Selection means
50: Tail removal means
60: Tail determination registration means
65: Harmful notification means
70: API delivery means
80: Control means

Claims

A message determination device, wherein
a plurality of harmful determination means for determining the harmfulness of a web page;
page reading means for reading, via a communication network, the web page corresponding to a URL included in a message input to a message service;
capacity determination means for determining the data capacity range of the web page description file of the read web page based on one or more reference values; and
selection means for having the harmfulness of the web page determined by selecting one or more of the plurality of harmful determination means according to the determined data capacity range
are realized by a computer.

The message determination device according to claim 1, wherein at least some of the plurality of harmful determination means
have harmful determination dictionaries that correspond to the data capacity ranges and record mutually different groups of harmful words and phrases, and
check the content of the web page description file corresponding to the URL against the harmful determination dictionary and determine whether it is harmful based on matches with the recorded words.

The message determination device according to claim 2, wherein the selection means selects, according to the determined data capacity range, one or more of a plurality of harmful determination criteria including the number of words matching the recorded words of the harmful determination dictionary, and applies them by an AND or OR logical operation.

The message determination device according to claim 2 or 3, wherein the harmful determination means having the harmful determination dictionary
has, as the harmful determination dictionary, a black word dictionary and a gray word dictionary according to the difference in harmfulness, and
uses, as the predetermined harmful determination criterion for each data capacity range, one or both of the number of matching words in the black word dictionary and the number of matching words in the gray word dictionary.

A message determination method, wherein a computer executes:
a page reading process of reading, via a communication network, the web page corresponding to a URL included in a message input to a message service;
a capacity determination process of determining the data capacity range of the web page description file of the read web page based on one or more reference values; and
a selection process of having the harmfulness of the web page determined by selecting, according to the determined data capacity range, one or more of a plurality of different harmful determination processes that determine the harmfulness of a web page.

The message determination method according to claim 5, wherein at least some of the plurality of harmful determination processes
use harmful determination dictionaries, stored in advance, that correspond to the data capacity ranges and record mutually different groups of harmful words and phrases, and
check the content of the web page description file corresponding to the URL against the harmful determination dictionary and determine whether it is harmful based on matches with the recorded words.

The message determination method according to claim 6, wherein the selection process selects, according to the determined data capacity range, one or more of a plurality of harmful determination criteria including the number of words matching the recorded words of the harmful determination dictionary, and applies them by an AND or OR logical operation.

The message determination method according to claim 6 or 7, wherein the harmful determination process
uses, as the harmful determination dictionary, a black word dictionary and a gray word dictionary according to the difference in harmfulness, and
uses, as the predetermined harmful determination criterion for each data capacity range, one or both of the number of matching words in the black word dictionary and the number of matching words in the gray word dictionary.

A message determination program that, by controlling a computer, causes the computer to execute:
a page reading process of reading, via a communication network, the web page corresponding to a URL included in a message input to a message service;
a capacity determination process of determining the data capacity range of the web page description file of the read web page based on one or more reference values; and
a selection process of having the harmfulness of the web page determined by selecting, according to the determined data capacity range, one or more of a plurality of different harmful determination processes that determine the harmfulness of a web page.