JP2000235540A

JP2000235540A - Method and device for automatically filtering information using url hierarchy structure

Info

Publication number: JP2000235540A
Application number: JP11037525A
Authority: JP
Inventors: Keiichiro Hoashi; 啓一郎帆足; Naoki Inoue; 直己井ノ上; Kazuo Hashimoto; 和夫橋本
Original assignee: KDD Corp
Current assignee: KDDI Corp
Priority date: 1999-02-16
Filing date: 1999-02-16
Publication date: 2000-08-29
Anticipated expiration: 2019-02-16
Also published as: JP3220104B2

Abstract

PROBLEM TO BE SOLVED: To improve both precision and recall by automatically filtering HTML information, based on words extracted from it, to judge whether the information is inappropriate, and by blocking provision of the information when it is judged inappropriate. SOLUTION: An HTML document, i.e. HTML information supplied via the Internet, is input from an input unit 1, and it is judged whether the URL of the HTML document is an upper page. If it is an upper URL, a word extraction unit 3 extracts the words appearing in the document that the upper URL indicates, automatic filtering is performed based on the extracted words, and it is judged whether the information is harmful. If the information is judged harmful, the upper URL is registered in the harmful upper page list table of a harmful upper page list table storage unit 11 and provision of the information is blocked.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an automatic harmful-information filtering method and apparatus that identify inappropriate information, for example harmful information such as pornographic images, among the various kinds of information provided via the Internet, and block provision of the identified inappropriate information. More specifically, it relates to an automatic information filtering method and apparatus using the URL hierarchical structure, which judge inappropriate information based on hierarchically structured URLs and block presentation of that information.

[0002]

2. Description of the Related Art

With the rapid spread of the Internet, computers, which used to be tools for a limited number of experts, have begun to be introduced into ordinary homes and schools. As a result, many ordinary people who had never even touched a computer can now access the Internet easily. Against this background, children's access to harmful information such as the pornographic images that flood the Internet has become a serious problem in recent years. To address this problem, a law called the Communications Decency Act, which would have allowed government agencies to censor information on the Internet, was proposed in the United States, but as a result of litigation it was ruled to violate the constitution, which guarantees freedom of expression, and could not be enacted.

[0003] A technique called "information filtering" has therefore attracted attention recently. Information filtering checks the harmfulness of information on the Internet when a user accesses it and, if the information is judged harmful, blocks access to it by some means.

[0004] The techniques adopted by harmful-information filtering software currently on the market fall broadly into the following three categories.

[0005]
(1) Filtering based on self-rating
(2) Filtering based on third-party rating
(3) Automatic filtering

These three techniques are briefly described here. First, in filtering based on self-rating, the provider of WWW information judges the harmfulness of its own content and writes the result in the HTML file. The filtering software refers to this written result and blocks access if the content is judged harmful. FIG. 6 shows filtering by this technique.

[0006] The self-rating-based filtering shown in FIG. 6 uses PICS (Platform for Internet Content Selection), a standard for describing ratings of Internet content created by the World Wide Web Consortium at the Massachusetts Institute of Technology in the United States. By using PICS, content providers can easily describe and disclose the information they provide.

[0007] In many cases, a content provider that publishes such ratings uses the service of a rating organization that outputs ratings in PICS format. Representative rating organizations include the Recreational Software Advisory Council (RSAC) and SafeSurf, each of which provides ratings based on its own criteria. The content provider writes the rating obtained from such an organization in the header of the HTML file. FIG. 7 shows an example of such a rating description.

[0008] At present, this self-rating is left to the initiative of the content provider. Effective harmful-information filtering by this technique is therefore impossible unless many content providers are willing to have their content rated.

[0009] Next, filtering based on third-party rating is described. Some vendors of harmful-information filtering software judge the harmfulness of home pages on the WWW themselves and use the results as the decision criterion of their filtering software. Typically, the result of this evaluation is a list of the URLs of harmful home pages. This URL list is distributed to users together with the filtering software and serves as the software's decision criterion. In many cases, the filtering software is designed to download this harmful URL list periodically. FIG. 8 shows the mechanism of harmful-information filtering based on third-party rating.

[0010] CyberPatrol is a representative example of software with this mechanism. CyberPatrol holds a harmful URL list for each of 13 categories, such as "violence" and "sexual acts", and filters harmful information on that basis.

[0011] The harmful URL lists used in this technique are created and extended by each software vendor accessing home pages and judging them, so newly established home pages and home pages that have moved from their previous URL to another URL cannot be handled. At present, therefore, pages that have not been evaluated cannot be filtered.
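
The limitation described above can be illustrated with a minimal sketch, assuming the vendor's list is applied as an exact URL lookup. The list contents, the `example.*` hosts, and the function name `is_blocked` are hypothetical and are not taken from any actual product.

```python
# Hypothetical harmful URL list as it might be shipped with filtering software.
HARMFUL_URLS = {
    "http://www.example.com/adult/index.html",
    "http://www.example.org/xxx/",
}


def is_blocked(url, harmful_urls):
    """Block a page only when its URL is literally on the vendor's list."""
    return url in harmful_urls


# A page that the vendor has already evaluated is blocked.
print(is_blocked("http://www.example.com/adult/index.html", HARMFUL_URLS))  # True
# The same content moved to a new host is not on the list, so it passes.
print(is_blocked("http://www.example.net/adult/index.html", HARMFUL_URLS))  # False
```

Under this assumption, a page is protected against only after someone has visited and judged it, which is why new or relocated pages slip through.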

[0012] Next, automatic filtering is described. Some harmful-information filtering software checks the content of the accessed home page and judges its harmfulness. This idea was already introduced in early filtering software. For example, there was software that prohibited access to a URL if the URL contained a character string such as "sex" or "xxx". Software that examines the content of the page itself has since been developed. CyberSITTER is one such automatic filtering product. It filters by removing the harmful words contained in the accessed page before outputting it.

[0013] This technique has two problems. The first is the processing time required for the automatic judgment. Admittedly, processing of this kind takes only a few milliseconds, but the possibility that even such a short delay frustrates the user cannot be ruled out.

[0014] The other problem is the accuracy of automatic filtering. First, if a judgment algorithm that judges harmfulness word by word is adopted, many harmless pages are likely to be blocked. Indeed, bad cases have been reported, such as a home page about a town in the UK called "Sussex" being blocked. Furthermore, when automatic filtering looks only at the text information in a page, it cannot block a page on which only images are displayed.
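
Both accuracy problems can be reproduced with a small sketch. The text does not say which algorithm blocked the "Sussex" page; the code below assumes plain substring matching on the strings "sex" and "xxx" purely for illustration, and the whole-word variant is an illustrative contrast rather than a method described in this document. All function names are hypothetical.

```python
import re

# Strings named in the description of early filtering software.
BANNED = ("sex", "xxx")


def substring_filter(text):
    """Flag the text if any banned string occurs anywhere inside it."""
    lowered = text.lower()
    return any(word in lowered for word in BANNED)


def whole_word_filter(text):
    """Flag the text only if a banned string appears as a complete word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BANNED for token in tokens)


page = "Welcome to the official guide to Sussex, England."
print(substring_filter(page))   # True: "sussex" contains "sex" (false positive)
print(whole_word_filter(page))  # False

# A page that shows only images has no text for either filter to inspect.
print(substring_filter(""))     # False: the page would not be blocked
```

The first result shows a harmless page being blocked, and the last shows why a text-only check cannot catch an image-only page.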

[0015]

PROBLEMS TO BE SOLVED BY THE INVENTION

The main goals of filtering software are to increase the proportion of harmful pages that are blocked and to reduce the proportion of harmless pages that are blocked by mistake. If precision is defined as the proportion of blocked pages that were actually harmful, and recall as the proportion of actually harmful pages that were blocked, the goal of filtering software can be stated as raising both precision and recall.
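
As a worked example of these two definitions, the following sketch computes precision and recall from a set of blocked pages and a set of actually harmful pages. The page identifiers and the function name `precision_recall` are made up for illustration; returning 0.0 for an empty set is a convention chosen here, not something the text specifies.

```python
def precision_recall(blocked, harmful):
    """Return (precision, recall) for a set of blocked pages.

    precision = blocked pages that are harmful / all blocked pages
    recall    = harmful pages that were blocked / all harmful pages
    """
    blocked, harmful = set(blocked), set(harmful)
    correctly_blocked = len(blocked & harmful)
    precision = correctly_blocked / len(blocked) if blocked else 0.0
    recall = correctly_blocked / len(harmful) if harmful else 0.0
    return precision, recall


harmful_pages = {"a", "b", "c", "d"}   # pages that are actually harmful
blocked_pages = {"a", "b", "e"}        # pages the filter blocked

precision, recall = precision_recall(blocked_pages, harmful_pages)
print(round(precision, 3))  # 0.667: two of the three blocked pages were harmful
print(recall)               # 0.5: two of the four harmful pages were blocked
```

Blocking more aggressively tends to raise recall while lowering precision, which is why the text asks for both to be improved together.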

[0016] Each of the techniques described above has advantages and disadvantages. The characteristics of each technique are summarized from the viewpoints of precision and recall, and the results are shown in Table 1.

[0017]

[Table 1]

As shown, the filtering software currently on the market does not achieve sufficient filtering performance.

[0018] Moreover, as described above, conventional automatic filtering looks only at the text information in a page, so it cannot block pages that contain little or no text and display only images.

[0019] The present invention has been made in view of the above. Its object is to provide an automatic information filtering method and apparatus using the URL hierarchical structure that can improve both precision and recall by using the upper URLs of hierarchically structured URLs, and that can accurately judge the inappropriateness of content even for low-text pages containing only images.

[0020]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, the invention according to claim 1 is an automatic information filtering method that identifies inappropriate information among the various kinds of information provided via the Internet and blocks provision of the identified inappropriate information, the method comprising: inputting HTML information provided via the Internet; judging whether the URL of the HTML information is an upper URL; if the URL is an upper URL, extracting the words appearing in the information indicated by the upper URL and performing automatic filtering, based on the extracted words, to judge whether the information is inappropriate; if the automatic filtering judges the information inappropriate, registering the upper URL in an inappropriate upper URL list and blocking provision of the information; if the URL of the HTML information is not an upper URL, checking the URL against each URL in the registered inappropriate upper URL list to judge whether there is a matching URL; if there is a match, blocking presentation of the information indicated by the URL; if the URL matches none of the URLs in the inappropriate upper URL list, extracting the words appearing in the information indicated by the URL and performing automatic filtering, based on the extracted words, to judge whether the information is inappropriate; and if the automatic filtering judges the information inappropriate, blocking provision of the information.

[0021] According to the invention of claim 1, if the URL of the input HTML information is an upper URL, automatic filtering is performed on the information indicated by that upper URL, and if the information is judged inappropriate, the upper URL is registered in the inappropriate upper URL list and provision of the information is blocked. If the URL is not an upper URL, it is checked against each URL in the inappropriate upper URL list; if there is a match, presentation of the information indicated by the URL is blocked, and if there is no match, automatic filtering is performed on the information indicated by the URL and, if the information is judged inappropriate, its provision is blocked. As a result, even a low-text page presenting only images can be accurately judged inappropriate and blocked, and both precision and recall can be improved.

[0022] The invention according to claim 2 is the invention of claim 1, further comprising filtering based on third-party rating, in which URLs that provide inappropriate information are registered in advance as an inappropriate URL list, the URL of the input HTML information is checked against each URL in the inappropriate URL list to judge whether there is a matching URL, and, if there is a match, presentation of the information indicated by the URL is blocked.

[0023] According to the invention of claim 2, URLs that provide inappropriate information are registered in advance as an inappropriate URL list, the URL of the HTML information is checked against each URL in that list, and, if there is a match, presentation of the information indicated by the URL is blocked. Because this filtering based on third-party rating is performed in addition, combining it with the automatic filtering that uses upper URLs makes the filtering more complete.

[0024] Further, the invention according to claim 3 is an automatic information filtering apparatus that identifies inappropriate information among the various kinds of information provided via the Internet and blocks provision of the identified inappropriate information, the apparatus comprising: input means for inputting HTML information provided via the Internet; upper URL judging means for judging whether the URL of the input HTML information is an upper URL; first automatic filtering means for, when the upper URL judging means judges that the URL is an upper URL, extracting the words appearing in the information indicated by the upper URL and performing automatic filtering, based on the extracted words, to judge whether the information is inappropriate; inappropriate upper URL list registration means for, when the automatic filtering judges the information inappropriate, blocking presentation of the information and registering the upper URL in an inappropriate upper URL list table; inappropriate URL judging means for, when the upper URL judging means judges that the URL of the HTML information is not an upper URL, checking the URL against each URL registered in the inappropriate upper URL list table to judge whether there is a matching URL; second automatic filtering means for, when this judgment finds that the URL matches none of the URLs registered in the inappropriate upper URL list table, extracting the words appearing in the information indicated by the URL and performing automatic filtering, based on the extracted words, to judge whether the information is inappropriate; and information presentation blocking means for blocking presentation of the information indicated by the URL when the inappropriate URL judging means judges that the URL matches a URL registered in the inappropriate upper URL list table, and for blocking provision of the information when the filtering by the second automatic filtering means judges the information inappropriate.

[0025] According to the invention of claim 3, if the URL of the input HTML information is an upper URL, automatic filtering is performed on the information indicated by that upper URL, and if the information is judged inappropriate, the upper URL is registered in the inappropriate upper URL list table and provision of the information is blocked. If the URL is not an upper URL, it is checked against each URL in the inappropriate upper URL list table; if there is a match, presentation of the information indicated by the URL is blocked, and if there is no match, automatic filtering is performed on the information indicated by the URL and, if the information is judged inappropriate, its provision is blocked. As a result, even a low-text page presenting only images can be accurately judged inappropriate and blocked, and both precision and recall can be improved.

[0026] The invention according to claim 4 is the invention of claim 3, further comprising: inappropriate URL list registration means for registering URLs that provide inappropriate information in an inappropriate URL list table; matching URL judging means for checking the URL of the HTML information input from the input means against each URL registered in the inappropriate URL list table to judge whether there is a matching URL; and third-party-rating-based filtering means for blocking presentation of the information indicated by the URL when this judgment finds a matching URL.

[0027] According to the invention of claim 4, URLs that provide inappropriate information are registered in advance in an inappropriate URL list table, the URL of the HTML information is checked against each URL in that table, and, if there is a match, presentation of the information indicated by the URL is blocked. Because this filtering based on third-party rating is performed in addition, combining it with the automatic filtering that uses upper URLs makes the filtering more complete.

[0028]

EMBODIMENTS OF THE INVENTION

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an automatic information filtering apparatus using the URL hierarchical structure according to one embodiment of the present invention.

[0029] The automatic information filtering apparatus shown in FIG. 1 identifies inappropriate information, for example harmful information such as pornography, among the various kinds of information provided via the Internet and blocks provision of the identified inappropriate information. It comprises an input unit 1 that inputs HTML information provided via the Internet; a word extraction unit 3 that extracts the words appearing in the input information; a storage unit 5 that stores the extracted words, the software that executes the automatic information filtering process of this embodiment, and various other information; a word weight data storage unit 7 that stores word weight data; an automatic filtering unit 9 that performs automatic filtering; a harmful upper page list table storage unit 11 that stores the harmful upper page list as a table; and an output unit 13 that outputs the filtering result.

[0030] The automatic information filtering apparatus of this embodiment filters harmful information by using the upper URLs among URLs that have a hierarchical structure. The underlying concept is described first.

[0031] As described above, one of the major problems of automatic filtering is the difficulty of filtering home pages that have little or no text information. Pornographic harmful-information pages in particular are thought to include many pages that contain only images, so a way of handling these low-text pages must be considered. A typical WWW user, however, can be expected to follow links in order to reach an image-only page. If this assumption holds, access to the image page can also be blocked by filtering the pages in the upper hierarchy that lead to it. The technique for filtering these upper-hierarchy pages is described below.

[0032] First, a page in the upper hierarchy (an "upper page") is defined as a page whose URL ends with one of the following seven character strings:

(1) index.html
(2) index.htm
(3) index.shtml
(4) welcome.html
(5) welcome.htm
(6) welcome.shtml
(7) /

For example, http://www.kdd.co.jp/index.html and http://www.asahi.com/ are regarded as upper pages. Of these upper pages, those judged harmful by the filtering software are stored in a harmful upper page list. In doing so, not the whole URL but only the part of the URL up to its deepest directory is stored. For example, if http://www.###.co.jp/index.html is harmful, http://www.###.co.jp/ is stored in the list, and if http://www.###.co.jp/aaa/bbb/ccc/index.html is harmful, http://www.###.co.jp/aaa/bbb/ccc/ is stored in the list.
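
The upper-page test and the "deepest directory" rule can be sketched as follows. This is a literal reading of the seven suffixes, not the patent's implementation: the function names `is_upper_page` and `directory_of` are hypothetical, `www.example.co.jp` is a placeholder for the masked "###" host, and the directory rule assumes the URL contains a path (at least one "/" after the host).

```python
# The seven URL endings that the description treats as marking an upper page.
UPPER_PAGE_SUFFIXES = (
    "index.html", "index.htm", "index.shtml",
    "welcome.html", "welcome.htm", "welcome.shtml",
    "/",
)


def is_upper_page(url):
    """Return True if the URL ends with one of the seven listed strings."""
    return url.endswith(UPPER_PAGE_SUFFIXES)


def directory_of(url):
    """Return the URL up to and including its last '/' (its deepest directory)."""
    return url[: url.rfind("/") + 1]


print(is_upper_page("http://www.example.co.jp/aaa/bbb/ccc/index.html"))  # True
print(is_upper_page("http://www.example.co.jp/"))                        # True
print(is_upper_page("http://www.example.co.jp/aaa/bbb.html"))            # False

# What would be stored in the harmful upper page list for a harmful upper page:
print(directory_of("http://www.example.co.jp/aaa/bbb/ccc/index.html"))
# http://www.example.co.jp/aaa/bbb/ccc/
print(directory_of("http://www.example.co.jp/index.html"))
# http://www.example.co.jp/
```

Storing the directory rather than the full URL is what later allows other pages under the same directory to be matched against the list.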

[0033] When a page other than an upper page is accessed, the URL of the accessed page is compared with the URLs in the harmful upper page list before the usual automatic harmfulness judgment. If the directory of the accessed page's URL matches any of the URLs in the harmful upper page list, the page is regarded as harmful. For example, if http://www.###.co.jp/ is in the harmful upper page list, both http://www.###.co.jp/aaa/bbb.html and http://www.###.co.jp/nantoka.html are regarded as harmful. If, on the other hand, there is no match with the entries in the harmful upper page list, the automatic filtering software judges the harmfulness.
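
A minimal sketch of this comparison follows. Because the example treats http://www.###.co.jp/aaa/bbb.html as matching the stored entry http://www.###.co.jp/, the sketch interprets "match" as a prefix match; that interpretation, the function name `matches_harmful_upper_page`, and the placeholder host `www.example.co.jp` are assumptions made for illustration.

```python
def matches_harmful_upper_page(url, harmful_upper_pages):
    """Return True if the URL lies at or below any stored harmful directory."""
    return any(url.startswith(directory) for directory in harmful_upper_pages)


# One stored entry, standing in for http://www.###.co.jp/ in the text.
harmful_upper_pages = {"http://www.example.co.jp/"}

print(matches_harmful_upper_page(
    "http://www.example.co.jp/aaa/bbb.html", harmful_upper_pages))   # True
print(matches_harmful_upper_page(
    "http://www.example.co.jp/nantoka.html", harmful_upper_pages))   # True
print(matches_harmful_upper_page(
    "http://www.example.com/aaa/bbb.html", harmful_upper_pages))     # False
```

A page that returns False here is not yet cleared; it still goes through the ordinary automatic harmfulness judgment.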

[0034] Based on the concept described above, the automatic information filtering apparatus of this embodiment is configured to block harmful information. Next, the operation of the automatic information filtering apparatus using the URL hierarchical structure shown in FIG. 1 is described with reference to the flowchart shown in FIG. 2.

[0035] In FIG. 2, when an HTML document, i.e. HTML information provided via the Internet, is input from the input unit 1 (step S11), it is judged whether the URL of the input HTML document is an upper URL, that is, an upper page (step S13). If the URL of the HTML document is an upper URL, the word extraction unit 3 extracts the words appearing in the document, i.e. the information, indicated by the upper URL, the automatic filtering unit 9 performs automatic filtering based on the extracted words (step S15), and it is judged whether the information is harmful (step S17).

[0036] If this automatic filtering judges the information harmful, the upper URL is registered in the harmful upper page list table of the harmful upper page list table storage unit 11 (step S21), provision of the information is blocked, and the process ends (step S31).

[0037] If, on the other hand, the judgment in step S17 finds that the information is not harmful, the output unit 13 displays it in the browser and the process ends (step S19).

[0038] If the judgment in step S13 finds that the page is not an upper page, the URL is checked against each URL registered in the harmful upper page list table of the harmful upper page list table storage unit 11 (step S23) to see whether there is a matching URL (step S25). If the harmful upper page list table contains a matching URL, provision of the information indicated by the URL is blocked and the process ends (step S31).

[0039] If the check in step S25 finds no matching URL, the word extraction unit 3 extracts the words appearing in the information indicated by this URL, the automatic filtering unit 9 performs automatic filtering based on the extracted words (step S27), and it is determined whether the information is harmful (step S29).

[0040] If this automatic filtering determines that the information is harmful, provision of the information is blocked and the process ends (step S31). If the determination in step S29 finds that the information is not harmful, the output unit 13 displays it on the browser and the process ends (step S19).
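The flow of FIG. 2 (steps S11 to S31) can be sketched roughly as follows. This is only an illustrative sketch: the rule in `is_upper_url`, the helper `directory_of`, the prefix matching against the table, and the `is_harmful` callback are assumptions made for the example, not the definitions used in the patent.

```python
from urllib.parse import urlsplit


def is_upper_url(url):
    """Hypothetical rule: a directory URL or an index.* page is an upper page."""
    last = urlsplit(url).path.rsplit("/", 1)[-1]
    return last == "" or last.startswith("index.")


def directory_of(url):
    """Return the directory part of a URL, used here as the table key (assumption)."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def filter_page(url, text, harmful_upper_table, is_harmful):
    """Return "block" or "show" for one HTML document."""
    if is_upper_url(url):                                     # step S13
        if is_harmful(text):                                  # steps S15, S17
            harmful_upper_table.add(directory_of(url))        # step S21
            return "block"                                    # step S31
        return "show"                                         # step S19
    # Steps S23, S25: compare with the registered harmful upper pages.
    if any(url.startswith(upper) for upper in harmful_upper_table):
        return "block"                                        # step S31
    # Steps S27, S29: fall back to word-based filtering of the page itself.
    return "block" if is_harmful(text) else "show"
```

The point of the table is visible in the second branch: a page with almost no text is blocked because its upper page was already judged harmful, without running the word-based filter on it.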

[0041] To assess the effect of the automatic information filtering apparatus using the URL hierarchical structure of the embodiment described above, the following evaluation experiment was conducted.

[0042] As evaluation data for this experiment, a large number of harmful pages were collected using software that automatically collects data on the WWW (a "collection robot"). The collection robot was started from an HTML page containing links to 290 harmful pages (160 written in Japanese and 130 in English), followed the links in turn, and collected the HTML documents accessed along the way. Only HTML documents were collected; image data, audio data, and the like were not. As a result, 28,034 HTML documents were collected.

[0043] Next, the harmfulness of each collected HTML document was rated on a three-level scale by subjective evaluation. This evaluation concerns the presence or absence of sexual expressions on each page. Table 2 shows the criteria for each level.

[0044]

[Table 2]

The results of the harmfulness evaluation of the collected data are shown in Table 3.

[0045]

[Table 3]

Along with this harmfulness evaluation, the language in which each page is written was also surveyed. Table 4 shows the results.

[0046]

[Table 4]

The proportion of all harmful data accounted for by the "harmful upper pages" described above was also examined. Here, harmful data means data rated level 2 or 3 in the harmfulness evaluation. Table 5 shows the results.

[0047]

[Table 5]

Next, the automatic filtering algorithm of the automatic filtering unit 9 used in the automatic information filtering apparatus of the embodiment shown in FIG. 1, in particular the algorithm used in the evaluation experiment, is described. This automatic filtering uses the vector space model, which is used in information retrieval, automatic classification, and the like.

[0048] First, the HTML document input from the input unit 1 is represented using the vector space model. That is, n words that represent all documents are selected, and each document is represented by an n-dimensional vector as in the following equation.

[0049]

[Equation 1]

Each element of this vector is the normalized frequency of occurrence of the corresponding word in document d. The word frequencies are normalized using a technique called TF*IDF, expressed by the following equation.

[0050]

[Equation 2]

Here, tf_di is the frequency with which word i appears in document d, N is the total number of documents, and df_i is the number of documents in which word i appears.
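The images for equations (1) and (2) are not reproduced in this text. A reconstruction consistent with the definitions given here would plausibly read as follows; the vector name V_d and the use of the common logarithmic form of TF*IDF are assumptions, not taken from the original figures.

```latex
V_d = (f_{d1}, \ldots, f_{di}, \ldots, f_{dn}) \tag{1}

f_{di} = tf_{di} \cdot \log\frac{N}{df_i} \tag{2}
```

Under this form, a word that appears in every document (df_i = N) contributes f_di = 0, so only words that discriminate between documents carry weight in the vector.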

[0051] Automatic filtering is performed with a linear discriminant function expressed by the following equation, which computes the sum of word weights Dis(d).

[0052]

[Equation 3]

Here, w_i is the weight for each word i, and f_di is the TF*IDF value defined above, that is, the f_di value of each word in the document.

[0053] From equation (3) above, the document is judged harmful if the sum Dis(d) is greater than 0 and harmless if it is 0 or less.
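The image for equation (3) is likewise missing. Given that Dis(d) is described as a linear discriminant function over the word weights w_i and the values f_di, it presumably has the form below; the summation limits are an assumption based on the n-word vocabulary.

```latex
Dis(d) = \sum_{i=1}^{n} w_i \cdot f_{di} \tag{3}

d \text{ is judged }
\begin{cases}
\text{harmful}  & \text{if } Dis(d) > 0 \\
\text{harmless} & \text{if } Dis(d) \le 0
\end{cases}
```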

[0054] The weight for each word i is set so that Dis(d) > 0 when document d is harmful and Dis(d) ≤ 0 when it is harmless.
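As a rough illustration of the vector representation and the discriminant, the following sketch computes TF*IDF values and Dis(d) for pre-tokenized documents. It assumes the logarithmic TF*IDF form; the function names, the toy vocabulary, and the hand-picked weights are hypothetical and only serve the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Map each tokenized document to a list of f_di = tf_di * log(N / df_i)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # df_i: number of documents in which word i appears
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocab}
    return [
        [c[w] * math.log(n_docs / df[w]) if df[w] else 0.0 for w in vocab]
        for c in counts
    ]


def dis(weights, f_d):
    """Linear discriminant: sum of w_i * f_di over the vocabulary."""
    return sum(w * f for w, f in zip(weights, f_d))


def is_harmful(weights, f_d):
    """Harmful if Dis(d) > 0, harmless if Dis(d) <= 0."""
    return dis(weights, f_d) > 0


docs = [["adult", "pics", "adult"], ["news", "weather"], ["news", "pics"]]
vocab = ["adult", "news", "pics"]
vectors = tfidf_vectors(docs, vocab)
weights = [1.0, -1.0, 0.0]  # hand-picked for the example, not learned
```

With these toy weights the first document is classified as harmful and the second as harmless; in the patent the weights are learned rather than chosen by hand.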

[0055] Next, the setting of the word weights is described with reference to the flowchart shown in FIG. 3. The word weights are learned using the perceptron learning algorithm (PLA).

[0056] In FIG. 3, various parameters are first set (step S51). These parameters are the set of word weights W = (w_1, ..., w_n), the N training documents E = {d_1, ..., d_N}, a constant η, the maximum number of learning iterations Max, and the learning iteration count m, which counts how many times the learning process shown in FIG. 3 has been repeated.

[0057] Next, the set of word weights W is initialized (step S53) by assigning a random number to each word weight. Then the sum of word weights Dis(d) is computed for all training data using equation (3) above (step S55).

[0058] It is then checked whether Dis(d) ≤ 0 for all harmless documents d and Dis(d) > 0 for all harmful documents d (step S57). If so, the process ends. If not, the weight change degree S is corrected for every misclassified document d, as shown in steps S61 and S63 (step S59).

[0059] That is, in step S61, if document d_i is harmful and Dis(d) ≤ 0, the weight change degree S is corrected so as to increase; in step S63, if document d_i is harmless and Dis(d) > 0, the weight change degree S is corrected so as to decrease.

[0060] The set of word weights W is then corrected using the weight change degree S thus corrected, according to the equation shown in step S65. Next, the learning iteration count m is incremented by 1 (step S67), and it is checked whether m is smaller than the maximum number of learning iterations Max (step S69). If it is smaller, the process returns to step S55, and the processing from step S55 onward is repeated until the condition shown in step S57 is satisfied.
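The learning loop of FIG. 3 can be sketched as a batch perceptron. The update formulas of steps S61, S63, and S65 appear only in the figure, so the forms used here (S += f_d, S -= f_d, and W += η·S) are the standard perceptron updates and are an assumption; the function name and default parameter values are also hypothetical.

```python
import random


def train_pla(vectors, labels, eta=0.1, max_iter=1000, seed=0):
    """Learn word weights W so that Dis(d) > 0 exactly for harmful documents.

    vectors: list of f_d lists; labels: True for harmful, False for harmless.
    """
    rng = random.Random(seed)
    n = len(vectors[0])
    w = [rng.uniform(-1.0, 1.0) for _ in range(n)]           # step S53
    for _ in range(max_iter):                                 # steps S67, S69
        s = [0.0] * n                                         # weight change degree S
        errors = 0
        for f_d, harmful in zip(vectors, labels):
            d = sum(wi * fi for wi, fi in zip(w, f_d))        # step S55
            if harmful and d <= 0:                            # step S61: increase S
                s = [si + fi for si, fi in zip(s, f_d)]
                errors += 1
            elif not harmful and d > 0:                       # step S63: decrease S
                s = [si - fi for si, fi in zip(s, f_d)]
                errors += 1
        if errors == 0:                                       # step S57: all correct
            break
        w = [wi + eta * si for wi, si in zip(w, s)]           # step S65
    return w
```

For linearly separable training data this loop stops as soon as every document is on the correct side of the threshold; otherwise it stops after `max_iter` passes, which corresponds to the Max bound in the flowchart.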

[0061] Next, the evaluation experiment of the automatic information filtering apparatus using the URL hierarchical structure of the embodiment described above is described. The experiment consists of the following three processes.

[0062] (1) Extraction of the word set used to represent documents. (2) Learning of the weight for each word. (3) Final evaluation.

[0063] First, in the word extraction process, morphological analysis was performed on 5,912 documents written in Japanese from the collected data, and nouns, proper nouns, and undefined words were extracted. Because morphological analysis software for Japanese was used, English words contained in the documents are extracted as undefined words. For the morphological analysis, a glossary of terms such as sexual expressions not found in the dictionary was created and used together with a standard Japanese dictionary. About 1,000 words are registered in this glossary. Of the extracted words, those with a frequency of 20 or less in the data as a whole were removed. As a result, 8,013 words were extracted.
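The frequency cutoff described in this paragraph can be illustrated with a short sketch. It assumes the documents have already been tokenized (the morphological analysis itself is not reproduced here), and the function and parameter names are hypothetical.

```python
from collections import Counter


def select_vocabulary(tokenized_docs, max_dropped_freq=20):
    """Keep words whose total frequency over all documents exceeds the cutoff."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(w for w, c in totals.items() if c > max_dropped_freq)
```

A word that occurs exactly 20 times in the whole collection is dropped, matching the "20 or less" rule stated above.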

[0064] Part of the evaluation data was used for weight learning. This training data consists of 18,387 HTML documents, of which 9,263 were written in English, 8,171 in Japanese, and 953 in other languages. The final evaluation was performed on the entire evaluation data, including the word extraction data and the training data.

[0065] In the evaluation, to verify the hypothesis that HTML documents with little text information are difficult to filter, filtering was applied to documents in which the total number of words appearing in one HTML document was at or below a threshold min, and the correct answer rate and recall were obtained. Table 6 shows the results.
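The measurement restricted to short documents could be computed along the following lines. This sketch assumes that the correct answer rate is the share of blocked documents that are actually harmful and that recall is the share of harmful documents that are blocked; this section does not restate the definitions, and the record fields and function name are hypothetical.

```python
def rates_for_short_docs(records, min_words):
    """Return (correct_answer_rate, recall) over documents with at most min_words words.

    Each record is a dict with keys "n_words", "harmful", and "blocked".
    """
    subset = [r for r in records if r["n_words"] <= min_words]
    blocked = [r for r in subset if r["blocked"]]
    harmful = [r for r in subset if r["harmful"]]
    hits = sum(1 for r in blocked if r["harmful"])
    correct_rate = hits / len(blocked) if blocked else 0.0
    recall = hits / len(harmful) if harmful else 0.0
    return correct_rate, recall
```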

[0066]

[Table 6]

As these results show, as the number of words decreases, the correct answer rate does not change much but the recall drops markedly. The hypothesis that documents with few words are difficult to filter can therefore be said to be supported.

[0067] Next, filtering that takes the URL hierarchical structure into account was applied to the same evaluation data, and the correct answer rate and recall were obtained in the same way. Table 7 shows the results.

[0068]

[Table 7]

These results show that adopting the automatic filtering method according to the present invention substantially increased recall while maintaining a high correct answer rate. These results can be said to demonstrate the effectiveness of the present invention.

[0069] Next, an automatic filtering apparatus according to another embodiment of the present invention is described with reference to FIGS. 4 and 5. The automatic filtering apparatus of this embodiment adds, to the automatic information filtering apparatus using the URL hierarchical structure described with reference to FIGS. 1 to 3, a third-party determination filtering processing unit that performs filtering based on third-party determination, and aims to achieve ideal filtering by combining the two filtering processes.

[0070] In the automatic filtering apparatus shown in FIG. 4, a third-party determination filtering processing unit 23 and a harmful URL list table storage unit 17, which the third-party determination filtering processing unit 23 uses to look up harmful URLs, are added to the automatic information filtering apparatus 25 using the URL hierarchical structure described with reference to FIGS. 1 to 3.

[0071] The harmful URL list table storage unit 17 stores URLs that provide harmful information as a harmful URL list table. The third-party determination filtering processing unit 23 checks the URL of the HTML document input from the input unit 1 against each URL registered in the harmful URL list table of the harmful URL list table storage unit 17 and determines whether there is a matching URL.

[0072] FIG. 5 is a block diagram showing a more detailed configuration of the automatic filtering apparatus shown in FIG. 4. In addition to the input unit 1, word extraction unit 3, storage unit 5, word weight data storage unit 7, automatic filtering unit 9, harmful upper page list table storage unit 11, and output unit 13 that make up the automatic information filtering apparatus using the URL hierarchical structure shown in FIG. 1, the automatic filtering apparatus shown in FIG. 5 has a URL-list-based filtering unit 15, which corresponds to the third-party determination filtering processing unit 23 of FIG. 4, and the harmful URL list table storage unit 17.

[0073] In the filtering process performed by the automatic filtering apparatus thus configured, that is, filtering using both the URL list of the third-party determination filtering processing unit and the automatic information filtering apparatus using the URL hierarchical structure, the URL of an HTML document input from the input unit 1 via the Internet 21 is first checked against each URL registered in the harmful URL list table of the harmful URL list table storage unit 17 to determine whether there is a matching URL. If it matches a URL registered in the harmful URL list table of the harmful URL list table storage unit 17, presentation of the information indicated by this URL is blocked.

[0074] If the determination by the URL-list-based filtering unit 15 with reference to the harmful URL list table finds no match with any URL registered in the harmful URL list table of the harmful URL list table storage unit 17, filtering by the automatic information filtering apparatus 25 using the URL hierarchical structure is performed as described with reference to FIGS. 1 to 3.

[0075] Thus, in this embodiment, both filtering based on third-party determination and filtering using the URL hierarchical structure are performed, so harmful information can be accurately detected and blocked.
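The order of the two stages in this embodiment can be sketched as below. The function name is hypothetical, exact string matching against the third-party list is an assumption, and the second stage is passed in as a callable so that the sketch stays independent of how the URL-hierarchy filter is implemented.

```python
def combined_filter(url, text, harmful_url_list, url_hierarchy_filter):
    """Apply the third-party URL list first, then the URL-hierarchy filter.

    harmful_url_list: set of URLs judged harmful by a third party (unit 17).
    url_hierarchy_filter: callable (url, text) -> "block" or "show" (apparatus 25).
    """
    if url in harmful_url_list:      # URL-list-based filtering unit 15
        return "block"
    return url_hierarchy_filter(url, text)
```

Pages already listed by a third party are blocked without any text analysis, and only unlisted pages reach the automatic filter.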

[0076]

[Effects of the Invention] As described above, according to the present invention, when the URL of HTML information is an upper URL, automatic filtering is applied to the information indicated by that upper URL; if the information is found to be inappropriate, the upper URL is registered in the inappropriate upper URL list and provision of the information is blocked. When the URL is not an upper URL, it is checked against each URL in the inappropriate upper URL list; if there is a matching URL, presentation of the information indicated by the URL is blocked, and if there is none, automatic filtering is applied to the information indicated by the URL and, if the information is found to be inappropriate, provision of the information is blocked. As a result, the inappropriateness of even pages with little text, on which only images are presented, can be accurately determined and such pages blocked, and both the correct answer rate and the recall can be improved.

[0077] Further, according to the present invention, in addition to automatic information filtering using the URL hierarchical structure, URLs that provide inappropriate information are registered as an inappropriate URL list, the URL of HTML information is checked against each URL in the inappropriate URL list, and, if there is a matching URL, filtering based on third-party determination is further performed to block presentation of the information indicated by that URL. Filtering can therefore be performed more completely through both this filtering based on third-party determination and the automatic filtering using upper URLs.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an automatic information filtering apparatus using a URL hierarchical structure according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the automatic information filtering apparatus using the URL hierarchical structure shown in FIG. 1.

FIG. 3 is a flowchart showing the procedure for setting the word weights used in the flowchart shown in FIG. 2.

FIG. 4 is an explanatory diagram showing the schematic configuration of an automatic filtering apparatus according to another embodiment of the present invention.

FIG. 5 is a block diagram showing the detailed configuration of the automatic filtering apparatus shown in FIG. 4.

FIG. 6 is a diagram for explaining conventional filtering based on self-determination.

FIG. 7 is a diagram showing example descriptions of rating results by RSACi and SafeSurf, as examples of the filtering based on self-determination shown in FIG. 6.

FIG. 8 is a diagram for explaining conventional harmful information filtering based on third-party determination.

[Explanation of symbols]

1 Input unit
3 Word extraction unit
7 Word weight data storage unit
9 Automatic filtering unit
11 Harmful upper page list table storage unit
15 URL-list-based filtering unit
17 Harmful URL list table storage unit

Continuation of front page

(72) Inventor: Kazuo Hashimoto, c/o KDD R&D Laboratories Inc., 2-1-15 Ohara, Kamifukuoka-shi, Saitama

F-terms (reference): 5B075 KK60 KK63 ND36; 5B089 HA10 JA22 JB02 KA04 KA12 KA17 KB07 KC15 KC52 KC53 LB08

Claims

[Claims]

1. An automatic information filtering method using a URL hierarchical structure, for identifying inappropriate information among various information provided via the Internet and blocking provision of the identified inappropriate information, the method comprising: inputting HTML information provided via the Internet; determining whether the URL of the HTML information is an upper URL; if the URL is an upper URL, extracting words appearing in the information indicated by the upper URL and performing, based on the extracted words, automatic filtering as to whether the information is inappropriate; if the automatic filtering determines that the information is inappropriate, registering the upper URL in an inappropriate upper URL list and blocking provision of the information; if the URL of the HTML information is not an upper URL, checking the URL against each URL in the registered inappropriate upper URL list to determine whether there is a matching URL; if there is a matching URL, blocking presentation of the information indicated by the URL; if there is no URL matching a URL in the inappropriate upper URL list, extracting words appearing in the information indicated by the URL and performing, based on the extracted words, automatic filtering as to whether the information is inappropriate; and, if the automatic filtering determines that the information is inappropriate, blocking provision of the information.

2. The automatic information filtering method using a URL hierarchical structure according to claim 1, further comprising filtering based on third-party determination, in which URLs that provide inappropriate information are registered as an inappropriate URL list, the URL of the input HTML information is checked against each URL in the inappropriate URL list to determine whether there is a matching URL, and, if there is a matching URL, presentation of the information indicated by the URL is blocked.

3. An automatic information filtering apparatus using a URL hierarchical structure, for identifying inappropriate information among various information provided via the Internet and blocking provision of the identified inappropriate information, the apparatus comprising: input means for inputting HTML information provided via the Internet; upper URL determination means for determining whether the URL of the input HTML information is an upper URL; first automatic filtering means for, if the upper URL determination means determines that the URL is an upper URL, extracting words appearing in the information indicated by the upper URL and performing, based on the extracted words, automatic filtering as to whether the information is inappropriate; inappropriate upper URL list registration means for, if this filtering determines that the information is inappropriate, blocking presentation of the information and registering the upper URL in an inappropriate upper URL list table; inappropriate URL determination means for, if the upper URL determination means determines that the URL of the HTML information is not an upper URL, checking the URL against each URL registered in the inappropriate upper URL list table to determine whether there is a matching URL; second automatic filtering means for, if this determination finds no URL matching a URL registered in the inappropriate upper URL list table, extracting words appearing in the information indicated by the URL and performing, based on the extracted words, automatic filtering as to whether the information is inappropriate; and information presentation blocking means for blocking presentation of the information indicated by the URL if the inappropriate URL determination means determines that the URL matches a URL registered in the inappropriate upper URL list table, and for blocking provision of the information if the filtering by the second automatic filtering means determines that the information is inappropriate.

4. The automatic information filtering apparatus using a URL hierarchical structure according to claim 3, further comprising: inappropriate URL list registration means for registering URLs that provide inappropriate information in an inappropriate URL list table; matching URL determination means for checking the URL of the HTML information input from the input means against each URL registered in the inappropriate URL list table to determine whether there is a matching URL; and filtering means based on third-party determination for blocking presentation of the information indicated by the URL if this determination finds a matching URL.