JP2001028006A

JP2001028006A - Method and device for automatic information filtering

Info

Publication number: JP2001028006A
Application number: JP11201988A
Authority: JP
Inventors: Naoki Inoue; 直己井ノ上; Keiichiro Hoashi; 啓一郎帆足; Kazuo Hashimoto; 和夫橋本
Original assignee: KDD Corp
Current assignee: KDDI Corp
Priority date: 1999-07-15
Filing date: 1999-07-15
Publication date: 2001-01-30

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for automatic information filtering that set word weights easily and accurately and use those weights to decide properly whether information is inappropriate. SOLUTION: Inappropriate information, whose provision should be blocked, and appropriate information, whose provision need not be blocked, are input as learning data to a weighted word list learning part 60. Word weights are obtained from a linear discriminant function that separates the inappropriate information from the appropriate information in a vector space, and are stored as a weighted word list in a weighted word list storage part 50. A word extraction part 3 extracts words from information supplied by an input part 1; the weight w of each extracted word is obtained from the weighted word list storage part 50 and input to an automatic filtering part 30, which calculates the total of the weights w. The information is judged inappropriate when the total is larger than a threshold and appropriate when the total is smaller.

Description

DETAILED DESCRIPTION OF THE INVENTION

BACKGROUND OF THE INVENTION

1. Field of the Invention

[0001] The present invention relates to an automatic information filtering method and apparatus that extract the words appearing in various kinds of information provided via the Internet, including e-mail, determine from those words whether the information is inappropriate, and block the provision of inappropriate information.

2. Description of the Related Art

[0002] With the rapid spread of the Internet, computers, which used to be tools for a limited group of specialists, have begun to be introduced into ordinary homes and schools. As a result, many ordinary people who had never even touched a computer can now access the Internet with ease. Against this background, children's access to harmful information, such as the pornographic images that flood the Internet, has become a serious problem in recent years. To address it, a law called the Communications Decency Act, which would have allowed government agencies to censor information on the Internet, was proposed in the United States, but the courts ruled that it violated the constitutional guarantee of freedom of expression, and it could not be enacted.

[0003] A technique called "information filtering" has therefore attracted attention recently. Information filtering checks the harmfulness of information when a user accesses it on the Internet and, if the information is judged harmful, blocks access to it by some means.

[0004] The methods adopted by harmful-information filtering software currently on the market fall broadly into the following four categories.

[0005]
(1) Filtering based on self-rating
(2) Filtering based on third-party rating
(3) Automatic filtering
(4) A method using scores assigned to words

These four methods are briefly described here. First, in filtering based on self-rating, the provider of WWW information judges the harmfulness of its own content and records the result in the HTML file. The filtering software refers to the recorded result and blocks access when the content is judged harmful. FIG. 4 shows filtering by this method.

[0006] The self-rating-based filtering shown in FIG. 4 uses PICS (Platform for Internet Content Selection), a standard for describing ratings of Internet content that was created by the World Wide Web Consortium at the Massachusetts Institute of Technology in the United States. With PICS, content providers can easily describe and disclose the information they provide.

[0007] In many cases, a content provider that publishes such ratings uses the service of a rating organization that issues PICS ratings. Representative rating organizations include the Recreational Software Advisory Council (RSAC) and SafeSurf, each of which provides ratings based on its own criteria. The content provider writes the rating obtained from such an organization in the header of the HTML file. FIG. 5 shows an example of such a rating description.

[0008] At present, self-rating is left to the voluntary initiative of content providers. Effective harmful-information filtering by this method is therefore impossible unless many content providers are willing to have their content rated.

[0009] Next, filtering based on third-party rating is described. Some vendors of harmful-information filtering software judge the harmfulness of web pages on the WWW themselves and use the results as the decision criterion of their filtering software. Typically, the result of this evaluation is a list of the URLs of harmful web pages. The URL list is distributed to users together with the filtering software and serves as its decision criterion. In many cases the filtering software is designed to download the harmful-URL list periodically. FIG. 6 shows the mechanism of harmful-information filtering based on third-party rating.

[0010] CyberPatrol is a representative example of software with this mechanism. CyberPatrol holds a harmful-URL list for each of 13 categories, such as "violence" and "sexual acts", and filters harmful information according to this scheme.

[0011] The harmful-URL lists used in this method are created and extended by each software vendor, which accesses web pages and judges them. They therefore cannot cover newly created web pages or pages that have moved from their previous URL to a different one. At present, pages that fall outside the scope of evaluation in this way cannot be filtered.
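The limitation of list-based filtering can be illustrated with a minimal Python sketch. The URLs, the `HARMFUL_URLS` set, and the helper names are hypothetical and only stand in for a vendor-distributed list; real products may match URLs quite differently.

```python
from urllib.parse import urlsplit

# Hypothetical harmful-URL list, standing in for a vendor-distributed one.
HARMFUL_URLS = {
    "http://adult.example.com/index.html",
    "http://violence.example.org/",
}


def normalize(url: str) -> str:
    """Lower-case scheme and host so trivially different spellings match."""
    parts = urlsplit(url)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path or '/'}"


def is_blocked(url: str, harmful_urls=HARMFUL_URLS) -> bool:
    """Block only when the URL itself is on the distributed list."""
    return normalize(url) in {normalize(u) for u in harmful_urls}


# A listed page is blocked; the same content moved to an unlisted URL is not.
listed = is_blocked("http://ADULT.example.com/index.html")
moved = is_blocked("http://adult-new.example.net/index.html")
```

Because the decision depends only on list membership, a page that was never evaluated, or that has moved, passes through regardless of its content.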

[0012] Next, automatic filtering is described. Some harmful-information filtering software checks the content of the web page being accessed and judges its harmfulness.

[0013] Specifically, words likely to appear in harmful, that is, inappropriate, information are registered in advance; the information is checked for occurrences of the registered words; and provision of the information is blocked if a registered word is found. To block pornographic information, for example, provision is blocked if the information contains a character string such as "sex" or "xxx". As a variant of this method, provision may instead be blocked when the proportion of registered words in the information exceeds a predetermined threshold.
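A minimal Python sketch of both variants follows. The function names, the whitespace tokenization, and the threshold of 0.2 are assumptions made for illustration; the text does not say how the proportion is computed.

```python
REGISTERED = ["sex", "xxx"]  # character strings registered in advance


def blocked_by_substring(text: str, registered=REGISTERED) -> bool:
    """Block if any registered string occurs anywhere in the text."""
    lowered = text.lower()
    return any(term in lowered for term in registered)


def registered_ratio(text: str, registered=REGISTERED) -> float:
    """Fraction of whitespace-separated tokens containing a registered string."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if any(term in tok for term in registered))
    return hits / len(tokens)


def blocked_by_ratio(text: str, threshold: float = 0.2, registered=REGISTERED) -> bool:
    """Variant: block only when the proportion exceeds the threshold."""
    return registered_ratio(text, registered) > threshold


page = "Visitor guide to Sussex and the South Downs"
```

With this sample text, plain substring matching blocks the page because "Sussex" contains a registered string, while the ratio variant (1 token in 8) lets it through under the assumed threshold.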

[0014] Next, the method using scores assigned to words is described. In this method, words likely to appear in inappropriate information are registered in advance together with a score for each word. The information is checked for occurrences of the registered words, the scores of the registered words found are summed, and provision of the information is blocked if the sum exceeds a predetermined threshold.

Problems to be Solved by the Invention

[0015] The main goal of automatic information filtering is to increase the rate at which inappropriate information is blocked while reducing the rate at which appropriate information is blocked by mistake. Each of the methods described above has its strengths and weaknesses, however, and conventional automatic information filtering cannot achieve sufficient filtering performance.

[0016] Specifically, with the conventional automatic filtering method, cases have been reported in which, for example, a web page about "Sussex" in England was blocked. In the conventional method using word scores, the choice of words and their scores is ad hoc, and users have had no guidance at all on which settings are most effective. This has caused performance problems: information that should be blocked gets through, and information that does not need to be blocked is blocked.

[0017] Suppose, for example, that the word "high school girl" is registered with a score of 40 on the assumption that it appears frequently in pornographic information. The expression "sample images of high school girls, free" contains "high school girl", so the score of the whole expression is 40. The expression "a bus carrying high school girls has an accident in Hokkaido" likewise scores 40, so the two expressions receive the same score. If the threshold is set to 20, the latter expression, which does not need to be blocked, is blocked; if the threshold is set to 50, the former expression, which should be blocked, is not. To distinguish the two expressions, scores would also have to be assigned to words such as "sample", "image", and "free" and to words such as "bus", "Hokkaido", and "accident". These words are in common everyday use, however, so it is unclear how their scores should be set. Performance varies greatly with the score settings, and the ability to decide whether an expression is inappropriate remains insufficient.
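The dilemma can be reproduced in a few lines of Python. This is only a sketch of the conventional scoring method as the paragraph describes it; the English glosses used as tokens and the function names are illustrative assumptions, and word segmentation is taken as given.

```python
SCORES = {"high school girl": 40}  # the single registered word and its score


def total_score(words, scores=SCORES):
    """Sum the scores of registered words; unregistered words score 0."""
    return sum(scores.get(w, 0) for w in words)


def is_blocked(words, threshold, scores=SCORES):
    """Block when the total score exceeds the threshold."""
    return total_score(words, scores) > threshold


ad = ["high school girl", "sample", "image", "free"]         # should be blocked
news = ["high school girl", "bus", "Hokkaido", "accident"]   # should pass

# Both expressions score 40, so no threshold can separate them.
same_decision_everywhere = all(
    is_blocked(ad, t) == is_blocked(news, t) for t in range(0, 101)
)
```

A threshold of 20 blocks both expressions and a threshold of 50 blocks neither, which is exactly the failure described above.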

[0018] The present invention has been made in view of the above. Its object is to provide an automatic information filtering method and apparatus that set word weights easily and accurately and use those weights to determine accurately whether information is inappropriate.

Means for Solving the Problems

[0019] To achieve the above object, the invention according to claim 1 is an automatic information filtering method that identifies inappropriate information among information provided via the Internet and blocks the provision of the identified inappropriate information. The method comprises: obtaining, by automatic learning that uses as learning data inappropriate information whose provision must be blocked and appropriate information whose provision need not be blocked, a weight for each word contained in the information, the weight being used to determine whether provision of information must be blocked; storing and managing the obtained weights, in association with the respective words, as a weighted word list; inputting information provided via the Internet; extracting the words contained in that information; reading the weight of each extracted word from the weighted word list; calculating the sum of the weights read; and determining from the calculated sum whether provision of the information should be blocked.

[0020] In the invention according to claim 1, word weights are obtained by automatic learning that uses as learning data inappropriate information whose provision must be blocked and appropriate information whose provision need not be blocked, and the obtained weights are stored and managed, in association with the respective words, as a weighted word list. The words contained in information provided via the Internet are extracted, the weight of each extracted word is read from the weighted word list, the sum of the weights read is calculated, and whether provision of the information should be blocked is determined from that sum. Word weights, which conventionally had to be set ad hoc, are thus obtained accurately by automatic learning, and these weights make it possible to determine accurately whether information is inappropriate and to block the provision of inappropriate information.

[0021] The invention according to claim 2 is the invention according to claim 1, wherein the process of obtaining the word weights obtains them by automatic learning based on a linear discriminant function that can separate the inappropriate documents from the appropriate documents in a vector space.

[0022] In the invention according to claim 2, the word weights are obtained by automatic learning based on a linear discriminant function that can separate inappropriate documents from appropriate documents in a vector space, so the word weights can be set accurately.

[0023] The invention according to claim 3 is an automatic information filtering apparatus that identifies inappropriate information among information provided via the Internet and blocks the provision of the identified inappropriate information. The apparatus comprises: word weight learning means for obtaining, by automatic learning that uses as learning data inappropriate documents whose provision must be blocked and appropriate documents whose provision need not be blocked, a weight for each word contained in the documents, the weight being used to determine whether provision of information must be blocked; weighted word list storage means for storing and managing the obtained weights, in association with the respective words, as a weighted word list; input means for inputting information provided via the Internet; word extraction means for extracting the words contained in the input information; and determination means for reading the weight of each extracted word from the weighted word list, calculating the sum of the weights read, and determining from the calculated sum whether provision of the information should be blocked.

[0024] In the invention according to claim 3, word weights are obtained by automatic learning that uses as learning data inappropriate documents whose provision must be blocked and appropriate documents whose provision need not be blocked, and the obtained weights are stored and managed, in association with the respective words, as a weighted word list. The words contained in information provided via the Internet are extracted, the weight of each extracted word is read from the weighted word list, the sum of the weights read is calculated, and whether provision of the information should be blocked is determined from that sum. Word weights, which conventionally had to be set ad hoc, are thus obtained accurately by automatic learning, and these weights make it possible to determine accurately whether information is inappropriate and to block the provision of inappropriate information.

[0025] The invention according to claim 4 is the invention according to claim 3, wherein the word weight learning means has means for obtaining the word weights by automatic learning based on a linear discriminant function that can separate the inappropriate documents from the appropriate documents in a vector space.

[0026] In the invention according to claim 4, the word weights are obtained by automatic learning based on a linear discriminant function that can separate inappropriate documents from appropriate documents in a vector space, so the word weights can be set accurately.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0027] Next, an automatic information filtering apparatus according to an embodiment of the present invention is described with reference to FIG. 1. The apparatus shown in the figure obtains word weights by automatic learning, uses those weights to determine whether information is inappropriate, and blocks the provision of inappropriate information. It comprises: an input unit 1 that inputs HTML information provided via the Internet; a word extraction unit 3 that extracts the words appearing in the information input via the input unit 1; a weighted word list learning unit 60 that, by automatic learning using as learning data documents that are inappropriate information whose provision must be blocked and documents that are appropriate information whose provision need not be blocked, obtains for the words contained in those documents the weights used to determine whether provision of information must be blocked; a weighted word list storage unit 50 that stores and manages the weights obtained by the weighted word list learning unit 60, in association with the respective words, as a weighted word list; an automatic filtering unit 30 that determines whether provision of the information input from the input unit 1 should be blocked, based on the words extracted by the word extraction unit 3 and the weights w obtained for those words from the weighted word list storage unit 50; and an output unit 40 that outputs the determination result obtained by the automatic filtering unit 30.

[0028] The automatic information filtering apparatus of this embodiment is characterized in that the weighted word list learning unit 60 obtains the word weights in advance by automatic learning and the apparatus uses the weights so obtained. The weights are learned with the word-weight learning algorithm shown in the flowchart of FIG. 2. In this algorithm, inappropriate information whose provision must be blocked and appropriate information whose provision need not be blocked are input to the weighted word list learning unit 60 as a set of learning data E = {d_1, ..., d_N}, and the word weights are obtained from a linear discriminant function that separates the inappropriate information from the appropriate information in a vector space. The procedure is as follows.

[0029] First, an HTML document input from the input unit 1 is represented by a vector space model. That is, n words are selected to represent all documents, and each document is represented as an n-dimensional vector as in the following equation.

[0030] (Equation 1)

Each element of this vector is the normalized frequency with which the corresponding word occurs in document d. The word frequencies are normalized with the TF*IDF method given by the following equation.

[0031] (Equation 2)

Here, tf_di is the frequency with which word i occurs in document d, N is the total number of documents, and df_i is the number of documents in which word i occurs.

[0032] Automatic filtering is performed with the linear discriminant function given by the following equation, which computes the sum Dis(d) of the word weights.

[0033] (Equation 3)

Here, w_i is the weight for word i, and f_di is the value given by equation (2) for word i in document d.
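The formulas referred to as (Equation 1) to (Equation 3) appear only as placeholders in this text. A reconstruction consistent with the surrounding definitions might read as follows. The vector name V_d is an assumption, and equation (2) is written in the common TF*IDF form; the exact normalization used in the original formula may differ.

```latex
\begin{align}
\mathbf{V}_d &= (f_{d1}, f_{d2}, \ldots, f_{dn}) \tag{1} \\
f_{di} &= tf_{di} \cdot \log\frac{N}{df_i} \tag{2} \\
\mathrm{Dis}(d) &= \sum_{i=1}^{n} w_i \, f_{di} \tag{3}
\end{align}
```

In this form, equation (3) is the inner product of the weight set W = (w_1, ..., w_n) with the document vector of equation (1).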

[0034] According to equation (3), the document is judged harmful if the sum Dis(d) is greater than 0 and harmless if it is 0 or less.

[0035] The weight for each word i is set so that Dis(d) > 0 when document d is harmful and Dis(d) ≤ 0 when it is harmless.
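The vector representation and the decision rule can be sketched in Python as follows. The TF*IDF form tf_di * log(N / df_i) is an assumption (the common variant), since the exact formula of equation (2) is not reproduced in the text; the function names, the toy vocabulary, and the weights are likewise illustrative only.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, vocabulary, doc_freq, n_docs):
    """Return (f_d1, ..., f_dn) for one document.

    Assumes f_di = tf_di * log(N / df_i); words that occur in no
    training document get an element of 0.
    """
    counts = Counter(doc_tokens)
    vec = []
    for word in vocabulary:
        df = doc_freq.get(word, 0)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(counts[word] * idf)
    return vec


def dis(weights, vec):
    """Linear discriminant function: Dis(d) = sum_i w_i * f_di."""
    return sum(w * f for w, f in zip(weights, vec))


def is_harmful(weights, vec):
    """Harmful when Dis(d) > 0, harmless when Dis(d) <= 0."""
    return dis(weights, vec) > 0


# Toy setting: 4 training documents, a 2-word vocabulary.
vocabulary = ["sample", "bus"]
doc_freq = {"sample": 1, "bus": 2}
weights = [1.0, -1.0]  # hypothetical learned weights

vec_a = tfidf_vector(["sample", "sample", "free"], vocabulary, doc_freq, 4)
vec_b = tfidf_vector(["bus"], vocabulary, doc_freq, 4)
```

Words outside the selected vocabulary ("free" here) simply do not contribute to the vector, so they cannot influence Dis(d).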

[0036] Next, the algorithm for learning the word weights is described with reference to the flowchart of FIG. 2. The word weights are learned with the perceptron learning algorithm (PLA).

[0037] In FIG. 2, various parameters are set first (step S51). The parameters are the set of word weights W = (w_1, ..., w_n), the N learning documents E = {d_1, ..., d_N}, a constant η, the maximum number of learning iterations Max, and the count m of iterations of the learning process of FIG. 2.

[0038] Then, from the words representing all the documents, the n most frequent words are selected (step S52).

[0039] Next, the set W of word weights is initialized (step S53) by assigning a random number to each word weight. The sum Dis(d) of the word weights is then calculated for every learning document using equation (3) (step S55).

[0040] The results are then checked to see whether Dis(d) ≤ 0 holds for every harmless document d and Dis(d) > 0 holds for every harmful document d (step S57). If so, processing ends. If not, the degree of weight change S is corrected for every document d that was misclassified, as shown in steps S61 and S63 (step S59).

[0041] Specifically, in step S61, when document d_i is harmful and Dis(d_i) ≤ 0, the degree of weight change S is corrected so as to increase; in step S63, when document d_i is harmless and Dis(d_i) > 0, S is corrected so as to decrease.

[0042] The set W of word weights is then corrected with the degree of weight change S, according to the formula shown in step S65. The learning count m is incremented by 1 (step S67) and compared with the maximum number of learning iterations Max (step S69). If m is smaller than Max, processing returns to step S55, and the processing from step S55 onward is repeated until the condition of step S57 is satisfied. The result is a set of word weights for the n words.
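The learning loop can be sketched in Python as below. The step numbers in the comments follow the description of FIG. 2, but the concrete update rules are assumptions: the formulas of steps S61, S63, and S65 are given only in the figure, so the sketch uses the standard batch perceptron rule (S accumulates +f_d for misclassified harmful documents and -f_d for misclassified harmless ones, then W is updated as W + ηS). The function names, the default η, the random range, and the toy data are all illustrative.

```python
import random


def dis(weights, vec):
    """Dis(d) = sum_i w_i * f_di."""
    return sum(w * f for w, f in zip(weights, vec))


def train_pla(vectors, labels, eta=0.1, max_iter=1000, seed=0):
    """Learn word weights; labels[k] is True when vectors[k] is harmful."""
    rng = random.Random(seed)
    n = len(vectors[0])
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # S53: random init
    for _ in range(max_iter):                               # S67/S69: bounded loop
        s = [0.0] * n                                       # degree of weight change S
        misclassified = False
        for vec, harmful in zip(vectors, labels):           # S55: compute Dis(d)
            score = dis(weights, vec)
            if harmful and score <= 0:                      # S61: increase S (assumed +f_d)
                s = [a + f for a, f in zip(s, vec)]
                misclassified = True
            elif not harmful and score > 0:                 # S63: decrease S (assumed -f_d)
                s = [a - f for a, f in zip(s, vec)]
                misclassified = True
        if not misclassified:                               # S57: all documents correct
            break
        weights = [w + eta * a for w, a in zip(weights, s)]  # S65: assumed W + eta * S
    return weights


# Toy, linearly separable data over a 2-word vocabulary.
vectors = [[1.0, 0.0], [2.0, 0.5], [0.0, 1.0], [0.5, 2.0]]
labels = [True, True, False, False]
learned = train_pla(vectors, labels)
```

Since Dis(d) has no bias term, the separating boundary passes through the origin; the toy data above is chosen so that such a boundary exists. If the data is not separable, the loop simply stops after `max_iter` iterations and returns the last weights.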

[0043] The weight of each word obtained by the weighted word list learning unit 60 is stored, in association with the word, as a weighted word list in the weighted word list storage unit 50. Table 1 below shows the weighted word list stored in the weighted word list storage unit 50; a word weight w is stored for each word.

[0044] (Table 1)

Next, the process of determining whether information provided from the Internet is inappropriate, based on the word weights obtained by the weighted word list learning unit 60 and stored in the weighted word list storage unit 50, is described.

[0045] In FIG. 1, the word extraction unit 3 collates the information from the Internet, input via the input unit 1, against the word list stored in the weighted word list storage unit 50 and obtains the words that appear in the input information together with their frequencies of occurrence. At the same time, the weight w of each word that appears is obtained from the weighted word list storage unit 50, and the words, their frequencies, and their weights are supplied to the automatic filtering unit 30. From the weights w and the frequencies of occurrence, the automatic filtering unit 30 calculates the sum of the weights w over all the words that appear in the input information and compares the sum with a predetermined threshold. If the sum is larger than the threshold, the information is judged inappropriate; if it is smaller, the information is judged appropriate. The result is output from the output unit 40.
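The flow from word extraction to judgment can be sketched in Python as below. Two points are assumptions: the text does not state exactly how frequency enters the sum, so the sketch multiplies each weight by the word's count, and a sum equal to the threshold is treated as appropriate. The word list, tokens, and function names are hypothetical, and word segmentation is taken as given.

```python
from collections import Counter


def extract_words(tokens, word_list):
    """Word extraction: keep tokens found on the weighted word list, with counts."""
    return Counter(t for t in tokens if t in word_list)


def weight_total(tokens, word_list):
    """Sum of weights over all listed words, assuming weight * frequency."""
    found = extract_words(tokens, word_list)
    return sum(word_list[w] * count for w, count in found.items())


def judge(tokens, word_list, threshold=0.0):
    """Automatic filtering: compare the total with the threshold."""
    total = weight_total(tokens, word_list)
    return "inappropriate" if total > threshold else "appropriate"


# Hypothetical weighted word list and two tokenized inputs.
word_list = {"alpha": 2.0, "beta": -1.5}
doc_1 = ["alpha", "beta", "alpha", "gamma"]   # 2.0 * 2 - 1.5 = 2.5
doc_2 = ["beta", "gamma"]                     # -1.5
```

Tokens that are not on the list ("gamma" above) are ignored, which mirrors the collation of the input against the stored word list.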

[0046] A concrete example follows. As shown in Table 1, the weighted word list learning unit 60 has obtained, from previously input learning data, a weight of 10.9 for "image" (画像), 18.7 for "sample" (サンプル), −16.6 for "accident" (事故), 82.2 for "high school girl" (女子高生), −101.9 for "bus" (バス), −112.5 for "Hokkaido" (北海道), and −6.3 for "free" (無料), and these weights are stored in the weighted word list storage unit 50. Using these values, for the expression "a bus carrying high school girls has an accident in Hokkaido" (女子高生の乗ったバスが北海道で事故) as a whole, the automatic filtering unit 30 sums the weights of the words: 82.2 − 101.9 − 112.5 − 16.6 = −148.8. For the expression "sample images of high school girls, free" (女子高生のサンプル画像、無料) as a whole, the automatic filtering unit 30 likewise sums the weights: 82.2 + 18.7 + 10.9 − 6.3 = 105.5. With the threshold set to 0, as in the processing of FIG. 2, the first expression falls below the threshold, so provision of the information is not blocked, while the second expression exceeds the threshold, so provision of the information is blocked. Both expressions are thus judged correctly.
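The arithmetic in this example can be checked with a short Python sketch. It is an illustration only, not the patent's implementation: the names `score` and `is_inappropriate` are hypothetical, word extraction is simplified to substring counting against the word list, and multiplying each weight by its frequency of appearance is one assumed reading of how weight and frequency are combined. The weights are the ones quoted in the paragraph above.

```python
# Word weights quoted in paragraph [0046] (taken there from Table 1).
WEIGHTS = {
    "画像": 10.9,      # image
    "サンプル": 18.7,  # sample
    "事故": -16.6,     # accident
    "女子高生": 82.2,  # high school girl
    "バス": -101.9,    # bus
    "北海道": -112.5,  # Hokkaido
    "無料": -6.3,      # free
}


def score(text, weights=WEIGHTS):
    """Sum weight * frequency over every listed word found in the text.

    Substring counting stands in for the word extraction unit; this is an
    assumption made for the sketch, not the patent's extraction method.
    """
    return sum(w * text.count(word) for word, w in weights.items())


def is_inappropriate(text, threshold=0.0, weights=WEIGHTS):
    """Return True when the total weight is larger than the threshold."""
    return score(text, weights) > threshold


if __name__ == "__main__":
    for text in ("女子高生の乗ったバスが北海道で事故", "女子高生のサンプル画像、無料"):
        print(text, round(score(text), 1), is_inappropriate(text))
```

With the threshold at 0, the first expression totals −148.8 and is allowed, while the second totals 105.5 and is blocked, which matches the figures given in the text.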

[0047] Next, an automatic filtering device according to another embodiment of the present invention will be described with reference to FIGS. 4 and 6. The automatic filtering device shown in FIG. 4 adds, to the automatic information filtering device 25 that creates a word list by the learning described with reference to FIG. 6, a third-party determination filtering processing unit 23 and a harmful URL list table storage unit 17 that the third-party determination filtering processing unit 23 uses to look up harmful URLs.

[0048] The harmful URL list table storage unit 17 stores the URLs that provide harmful information as a harmful URL list table. The third-party determination filtering processing unit 23 collates the URL of the HTML document input from the input unit 1 with each URL registered in the harmful URL list table of the harmful URL list table storage unit 17, and determines whether there is a matching URL.

[0049] FIG. 6 is a block diagram showing a more detailed configuration of the automatic filtering device shown in FIG. 4. In addition to the input unit 1, word extraction unit 3, weighted word list storage unit 50, automatic filtering unit 30, and output unit 40, which constitute the automatic information filtering device that uses the weighted word list created by the learning shown in FIG. 6, the automatic filtering device shown in FIG. 6 has a URL-list-based filtering unit 15, which corresponds to the third-party determination filtering processing unit 23 of FIG. 4, and the harmful URL list table storage unit 17.

[0050] In the filtering process performed by the automatic filtering device configured in this way, that is, by an automatic information filtering device that uses both the URL list of the third-party determination filtering processing unit and the weighted word list created by learning, the URL of an HTML document input via the Internet 21 is first collated with each URL registered in the harmful URL list table of the harmful URL list table storage unit 17 to determine whether there is a matching URL. If the URL matches a URL registered in the harmful URL list table of the harmful URL list table storage unit 17, presentation of the information indicated by that URL is blocked.

[0051] If the determination made by the URL-list-based filtering unit 15 with reference to the harmful URL list table finds no match with any URL registered in the harmful URL list table of the harmful URL list table storage unit 17, filtering by the automatic information filtering device 25, which uses the weighted word list created by learning, is performed as described with reference to FIG. 6.

[0052] As described above, in this embodiment both filtering based on third-party determination and filtering using the weighted word list created by learning are performed, so harmful information can be accurately detected and blocked.
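The two-stage flow described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `filter_document` is a hypothetical name, URL matching is assumed to be an exact string comparison against a set, word extraction is simplified to substring counting, and the URLs and the English word list in the usage lines are invented for the example.

```python
def filter_document(url, text, harmful_urls, weights, threshold=0.0):
    """Return True if the document should be blocked.

    Stage 1: look the URL up in the harmful URL list (exact match assumed).
    Stage 2: only when no URL matches, sum weight * frequency over the
    listed words and block if the total is larger than the threshold.
    """
    if url in harmful_urls:
        return True
    total = sum(w * text.count(word) for word, w in weights.items())
    return total > threshold


if __name__ == "__main__":
    harmful = {"http://harmful.example/a.html"}  # hypothetical list entry
    weights = {"sample": 18.7, "image": 10.9, "bus": -101.9}  # illustrative only
    # Blocked at stage 1, regardless of the text.
    print(filter_document("http://harmful.example/a.html", "bus timetable", harmful, weights))
    # Not on the URL list; blocked at stage 2 because the total is positive.
    print(filter_document("http://other.example/", "sample image", harmful, weights))
    # Not on the URL list; allowed because the total is negative.
    print(filter_document("http://other.example/", "bus timetable", harmful, weights))
```

Checking the URL list first means the word-weight stage only has to handle documents that no third party has already classified as harmful.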

[0053]

[Effects of the Invention] As described above, according to the present invention, word weights are obtained by automatic learning that uses, as learning data, inappropriate information whose provision needs to be blocked and appropriate information whose provision does not need to be blocked. These word weights are stored and managed as a weighted word list in association with the respective words. Words contained in information provided via the Internet are extracted, the weight of each extracted word is read from the weighted word list, the sum of the word weights is calculated, and whether provision of the information should be blocked is determined based on this sum. Word weights that previously had to be set ad hoc are therefore obtained accurately by automatic learning, and by using these weights it is possible to determine accurately, and with high performance, whether information is inappropriate and to block the provision of inappropriate information.
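This section does not spell out the learning procedure itself. As one illustration of how word weights could be learned automatically from labelled examples with a linear decision rule, the sketch below uses the classic perceptron update. The choice of the perceptron, the toy English training data, the zero threshold without a bias term, and the names `learn_weights` and `total_weight` are all assumptions made for the example, not details taken from the patent.

```python
def total_weight(text, weights):
    """Sum weight * frequency over the vocabulary (substring counting)."""
    return sum(w * text.count(word) for word, w in weights.items())


def learn_weights(examples, vocabulary, epochs=100, rate=1.0):
    """Learn one weight per word with the perceptron rule.

    examples: list of (text, is_inappropriate) pairs.
    A text is predicted inappropriate when its total weight exceeds 0.
    """
    weights = {word: 0.0 for word in vocabulary}
    for _ in range(epochs):
        mistakes = 0
        for text, is_inappropriate in examples:
            predicted = total_weight(text, weights) > 0
            if predicted != is_inappropriate:
                # Push weights up for missed inappropriate texts,
                # down for appropriate texts that were wrongly flagged.
                sign = 1.0 if is_inappropriate else -1.0
                for word in vocabulary:
                    weights[word] += rate * sign * text.count(word)
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights


if __name__ == "__main__":
    vocabulary = ["free", "sample", "image", "bus", "accident"]
    examples = [
        ("free sample image", True),
        ("bus accident", False),
        ("free bus", False),
        ("sample image", True),
    ]
    learned = learn_weights(examples, vocabulary)
    for text, label in examples:
        print(text, total_weight(text, learned), label)
```

On linearly separable training data such as this toy set, the perceptron rule stops once every example falls on the correct side of the threshold; the resulting dictionary plays the role of the weighted word list.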

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an automatic information filtering device according to another embodiment of the present invention.

FIG. 2 is a flowchart showing the procedure for setting the word weights used in the flowchart shown in FIG. 1.

FIG. 3 is an explanatory diagram showing the schematic configuration of an automatic filtering device according to another embodiment of the present invention.

FIG. 4 is a diagram for explaining conventional filtering based on self-rating.

FIG. 5 is a diagram showing example descriptions of rating results by RSACi and SafeSurf, as an example of the filtering based on self-rating shown in FIG. 4.

FIG. 6 is a diagram for explaining conventional harmful information filtering based on determination by a third party.

[Explanation of symbols]

1 Input unit
3 Word extraction unit
30 Automatic filtering unit
50 Weighted word list storage unit
60 Weighted word list learning unit


[Written amendment]

[Submission date] August 18, 1999

[Amendment 1]

[Document to be amended] Specification

[Item to be amended] 0027

[Method of amendment] Change

[Content of amendment]

[0027]

[Embodiments of the Invention] Next, an automatic information filtering device according to an embodiment of the present invention will be described with reference to FIG. 1. The automatic information filtering device shown in FIG. 1 obtains word weights by automatic learning, uses these weights to determine whether information is inappropriate, and blocks the provision of inappropriate information. It comprises: an input unit 1 that inputs HTML information provided via the Internet; a word extraction unit 3 that extracts the words appearing in the information input through the input unit 1; a weighted word list learning unit 60 that, by automatic learning using as learning data documents that are inappropriate information whose provision needs to be blocked and documents that are appropriate information whose provision does not need to be blocked, obtains for the words contained in those documents the word weights used to determine whether provision of information needs to be blocked; a weighted word list storage unit 50 that stores and manages the word weights obtained by the weighted word list learning unit 60 as a weighted word list in association with the respective words; an automatic filtering unit 30 that determines whether provision of the information input from the input unit 1 should be blocked, based on the words extracted by the word extraction unit 3 and the word weights w obtained for those words from the weighted word list storage unit 50; and an output unit 40 that outputs the determination result obtained by the automatic filtering unit 30.

[Amendment 2]

[Document to be amended] Specification

[Item to be amended] 0047

[Method of amendment] Change

[Content of amendment]

[0047] Next, an automatic information filtering device according to another embodiment of the present invention will be described with reference to FIG. 3. The automatic information filtering device shown in FIG. 3 differs in that a third-party determination filtering processing unit 23, and a harmful URL list table storage unit 17 that the third-party determination filtering processing unit 23 uses to look up harmful URLs, are added to the automatic information filtering device that creates a word list by the learning described with reference to FIG. 1.

[Amendment 3]

[Document to be amended] Specification

[Item to be amended] 0048

[Method of amendment] Change

[Content of amendment]

[0048] In FIG. 3, the harmful URL list table storage unit 17 stores the URLs that provide harmful information as a harmful URL list table. The third-party determination filtering processing unit 23 collates the URL of the HTML document input through the input unit 1 shown in FIG. 1 with each URL registered in the harmful URL list table of the harmful URL list table storage unit 17, and determines whether there is a matching URL.

[Amendment 4]

[Document to be amended] Specification

[Item to be amended] 0049

[Method of amendment] Change

[Content of amendment]

[0049] That is, the automatic information filtering device shown in FIG. 3 combines two functions: filtering with the weighted word list created by learning, which is provided by the input unit 1, word extraction unit 3, weighted word list storage unit 50, automatic filtering unit 30, and output unit 40 shown in FIG. 1, and filtering by URL, which is provided by adding the third-party determination filtering processing unit 23 and the harmful URL list table storage unit 17 shown in FIG. 3.

[Amendment 5]

[Document to be amended] Specification

[Item to be amended] 0050

[Method of amendment] Change

[Content of amendment]

[0050] In the automatic information filtering process performed by the device configured in this way, that is, a process that uses both the harmful URL list accumulated in the harmful URL list table storage unit 17 of the third-party determination filtering processing unit 23 and the weighted word list created by learning, the URL of an HTML document input via the Internet 21 is first collated with each URL registered in the harmful URL list table of the harmful URL list table storage unit 17 to determine whether there is a matching URL. If the URL matches a URL registered in the harmful URL list table of the harmful URL list table storage unit 17, presentation of the information indicated by that URL is blocked.

[Amendment 6]

[Document to be amended] Specification

[Item to be amended] 0051

[Method of amendment] Change

[Content of amendment]

[0051] If the determination made by the URL-list-based filtering unit 15 with reference to the harmful URL list table finds no match with any URL registered in the harmful URL list table of the harmful URL list table storage unit 17, filtering is performed by the automatic information filtering device 25, which uses the weighted word list created by learning.

[Amendment 7]

[Document to be amended] Drawings

[Item to be amended] FIG. 1

[Method of amendment] Change

[Content of amendment]

[FIG. 1]

Continuation of front page

(72) Inventor: Kazuo Hashimoto, c/o KDD Laboratories Inc. (株式会社ケイディディ研究所), 2-1-15 Ohara, Kamifukuoka-shi, Saitama

F-terms (reference): 5B075 KK07 KK13 KK33 KK54 KK70 ND03 NR02 NR12 QM10 UU40

Claims

[Claims]

1. An automatic information filtering method for identifying inappropriate information among information provided via the Internet and preventing provision of the identified inappropriate information, the method comprising: obtaining, by automatic learning that uses as learning data inappropriate information whose provision needs to be prevented and appropriate information whose provision does not need to be prevented, weights of the words contained in the information, the weights being used to determine whether provision of information needs to be prevented; storing and managing the obtained word weights as a weighted word list in association with the respective words; inputting information provided via the Internet and extracting the words contained in this information; reading the weight of each extracted word from the weighted word list; calculating the sum of the weights of the words thus read; and determining, based on the calculated sum, whether provision of the information should be prevented.

2. The automatic information filtering method according to claim 1, wherein the processing for obtaining the word weights obtains them by automatic learning based on a linear discriminant function capable of discriminating between the inappropriate information and the appropriate information in a vector space.

3. An automatic information filtering device for identifying inappropriate information among information provided via the Internet and preventing provision of the identified inappropriate information, the device comprising: word weight learning means for obtaining, by automatic learning that uses as learning data inappropriate information whose provision needs to be prevented and appropriate information whose provision does not need to be prevented, weights of the words contained in the information, the weights being used to determine whether provision of information needs to be prevented; weighted word list storage means for storing and managing the obtained word weights as a weighted word list in association with the respective words; input means for inputting information provided via the Internet; word extraction means for extracting the words contained in the input information; and determination means for reading the weight of each extracted word from the weighted word list, calculating the sum of the weights of the words thus read, and determining, based on the calculated sum, whether provision of the information should be prevented.

4. The automatic information filtering device according to claim 3, wherein the word weight learning means includes means for obtaining the word weights by automatic learning based on a linear discriminant function capable of discriminating between the inappropriate information and the appropriate information in a vector space.