JP2005149124A

JP2005149124A - Electronic message filter system and computer program

Info

Publication number: JP2005149124A
Application number: JP2003385511A
Authority: JP
Inventors: Tanev Ivan; イヴァン・タネヴ
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-11-14
Filing date: 2003-11-14
Publication date: 2005-06-09

Abstract

PROBLEM TO BE SOLVED: To provide a mail filter system capable of filtering out and deleting, with high probability, spam messages that contain words with altered spellings.

SOLUTION: The mail filter system 46 includes an object connector 72 that provides a connection to a thesaurus 48 used for spell checking; a token lookup module 60 that reads an e-mail message and determines, via the connection provided by the object connector 72, whether each token in the e-mail message is defined in the thesaurus 48; and processing modules 62, 64 that classify the e-mail message as spam or not in accordance with the result of the token lookup.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a mail filter system, and more particularly to a mail filter system that filters out unwanted (spam) electronic mail (e-mail) and messages without the need for explicit training.

E-mail over the Internet is a very convenient communication tool. Sending an e-mail costs almost nothing, and it lets a user reach a friend on the other side of the globe in an instant.

Unfortunately, e-mail is also a very convenient way to distribute inappropriate commercial messages, which are referred to as “spam”. An astronomical volume of spam is sent over networks today. Because sending spam costs the sender (the spammer) almost nothing, the volume of spam sent over networks is expected to keep growing.

Spam causes enormous economic losses. Worldwide losses due to spam are expected to grow from $20.5 billion in 2003 to more than $198 billion in 2007. These losses arise from wasted bandwidth and CPU time, reduced security of networks and of the targeted computers, and the lost time and damaged reputation of spam recipients.

Many efforts have been made to block or filter out spam. The two most advanced kinds of spam filter are those based on Bayesian learning and those based on hash blacklists.

In the former approach, a message is classified as spam by estimating its spam probability from the learned spam-related probabilities of the tokens (words) it contains. The estimate follows the Bayes rule referred to below as equation (1).

For example, if the spam probability of the word “money” (that is, P(a)) is 0.6 and that of the word “confidential” (that is, P(b)) is 0.8, then by equation (1) the spam probability of a message containing both words (that is, P(a, b)) is 0.86.
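Equation (1) itself is not reproduced in this text. One combining rule that is consistent with the worked example (0.6 and 0.8 giving about 0.86) is the naive-Bayes combination with equal priors, P(a, b) = P(a)P(b) / (P(a)P(b) + (1 - P(a))(1 - P(b))). This is an assumption inferred from the numbers, not a quotation of the patent's equation. A minimal Python sketch under that assumption (the function name `combine_spam_probabilities` is hypothetical):

```python
from math import prod


def combine_spam_probabilities(probs):
    """Combine per-token spam probabilities.

    Assumes the tokens are independent and that spam and non-spam
    are equally likely a priori (an assumption, see the lead-in).
    """
    spam = prod(probs)                    # P(a) * P(b) * ...
    ham = prod(1.0 - p for p in probs)    # (1 - P(a)) * (1 - P(b)) * ...
    return spam / (spam + ham)


# "money" (0.6) and "confidential" (0.8) appearing in the same message
print(round(combine_spam_probabilities([0.6, 0.8]), 2))  # 0.86
```

The intermediate values are 0.48 / (0.48 + 0.08), which rounds to the 0.86 quoted in the example.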

A hash blacklist works by converting the entire text of an incoming e-mail into a single number (a signature) through a process called “hashing”. These numbers (hash values) are simple enough to compare quickly, yet distinctive enough to be useful. Each value is compared against a list of hash values of known spam, and any message that matches a “spam hash” is discarded.
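The hash blacklist idea can be sketched in a few lines of Python. The text does not name a hash function, so the use of SHA-256 here, the sample messages, and the names `signature` and `is_blacklisted` are all assumptions made for illustration.

```python
import hashlib


def signature(text: str) -> str:
    """Reduce the whole message text to one hash value (the signature)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical blacklist holding the signature of one known spam message.
known_spam_hashes = {signature("Make money fast, strictly confidential offer")}


def is_blacklisted(message: str) -> bool:
    """A message is discarded only if its signature matches exactly."""
    return signature(message) in known_spam_hashes


print(is_blacklisted("Make money fast, strictly confidential offer"))  # True
print(is_blacklisted("Make money fast, strictly conf1dential offer"))  # False
```

The second call differs from the blacklisted message by a single character, yet it produces a different signature and is not matched.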

Another example of a spam filter system is disclosed in Patent Document 1. That system holds a list of words, each with a weight. For each received e-mail, it looks up the weights of the words in the e-mail and sums them. If the sum exceeds a predetermined threshold, the e-mail is judged to be spam and placed in a discard folder. The word weights are computed by “training”, in which the user manually classifies received e-mails. A word that is not yet in the list is assigned a predetermined initial value.
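The weighted-list scheme described for Patent Document 1 might look roughly like the following Python sketch. The weights, the initial value, the threshold, and all names are hypothetical; the training step that adjusts the weights is omitted.

```python
INITIAL_WEIGHT = 0.1   # assumed initial value for words not yet in the list
THRESHOLD = 1.0        # assumed spam threshold


def score(message, weights):
    """Sum the weights of the words in the message.

    Unknown words are added to the list with the initial weight.
    """
    total = 0.0
    for word in message.lower().split():
        if word not in weights:
            weights[word] = INITIAL_WEIGHT
        total += weights[word]
    return total


def is_spam(message, weights, threshold=THRESHOLD):
    return score(message, weights) > threshold


weights = {"money": 0.6, "confidential": 0.8}
print(is_spam("confidential money offer", weights))   # True
print(is_spam("conf1dential m0ney offer", weights))   # False
```

In the second call the altered words are not in the list, so each receives only the small initial weight and the sum stays below the threshold.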

Japanese Unexamined Patent Publication No. 2003-067304, FIGS. 1-3

Spam filter software (spam filters) and the increasingly sophisticated spam generation techniques of spam generation software (spam software) are locked in continuous co-evolution, an arms race. This creates technical problems to be overcome. The three methods described above for blocking or filtering out spam have the following problems.

The technique used by Bayesian filters shares a problem common in artificial intelligence: a machine generally cannot grasp the meaning of a message. A Bayesian filter does not understand a message; it judges the message to be spam from statistics on the occurrence of word combinations. Consequently, currently available spam filters often produce false positives (legitimate mail judged to be spam) and false negatives (spam judged to be legitimate).

Consider a Bayesian filter. A token that the filter encounters for the first time should have a neutral effect on its decision, so the filter is effectively back in the learning (training) state. Even if such a token naturally belongs to a message that is obviously spam, the filter cannot classify that message as spam. The situation becomes worse still if many copies of the same spam message arrive, each containing different new tokens.

For spammers, the obvious way to avoid matching tokens known to have a high spam probability is to alter the most important spam-related tokens in a message. For example, if the probability that the word “confidential” is part of spam is 0.8, any message containing it is likely to be classified as spam. But the word can be deliberately (and randomly) misspelled as “conf1dential” (“1” in place of “i”), “confidentia1” (“1” in place of “l”), “c0nfidential” (“0” in place of “o”), “confiden_tial” (a blank inserted), and so on. The number of such variants is enormous, and the important token “confidential” is replaced by a token that is entirely new and unknown to the Bayesian filter.
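A small Python sketch illustrates the effect. The probability table, the neutral value of 0.5 for unseen tokens, and the function name are assumptions for illustration; the text says only that an unknown token has a neutral effect.

```python
NEUTRAL = 0.5  # assumed probability assigned to a token never seen in training

# Hypothetical learned table of per-token spam probabilities.
learned = {"confidential": 0.8, "money": 0.6}


def token_spam_probability(token, table=learned):
    """Exact-match lookup: any altered spelling falls back to NEUTRAL."""
    return table.get(token.lower(), NEUTRAL)


variants = ["confidential", "conf1dential", "confidentia1", "c0nfidential"]
for v in variants:
    print(v, token_spam_probability(v))
```

Only the unaltered spelling retrieves the learned value of 0.8; every altered variant is treated as a new, neutral token.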

The same applies to the system disclosed in Patent Document 1. If spam software alters a heavily weighted word, the resulting new word is probably not in the list, so it is assigned the initial weight, which is not very large. Each time a newly altered token appears, the user must train the system again by manually classifying e-mails.

Making such alterations efficiently poses no serious technical difficulty for spammers. In fact, spammers may run the same Bayesian filters and train them on the same sets of typical spam messages. They therefore know approximately how strongly each word they intend to include in a message is associated with spam. As a result, words strongly associated with spam will be randomly altered and then broadcast to a huge number of potential recipients on the network.

In addition, altered tokens can evade hash blacklist filters, because a new token produces a new hash value (signature) even for content previously recognized as spam. Adding randomly generated, meaningless tokens to a spam message only further degrades the ability of a filter that relies solely on the message signature to classify spam correctly.

Accordingly, one object of the present invention is to provide a mail filter system that can, with high probability, filter out and delete messages containing altered spam-related words.

Another object of the present invention is to provide a mail filter system that can, with high probability, filter out and delete messages containing altered spam-related words without any explicit training.

An electronic message filter system according to the present invention includes: connection means for providing a connection to a predetermined dictionary used for spell checking; reading means for reading an electronic message; means for determining, via the connection provided by the connection means, whether each token in the electronic message read by the reading means has an entry in the dictionary; and processing means for processing the electronic message in accordance with the results obtained by the means for determining for the tokens in the electronic message.

Preferably, the processing means includes: means for obtaining the rate of tokens in the electronic message that have no entry in the dictionary, relative to all tokens in the electronic message; determination means for determining whether the rate of tokens without an entry in the dictionary satisfies a predetermined condition; and means for classifying the electronic message in accordance with the determination made by the determination means.

The determination means may include means for determining whether the rate of tokens without an entry in the dictionary exceeds a predetermined threshold.

The means for classifying may include first classification means for classifying an electronic message whose token rate exceeds the predetermined threshold as an unwanted message, and second classification means for classifying a message whose token rate does not exceed the predetermined threshold as a normal message.

Preferably, the first classification means includes means for inserting predetermined text into an electronic message whose token rate exceeds the predetermined threshold.

The electronic message may include a subject part and a body part, and the means for inserting may include means for inserting the predetermined text into the subject part of an electronic message whose token rate exceeds the predetermined threshold.

Preferably, the first classification means includes means for sending the electronic message to a predetermined first destination, and the second classification means includes means for sending the electronic message to a predetermined second destination different from the first destination.

Preferably, the means for determining includes means for determining, for each word of the electronic message and via the connection provided by the connection means, whether the word has an entry in the dictionary.

The system further includes means for determining, for each character of the electronic message, whether the character belongs to a word without an entry in the dictionary. The processing means includes: means for obtaining the rate of words in the electronic message that have no entry in the dictionary, relative to all words in the electronic message; means for obtaining the rate of characters contained in words in the electronic message that have no entry in the dictionary; means for determining whether the word rate and the character rate satisfy a predetermined condition; and means for classifying the electronic message in accordance with the determination.

Preferably, the determination means includes means for determining whether the word rate exceeds a predetermined threshold and the character rate also exceeds a predetermined threshold.

The means for classifying may include first classification means for classifying an electronic message whose word rate and character rate both exceed their respective thresholds as an unwanted message, and second classification means for classifying an electronic message whose word rate or character rate does not exceed the corresponding threshold as a normal message.

Preferably, the first classification means includes means for inserting predetermined text into an electronic message whose word rate and character rate both exceed their respective thresholds.

The electronic message may include a subject part and a body part, and the means for inserting may include means for inserting the predetermined text into the subject part of an electronic message whose word rate and character rate both exceed their respective thresholds.

Preferably, the means for determining includes means for determining, for each word of the electronic message and via the connection provided by the connection means, whether the word has an entry in the dictionary. The system further includes means for determining, for each character in the electronic message, whether the character belongs to a word without an entry in the dictionary. The processing means includes: means for obtaining the rate of characters contained in words in the electronic message that have no entry in the dictionary; determination means for determining whether the character rate satisfies a predetermined condition; and means for classifying the electronic message in accordance with the determination made by the determination means.

The determination means may include means for determining whether the character rate exceeds a predetermined threshold.

The classification means may include first classification means for classifying an electronic message whose character rate exceeds the threshold as an unwanted message, and second classification means for classifying an electronic message whose character rate does not exceed the threshold as a normal message.

A second aspect of the present invention relates to a computer-executable program that, when executed on a computer, causes the computer to perform all of the functions of any of the electronic message filter systems described above.

-Overview-
The mail filter system of the present invention is based on an approach for dealing with advanced spam e-mail messages that attempt to defeat state-of-the-art Bayesian and hash blacklist spam filters. The approach rests on the idea that altered tokens (words), the kind associated with a higher spam probability, can be detected with a spell-checking software agent.

In a spam message, the number of tokens and characters that cannot be found in a spell-checking thesaurus is far larger than in a normal e-mail message. The spam-filtering spell-checking software of this embodiment (Spam Filtering Spell-Checking Software: SFSCA) uses, as proposed here, the dictionary (thesaurus) that is routinely available, with an easy-to-use user interface, in office environments with many computers. The benefits are clear: the SFSCA relies on very common software connectivity, requires no additional third-party spell-checking dictionary database, and is easy to customize. The SFSCA according to this embodiment can be used as stand-alone spam filter software, or as a plug-in that further enhances a Bayesian or hash blacklist filter.

Experience suggests that altering spam-related tokens is the most effective way to evade state-of-the-art spam filters. We therefore considered how to find such altered tokens and compute the rate at which they occur in an incoming mail message.

Because altered tokens are random and, most importantly, differ each time even the same message arrives, the question is what common feature they share that could serve as a criterion for finding them. The system of this embodiment relies on the fact that, however random the altered tokens are, they share one feature: none of them is recognized as a correct word by a spell-checking routine. The system uses the thesaurus (word) dictionary of the spell-checking dictionaries widely available in office products.

This approach has the following advantages.

i) Most desktop computers run one of the widely adopted office software environments.

ii) The thesaurus of an office software environment is highly customizable: specific terms can be added freely, and special spell-checking settings can be adjusted individually.

iii) Users usually spell-check the messages they send, which further reduces the risk of misclassification.

iv) The thesaurus is usually implemented with component object technology, so invoking it from another application (for example, the SFSCA of this embodiment) should be supported in a general way.

-System structure according to the embodiment-
Referring to FIG. 1, the SFSCA 46 of this embodiment is implemented as a software agent that operates in an environment defined by an e-mail agent 42 having an inbox 44 and by a thesaurus 48 used for spell checking, which is accessible through an object connection function 50 of the operating system. The inbox 44 stores incoming mail that the e-mail agent 42 retrieves from an e-mail server 40.

FIG. 2 is a detailed block diagram of the SFSCA 46. Referring to FIG. 2, the SFSCA 46 includes: an object connector 72 that provides a connection to the thesaurus 48; a token lookup module 60 that, via the object connector 72 and the object connection function 50, looks up every token of each unread e-mail message stored in the inbox 44 in the thesaurus 48 and determines whether the token is listed there; and a calculation module 62 that, based on the output of the token lookup module 60, calculates the rate of words not listed in the thesaurus 48. The calculation module 62 also calculates the rate of characters in the message's words that are not defined in the thesaurus 48 relative to the characters in the message's words that are listed in it.
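The lookup and rate calculation can be sketched in Python as follows. A plain `set` of lowercase words stands in for the thesaurus 48, and the regular-expression tokenizer is a simplification; both are assumptions, as is the function name `undefined_rates`. The character rate here is taken over all characters in the message's words, which is one reading of the text.

```python
import re


def undefined_rates(message, dictionary):
    """Return (word_rate, char_rate) for words not found in the dictionary.

    word_rate: undefined words / all words
    char_rate: characters in undefined words / characters in all words
    """
    tokens = re.findall(r"[A-Za-z0-9]+", message)  # simplistic tokenizer
    if not tokens:
        return 0.0, 0.0
    undefined = [t for t in tokens if t.lower() not in dictionary]
    word_rate = len(undefined) / len(tokens)
    char_rate = sum(len(t) for t in undefined) / sum(len(t) for t in tokens)
    return word_rate, char_rate


# Hypothetical stand-in for the spell-checking thesaurus.
dictionary = {"this", "offer", "is", "strictly"}
print(undefined_rates("This c0nfidential offer is strictly conf1dential", dictionary))
```

In this example two of the six words are undefined, and they account for 24 of the 43 characters.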

The SFSCA 46 further includes: a work memory 70 connected to the calculation module 62 for storing the rates it calculates; a determination module 64 that determines, from the rates stored in the work memory 70, whether an e-mail is spam; and two memories 66 and 68 for storing a predetermined threshold rate for undefined words (TH₁) and a predetermined threshold rate for characters (TH₂), respectively. The threshold rates stored in the memories 66 and 68 can be modified manually.

In addition, the determination module 64 can insert a specific keyword (for example, “spam”) as a mark into the header of an e-mail message judged to be spam, so that the e-mail agent 42 can filter messages easily. The modules 62 and 64 and the memories 66, 68, and 70 process each e-mail according to the result of the token lookup by the token lookup module 60; that is, the modules 62 and 64 classify each e-mail as spam (an unwanted e-mail message) or as a normal message.

The determination module 64 of this embodiment judges an e-mail to be spam when the rate of undefined words exceeds the threshold stored in the memory 66 and the rate of characters in undefined words exceeds the threshold stored in the memory 68.
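The two-threshold decision can be sketched in Python as follows. The class and function names are hypothetical, the 20% defaults are the values this embodiment uses, and the bracketed subject prefix is only one possible way to insert the “spam” mark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    word_rate: float = 0.20   # TH1: threshold rate for undefined words
    char_rate: float = 0.20   # TH2: threshold rate for characters in undefined words


def classify(word_rate, char_rate, th=Thresholds()):
    """Spam only if BOTH rates strictly exceed their thresholds."""
    if word_rate > th.word_rate and char_rate > th.char_rate:
        return "spam"
    return "normal"


def mark_subject(subject, label, mark="spam"):
    """Prefix the subject with the mark when the message is classified as spam."""
    return f"[{mark}] {subject}" if label == "spam" else subject


print(classify(0.46, 0.45))  # spam
print(classify(0.03, 0.05))  # normal
print(classify(0.25, 0.10))  # normal: only one rate exceeds its threshold
```

Requiring both rates to exceed their thresholds makes the rule conservative: a short message with one long unknown word, or a long message with many short unknown words, is not flagged on that basis alone.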

The thresholds can be set according to the user's environment. In this embodiment, an e-mail message is judged to be spam if the rate of undefined words exceeds 20% and the rate of characters in undefined words also exceeds 20%. If either rate does not exceed 20%, the e-mail is judged not to be spam.

-Operation-
The SFSCA 46 according to this embodiment operates as follows. The e-mail agent 42 retrieves messages from the e-mail server 40 and stores them in the inbox 44. At a pre-scheduled time, or in response to a manual trigger signal, the SFSCA 46 reads the unread e-mails in the inbox 44.

Referring to FIG. 2, for each unread e-mail message, the token lookup module 60 looks up each word in the thesaurus 48 to check whether the word is defined there, and outputs the number of undefined words to the calculation module 62. The calculation module 62 counts the words and characters in the e-mail message and, in response to the output of the token lookup module 60, calculates the number of characters in undefined words, the rate of undefined words, and the rate of characters in undefined words.

The determination module 64 reads the calculated rates from the work memory 70. It first compares the calculated rate of undefined words with the threshold stored in the memory 66. If that rate exceeds the threshold, it compares the calculated rate of characters in undefined words with the threshold stored in the memory 68. If that rate also exceeds its threshold, the determination module 64 judges the e-mail message to be spam, marks its subject with “spam”, and stores the marked e-mail message in the inbox 44. The token lookup module 60 then reads another unread message from the inbox. If either calculated rate does not exceed its corresponding threshold, processing simply moves on to another unread message in the inbox.

The above operation is repeated until no unread e-mail messages remain in the inbox 44.
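The processing loop described above can be sketched end to end in Python. Everything named here is hypothetical: the `Message` class, the `set` standing in for the thesaurus 48, the tokenizer, and the bracketed subject prefix used as the mark. The thresholds default to the 20% values of this embodiment.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    body: str


def rates(text, dictionary):
    """Rates of undefined words and of characters in undefined words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not tokens:
        return 0.0, 0.0
    undefined = [t for t in tokens if t.lower() not in dictionary]
    return (len(undefined) / len(tokens),
            sum(map(len, undefined)) / sum(map(len, tokens)))


def filter_unread(unread, dictionary, th_words=0.20, th_chars=0.20, mark="spam"):
    """Process each unread message once; mark the subject of those judged spam."""
    for msg in unread:
        word_rate, char_rate = rates(msg.body, dictionary)
        if word_rate > th_words and char_rate > th_chars:
            msg.subject = f"[{mark}] {msg.subject}"
    return unread


dictionary = {"please", "review", "the", "meeting", "notes", "for", "you"}
inbox = [
    Message("Notes", "Please review the meeting notes"),
    Message("Offer", "c0nf1dential m0ney 0ffer for you"),
]
for m in filter_unread(inbox, dictionary):
    print(m.subject)
```

With this toy dictionary the first subject is left unchanged and the second is marked, so a mail client rule keyed on the mark could route it to a separate folder.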

FIG. 3 shows an example of an e-mail message. According to the SFSCA 46 of this embodiment, the rate of undefined words is 3% and the rate of characters in undefined words is 5%, so the message is judged not to be spam. Note that the sender apparently did not spell-check this message, yet both rates are still relatively small.

FIG. 4 shows another example of an e-mail message. According to the SFSCA 46 of this embodiment, the rate of undefined words is calculated to be 46% and the rate of characters in undefined words to be 45%, so the message is judged to be spam.

As described above, the SFSCA 46 of this embodiment can determine whether an e-mail is spam using a dictionary that is readily available in an ordinary office environment. No additional third-party spell-checking dictionary database is required, although this embodiment does not exclude the use of such an additional dictionary. The embodiment can identify spam e-mail messages on its own (with the support of the spell-checking dictionary) and can therefore be used as stand-alone software. Combined with a conventional Bayesian filter, a hash blacklist filter, or both, the SFSCA 46 of this embodiment can further improve their spam checking capability.

It should be emphasized that this embodiment requires no explicit “training” phase. If a user finds that a frequently used word is not defined in the spell-checking dictionary, the user will add it to the dictionary, without being aware that doing so also improves the spam filtering capability of the SFSCA 46.

In the embodiment described above, both the rate of undefined words and the rate of characters in undefined words were calculated. However, the present invention is not limited to such an embodiment. For example, only the rate of undefined words, or only the rate of characters in undefined words, may be used.

-Example of implementation by computer-
The SFSCA 46 described above can be implemented as software running on ordinary computer hardware. FIG. 5 is an external view of such an ordinary computer system 120, and FIG. 6 is a functional block diagram of the system 120.

Referring to FIG. 5, the computer system 120 includes a computer 140 having an FD (flexible disk) drive 152 and a CD-ROM (compact disc read-only memory) drive 150, a keyboard 146, a mouse 148, and a monitor 142.

Referring to the block diagram of FIG. 6, the computer 140 further includes a CPU (central processing unit) 156; a bus 166 connected to the CPU 156, the CD-ROM drive 150, and the FD drive 152; a read-only memory (ROM) 158 that stores programs such as a boot-up program; a random access memory (RAM) 160 that is connected to the CPU 156 and stores application program instructions to be executed, system programs, data, and the like; and a hard disk drive 154.

The computer 140 further includes a network adapter board 174 that provides the computer 140 with a connection to a local area network (LAN) 176. A mail server (not shown) is connected to the computer system 120 via the LAN 176.

The thesaurus 48 shown in FIGS. 1 and 2 is stored in the hard disk drive 154. When the program is executed, the thesaurus 48 is read from the hard disk drive 154 and stored in the RAM 160.

FIG. 7 shows the control structure of a program that causes the computer system 120 to perform mail filtering according to this embodiment. The program is stored on a CD-ROM 180 or an FD 182, which is inserted into the CD-ROM drive 150 or the FD drive 152, and the program is then transferred to the hard disk drive 154. Alternatively, the program may be transmitted to the computer 140 via the LAN 176 and stored in the hard disk drive 154. At execution time, the program is loaded into the RAM 160. The program may also be loaded directly into the RAM 160 from the CD-ROM 180 or the FD 182, or via the LAN 176.

The program described below includes a plurality of instructions that cause the computer 140 to operate as the SFSCA 46. Some of the basic functions needed to perform this mail filtering are provided by the operating system (OS) running on the computer 140, by third-party programs, or by modules such as toolboxes installed on the computer 140, so the program need not itself include every basic function necessary to realize the SFSCA 46 of this embodiment. The program need only include those instructions that perform the mail filtering by calling the appropriate functions or “tools” in a controlled manner so that the desired result is obtained. How the computer system 120 operates is well known and is therefore not described in detail here.

Referring to FIG. 7, after starting, the program that implements the SFSCA 46 of this embodiment first initializes the required resources in step 200. For example, in this step the program connects to the thesaurus 48 resident in the RAM 160. Next, in step 202, the system attempts to read the next unread e-mail in the inbox 44. In step 204, it is determined whether any unread e-mail remains. If there is no unread e-mail message, the program performs cleanup processing in step 240 and ends. Otherwise, in step 206, the variables CNT_C and CNT_W are set to zero. The variable CNT_C counts the number of characters in undefined words, and CNT_W counts the number of undefined words.

In step 208, the total number of words in the e-mail message is determined. Similarly, in step 210, the total number of characters is determined.

For every word in the e-mail message, it is determined in step 220 whether the word is in the thesaurus 48. If it is not, 1 is added to CNT_W. Similarly, for every character in the e-mail message, it is determined in step 222 whether the character belongs to a word in the thesaurus 48. If it does not, 1 is added to CNT_C.
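Steps 208 through 222 can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer, the use of a plain `set` in place of the thesaurus 48, the case-insensitive lookup, and the choice to count only characters that belong to words (whitespace and punctuation excluded) are not specified in the description.

```python
import re

# Assumed token pattern: runs of letters, digits, and apostrophes form a word.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def count_undefined(message, dictionary):
    """Return (total_words, total_chars, cnt_w, cnt_c) for one message.

    cnt_w is the number of words not found in `dictionary`;
    cnt_c is the number of characters belonging to those words.
    """
    total_words = 0  # step 208
    total_chars = 0  # step 210 (assumption: characters inside words only)
    cnt_w = 0        # undefined words, step 220
    cnt_c = 0        # characters in undefined words, step 222
    for word in WORD_RE.findall(message):
        total_words += 1
        total_chars += len(word)
        if word.lower() not in dictionary:
            cnt_w += 1
            cnt_c += len(word)
    return total_words, total_chars, cnt_w, cnt_c


# Hypothetical example: one spelling-altered word among three.
counts = count_undefined("Buy v1agra now", {"buy", "now"})
```

Because every character of an undefined word is itself "not in a defined word", the per-character test of step 222 is folded into the per-word loop here by adding the word length to `cnt_c`.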

In step 224, the rate of undefined words (RATIO_W) and the rate of characters in undefined words (RATIO_C) are calculated from the word count and character count obtained in steps 208 and 210 and the variables CNT_W and CNT_C.
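Written out, the calculation of step 224 presumably takes the following form. The symbols N_W and N_C for the totals from steps 208 and 210 are introduced here for illustration and do not appear in the description, and the numbers in the worked line are hypothetical.

```latex
\[
\mathrm{RATIO\_W} = \frac{\mathrm{CNT\_W}}{N_W},
\qquad
\mathrm{RATIO\_C} = \frac{\mathrm{CNT\_C}}{N_C}
\]
% Hypothetical message: N_W = 3 words, N_C = 12 characters,
% with CNT_W = 1 undefined word containing CNT_C = 6 characters.
\[
\mathrm{RATIO\_W} = \frac{1}{3} \approx 0.33,
\qquad
\mathrm{RATIO\_C} = \frac{6}{12} = 0.50
\]
```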

In step 226, it is determined whether the calculated RATIO_W exceeds a threshold TH₁. If RATIO_W > TH₁, control proceeds to step 228; otherwise, control returns to step 202.

In step 228, it is determined whether the calculated RATIO_C exceeds a threshold TH₂. If RATIO_C > TH₂, control proceeds to step 230; otherwise, control returns to step 202.

In step 230, the subject of the e-mail is marked with “SPAM”, and control returns to step 202.
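The decision of steps 226 through 230 can be sketched as a single function. The function name, the threshold values used in the example, and the exact form of the mark (a "SPAM " prefix on the subject) are assumptions; the description states only that the subject is marked with “SPAM” when both rates exceed their thresholds.

```python
def mark_subject(subject, ratio_w, ratio_c, th1, th2):
    """Return the subject, marked as spam if both rates exceed their thresholds."""
    if ratio_w > th1:          # step 226
        if ratio_c > th2:      # step 228
            return "SPAM " + subject  # step 230 (assumed mark format)
    return subject             # otherwise the message is left unchanged


# Rates from the FIG. 4 example (46% and 45%); thresholds of 0.3 are hypothetical.
marked = mark_subject("Special offer", 0.46, 0.45, th1=0.3, th2=0.3)
unmarked = mark_subject("Meeting notes", 0.46, 0.10, th1=0.3, th2=0.3)
```

The nested tests mirror the flowchart order: a message that fails the word-rate test is never checked against the character-rate threshold.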

When executed on the computer system 120, the program configured as described above causes the computer system 120 to perform all the functions of the SFSCA 46 shown in FIGS. 1 and 2.

When an e-mail message is determined to be an unwanted message, the determination module 64 of this embodiment adds a specific keyword to its header portion. The present invention is not limited to such an implementation. For example, the SFSCA may send unwanted e-mail messages to a predetermined destination such as a discard folder and send other mail to the normal inbox of the e-mail agent.
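The routing alternative can be sketched as follows. The function name `route`, the list-based folders, and the stand-in predicate are hypothetical; any spam test, such as the two-threshold test of this embodiment, could be passed in as `is_spam`.

```python
def route(messages, is_spam):
    """Split messages into (inbox, discard) using the supplied spam test."""
    inbox, discard = [], []
    for message in messages:
        # Unwanted messages go to the first destination, all others to the second.
        (discard if is_spam(message) else inbox).append(message)
    return inbox, discard


# Hypothetical predicate standing in for the dictionary-based decision.
inbox, discard = route(
    ["agenda", "spam offer", "minutes"],
    is_spam=lambda m: m.startswith("spam"),
)
```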

The calculation module 62 shown in FIG. 2 calculates the rate of tokens that are not in the thesaurus 48, and the determination module 64 determines that the e-mail is spam if that rate exceeds a threshold. However, the present invention is not limited to such an implementation. The calculation module may instead calculate the rate of tokens that are in the thesaurus 48, and the determination module may determine that the e-mail is spam if that rate is below a threshold. The two approaches are equivalent. Likewise, any reasonable criterion may be adopted for determining whether an e-mail is spam.
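The equivalence follows from the fact that every token is either in the thesaurus or not. In the sketch below, the symbols R_undef, R_def, and TH are notation introduced for illustration only.

```latex
\[
R_{\mathrm{def}} = 1 - R_{\mathrm{undef}}
\]
\[
R_{\mathrm{undef}} > TH
\;\Longleftrightarrow\;
1 - R_{\mathrm{undef}} < 1 - TH
\;\Longleftrightarrow\;
R_{\mathrm{def}} < 1 - TH
\]
```

A threshold TH on the rate of undefined tokens therefore corresponds to a threshold 1 - TH on the rate of defined tokens.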

In this embodiment, an e-mail is classified as spam if both the rate of undefined words and the rate of characters in undefined words exceed their respective thresholds. The present invention is not limited to this embodiment. An e-mail may instead be classified as spam if either the rate of undefined words or the rate of characters in undefined words exceeds its corresponding threshold.
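The two classification rules differ only in how the threshold tests are combined, as the following sketch shows. The function name, the `require_both` parameter, and the threshold values in the example are hypothetical.

```python
def is_spam(ratio_w, ratio_c, th1, th2, require_both=True):
    """Classify a message from its word rate and character rate.

    require_both=True  -> both rates must exceed their thresholds (this embodiment).
    require_both=False -> either rate exceeding its threshold is enough (the variant).
    """
    over_w = ratio_w > th1
    over_c = ratio_c > th2
    return (over_w and over_c) if require_both else (over_w or over_c)


# A message with many short undefined words: high word rate, low character rate.
strict = is_spam(0.40, 0.10, th1=0.3, th2=0.3)                       # AND rule
lenient = is_spam(0.40, 0.10, th1=0.3, th2=0.3, require_both=False)  # OR rule
```

The OR rule flags more messages, so it trades a higher chance of catching spam for a higher chance of misclassifying legitimate mail.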

Further, the message is not necessarily limited to an e-mail message. It may be an instant messenger message or an electronic message in any other form.

The above-described embodiments are merely examples and should not be construed as limiting. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing the overall configuration of an e-mail system including the SFSCA 46 according to an embodiment of the present invention.
FIG. 2 is a detailed block diagram of the SFSCA 46 according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of an e-mail that is not spam.
FIG. 4 is a diagram showing an example of a spam e-mail.
FIG. 5 is an external view of a computer system that implements the SFSCA 46 according to an embodiment of the present invention.
FIG. 6 is a functional block diagram of the computer system.
FIG. 7 is a flowchart showing the structure of a program that implements the SFSCA 46 according to an embodiment of the present invention.

Explanation of symbols

40: e-mail server; 42: e-mail agent; 44: inbox; 46: SFSCA (spell-check software agent for spam filtering); 48: thesaurus; 60: token lookup module; 62: calculation module; 64: determination module; 66 and 68: threshold memories; 70: work memory

Claims

An electronic message filter system comprising:
connection means for providing a connection to a predetermined dictionary for use in spell checking;
reading means for reading an electronic message;
means for determining, for each token in the electronic message read by the reading means, via the connection provided by the connection means, whether the token has a heading in the dictionary; and
processing means for processing the electronic message according to the result obtained by the means for determining for the tokens in the electronic message.

The processing means is
Means for obtaining a ratio of tokens in the electronic message that do not have a heading in the dictionary to all tokens in the electronic message;
A determination means for determining whether a rate of tokens without a heading in the dictionary satisfies a predetermined condition;
Means for classifying the electronic message according to a determination of a determination means.

The system of claim 2, wherein the determining means includes means for determining whether the rate of tokens without headings in the dictionary exceeds a predetermined threshold.

The means for classifying comprises
First classifying means for classifying an electronic message having a token rate exceeding the predetermined threshold as an unnecessary message;
The system according to claim 3, further comprising: a second classifying unit for classifying a message whose token rate does not exceed the predetermined threshold as a normal message.

5. The system of claim 4, wherein the first classifying means includes means for inserting a predetermined text into an electronic message in which the token rate exceeds the predetermined threshold.

6. The system of claim 5, wherein the electronic message includes a subject part and a body part, and the means for inserting includes means for inserting the predetermined text into the subject part of an electronic message in which the token rate exceeds the predetermined threshold.

The system of claim 4, wherein the first classifying means includes means for sending the electronic message to a predetermined first destination, and the second classifying means includes means for sending the electronic message to a predetermined second destination that is different from the first destination.

The system of claim 1, wherein the means for determining includes means for determining, for each word of the electronic message, via the connection provided by the connecting means, whether the word has a heading in the dictionary.

For each character of the electronic message further comprising means for determining whether the character belongs to a word without a heading in the dictionary;
The processing means is
Means for obtaining a rate of words in the electronic message that do not have a heading in the dictionary with respect to all words in the electronic message;
Means for obtaining a rate of characters contained in words in the electronic message that do not have a heading in the dictionary;
Determining means for determining whether the word rate and the character rate satisfy a predetermined condition;
9. A system according to claim 8, comprising means for classifying the electronic message according to a determination of the determination means.

10. The system of claim 9, wherein the determining means includes means for determining whether the word rate exceeds a predetermined threshold and the character rate exceeds a predetermined threshold.

The means for classifying comprises
First classification means for classifying an electronic message in which the word rate and the character rate exceed respective threshold values as an unnecessary message;
11. A system according to claim 10, comprising: a second classifying means for classifying as an ordinary message an electronic message in which either the word rate or the character rate does not exceed a corresponding threshold.

12. The system of claim 11, wherein the first classification means includes means for inserting predetermined text into an electronic message in which the word rate and the character rate exceed their respective thresholds.

The electronic message includes a subject part and a body part,
The means for inserting includes means for inserting predetermined text into a subject portion of an electronic message in which the word rate and the character rate exceed respective thresholds. The described system.

The means for determining includes, for each word of the electronic message, means for determining via a connection provided by the connecting means whether the word has a heading in the dictionary;
The system further includes means for determining, for each character in the electronic message, whether the character belongs to a word without a heading in the dictionary;
The processing means is
Means for obtaining a rate of characters contained in words in the electronic message that do not have a heading in the dictionary;
Determining means for determining whether the character rate satisfies a predetermined condition;
Means for classifying the electronic message according to a determination of the determining means.

The system of claim 14, wherein the determining means includes means for determining whether the rate of the characters exceeds a predetermined threshold.

The means for classifying comprises
First classification means for classifying an electronic message having a character rate exceeding a threshold as an unnecessary message;
16. A system according to claim 15, comprising: a second classifying means for classifying an electronic message whose character rate does not exceed a threshold value as a normal message.

A computer-executable program that, when executed on a computer, causes the computer to perform all of the functions recited in any one of claims 1 to 16.