JP4878468B2

JP4878468B2 - E-mail evaluation apparatus and e-mail evaluation method

Info

Publication number: JP4878468B2
Application number: JP2005309803A
Authority: JP
Inventors: 尚中川; 広樹谷岡
Original assignee: 株式会社ジャストシステム
Priority date: 2005-10-25
Filing date: 2005-10-25
Publication date: 2012-02-15
Anticipated expiration: 2025-10-25
Also published as: JP2007122145A

Description

The present invention relates to a technique for evaluating the contents of a document file, and more particularly to a technique for determining the suitability of the contents of an e-mail.

In recent years, with the spread of computers and advances in network technology, the exchange of electronic information over networks has become widespread. E-mail is one form of such exchange, but a large share of e-mail is said to be junk mail, commonly called spam.

In response to this situation, techniques have been developed that determine the suitability of e-mail content by natural language processing and automatically eliminate junk mail.
As one example, there is a method that comprehensively evaluates whether an e-mail is junk mail by judging the appropriateness of each word it contains. Suppose that, out of 100 e-mails, 70 are junk mail and the remaining 30 are normal e-mails (hereinafter, "regular mail"). Suppose further that a word A is detected in 60 of the 70 junk mails and in 3 of the 30 regular mails. Because word A tends to appear in junk mail, an e-mail containing word A is likely to be junk mail. From this viewpoint, the appropriateness or inappropriateness of each word is indexed and stored in a database, and the likelihood that a newly received e-mail is junk mail is comprehensively evaluated from the words it contains.
JP 2003-18324 A

To evaluate the suitability of e-mail content accurately with such a method, a well-populated database is important. The more e-mails are received, the more settled the evaluation of each word becomes, and the number of words subject to evaluation also grows. On the other hand, that growth bloats the database. In particular, when junk mail of the type that strings together meaningless words is received, the number of words registered in the database jumps at once.

The present invention has been made in view of these circumstances, and its main object is to provide a technique for efficiently suppressing the bloating of a database used to evaluate the contents of e-mail.

One aspect of the present invention is an e-mail evaluation apparatus.
This apparatus comprises: a fitness information holding unit that holds, as fitness information, a fitness value indexing the appropriateness of each word, in order to determine whether an e-mail transmitted from an external device has content appropriate for the receiving user; a mail acquisition unit that acquires an e-mail to be evaluated; a word extraction unit that extracts the words contained in the e-mail; a suitability determination unit that refers to the fitness information to obtain the fitness of each word contained in the e-mail and determines from those fitness values whether the e-mail has appropriate content; a fitness update unit that updates the fitness information by recalculating the fitness of each word contained in the evaluated e-mail according to the determination result for that e-mail; a word registration unit that, when a word not registered in the fitness information is extracted from an e-mail, newly registers that word in the fitness information; and a word deletion unit that removes a newly registered word from the fitness information when the appearance frequency of that word, in the group of e-mails acquired after the e-mail containing it, is smaller than a predetermined threshold.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, a recording medium, a computer program, and the like, is also valid as an aspect of the present invention.

According to the present invention, the bloating of a database used to evaluate the contents of e-mail can be suppressed efficiently.

FIG. 1 is a schematic diagram showing the relationship between the e-mail evaluation apparatus and a mail browser.
The client terminal 80 is an information device used by the user, such as a personal computer or a portable terminal. A mail browser 90 for sending, receiving, and viewing mail is installed on the client terminal 80. In this embodiment, an e-mail received by the client terminal 80 is first evaluated by the e-mail evaluation apparatus 100. The e-mail evaluation apparatus 100 forwards the e-mail to the mail browser 90 if it is not junk mail and does not forward it if it is junk mail. In other words, the e-mail evaluation apparatus 100 functions as an e-mail filter.

The e-mail evaluation apparatus 100 of this embodiment evaluates the content of an e-mail based on the Bayesian filter method and determines whether the e-mail is junk mail.
The principle of this determination is as follows.
As an example, suppose that 100 sample junk mails and 100 sample regular mails have been prepared in advance, and that the word "sweepstakes" appears 98 times in the junk mail group and twice in the regular mail group. In this case, an e-mail containing the word "sweepstakes" is highly likely to be junk mail. For each word, the e-mail evaluation apparatus 100 indexes "how likely an e-mail containing that word is to be junk mail" as a "spam word probability".

In the widely known Paul Graham method, the spam word probability P(w) of a word w is defined by
P(w) = (m/M) / (2 × n/N + m/M)
where
m: number of times the word w appeared in the junk mail group
M: total number of junk mails
n: number of times the word w appeared in the regular mail group
N: total number of regular mails
Calculating the spam word probability of the word "sweepstakes" from the example above with the Paul Graham method, m = 98, M = 100, n = 2, and N = 100, so
P("sweepstakes") = (98/100) / (2 × 2/100 + 98/100)
which comes to about 96%.
The e-mail evaluation apparatus 100 builds a database of the spam word probabilities of the words contained in these 200 e-mails. In this embodiment, such a database is called "fitness information".
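The formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name `spam_word_probability` and the fallback to a neutral 0.5 when a word has never been seen are assumptions made for illustration, not part of the described apparatus.

```python
def spam_word_probability(m: int, M: int, n: int, N: int) -> float:
    """P(w) = (m/M) / (2 * n/N + m/M), following the formula in the text.

    m: appearances of the word in junk mail, M: total junk mails,
    n: appearances of the word in regular mail, N: total regular mails.
    """
    spam_rate = m / M
    regular_rate = n / N
    denominator = 2 * regular_rate + spam_rate
    if denominator == 0:
        # Assumption: a word seen in neither group is treated as neutral.
        return 0.5
    return spam_rate / denominator


# Worked example from the text: m = 98, M = 100, n = 2, N = 100.
p = spam_word_probability(m=98, M=100, n=2, N=100)
print(f"{p:.2%}")  # 96.08%
```

The factor of 2 on the regular-mail term biases the estimate toward "not spam", so a word must be clearly over-represented in junk mail before it scores high.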

Suppose that the spam word probabilities "sweepstakes: 96%", "frozen: 30%", and "udon: 5%" are set in this fitness information.
Suppose that, after this initial setup, the e-mail evaluation apparatus 100 newly receives an e-mail reading "I often enter sweepstakes. The other day I won frozen udon."
The probability that this e-mail is junk mail (hereinafter, the "spam mail probability") is calculated as
(0.96 × 0.3 × 0.05) / {(0.96 × 0.3 × 0.05) + (1 − 0.96) × (1 − 0.3) × (1 − 0.05)} ≈ 35%
The e-mail evaluation apparatus 100 determines an e-mail whose spam mail probability is 90% or more to be junk mail. In that case the total number of junk mails becomes 101, and the spam word probability of each word in the fitness information is recalculated accordingly.
If, on the other hand, the spam mail probability is less than 90%, the e-mail evaluation apparatus 100 provisionally treats the e-mail as not being junk mail and forwards it to the mail browser 90. The user of the mail browser 90 judges whether the forwarded e-mail is indeed regular mail or is junk mail after all. That judgment is fed back to the e-mail evaluation apparatus 100, which updates the fitness information to reflect it. The e-mail evaluation apparatus 100 thus updates and enriches the fitness information each time an e-mail is received.
The user can also change various determination conditions of the e-mail evaluation apparatus 100, as described later.
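The spam mail probability calculation and the 90% decision can be sketched as follows. The function names and the handling of the degenerate case where both products are zero are illustrative assumptions; the combining formula and the threshold are taken from the text.

```python
from math import prod


def spam_mail_probability(word_probs):
    """Combine per-word spam probabilities as in the worked example."""
    spam = prod(word_probs)                      # product of p_i
    regular = prod(1.0 - p for p in word_probs)  # product of (1 - p_i)
    total = spam + regular
    if total == 0:
        # Assumption: contradictory evidence (some p = 0 and some p = 1) is neutral.
        return 0.5
    return spam / total


def is_junk(word_probs, threshold=0.90):
    """An e-mail is judged junk when its spam mail probability reaches the threshold."""
    return spam_mail_probability(word_probs) >= threshold


# "sweepstakes": 96%, "frozen": 30%, "udon": 5%
score = spam_mail_probability([0.96, 0.30, 0.05])
print(f"{score:.2%}")  # 35.12%
```

Even though "sweepstakes" alone scores 96%, the low-probability words "frozen" and "udon" pull the combined value well below 90%, so this e-mail would be forwarded to the mail browser.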

FIG. 2 is a functional block diagram of the e-mail evaluation apparatus.
In hardware terms, each block shown here can be implemented by elements such as a computer's CPU and by mechanical devices; in software terms, it is implemented by a computer program or the like. The figure depicts functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software. In this embodiment, the e-mail evaluation apparatus 100 is described as having its functions provided by application software installed on the client terminal 80.
The description here mainly covers the function each block should provide; the specific operation is described with reference to FIG. 3 and subsequent figures.

The e-mail evaluation apparatus 100 includes a user interface processing unit 110, a mail acquisition unit 112, a mail transfer unit 114, a data processing unit 116, and a data storage unit 118.
The user interface processing unit 110 handles user interface processing in general, such as processing input from the user and displaying information to the user. The mail acquisition unit 112 acquires e-mail from an external mail server (not shown). The mail transfer unit 114 forwards to the mail browser 90 those acquired e-mails that the data processing unit 116 has provisionally determined not to be junk mail.

The data processing unit 116 executes various kinds of data processing based on data obtained from the user interface processing unit 110 and the mail acquisition unit 112. The data processing unit 116 also serves as an interface among the user interface processing unit 110, the mail acquisition unit 112, the mail transfer unit 114, and the data storage unit 118.
The data storage unit 118 stores various setting data prepared in advance and data received from the data processing unit 116.

The data storage unit 118 includes a fitness information holding unit 138. The fitness information holding unit 138 holds fitness information that associates each word with its spam word probability.

The data processing unit 116 includes a fitness information processing unit 120 and a mail evaluation unit 122.
When the mail acquisition unit 112 acquires an e-mail, the mail evaluation unit 122 determines the suitability of the e-mail by calculating its spam mail probability. The fitness information processing unit 120 updates the fitness information in the fitness information holding unit 138 according to the determination result.

The mail evaluation unit 122 includes a word extraction unit 134 and a suitability determination unit 136.
The word extraction unit 134 extracts the words contained in an e-mail. A "word" here may be a group of words or a byte stream and need not be limited to a word in the sense of the minimum unit of sentence structure. The suitability determination unit 136 reads the spam word probabilities of the extracted words from the fitness information holding unit 138 and calculates the spam mail probability. As already described, if the spam mail probability is less than 90%, the suitability determination unit 136 has the mail transfer unit 114 forward the e-mail to the mail browser 90; if it is 90% or more, the e-mail is not forwarded.
Regular mail is e-mail whose spam mail probability is less than 90% and which the user has judged appropriate at the client terminal 80. Junk mail is e-mail whose spam mail probability is 90% or more, or whose spam mail probability is less than 90% but which the user has judged inappropriate at the client terminal 80. The suitability of an e-mail is thus determined by the mail evaluation unit 122 of the e-mail evaluation apparatus 100, by the user of the mail browser 90, or by both.
The user can change the 90% value used as the e-mail determination criterion to any value via the user interface processing unit 110.
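The two-stage definition of regular mail and junk mail can be written as a small decision function. This is a sketch under assumptions: the function name, the string labels, and the boolean `user_says_junk` input are hypothetical and stand in for the feedback the user gives in the mail browser.

```python
def final_label(spam_mail_prob: float, user_says_junk: bool, threshold: float = 0.90) -> str:
    """Return "junk" or "regular" following the definitions in the text.

    The user's judgment only matters for e-mail that was forwarded,
    i.e. e-mail whose spam mail probability is below the threshold.
    """
    if spam_mail_prob >= threshold:
        return "junk"  # filtered out; never shown to the user
    return "junk" if user_says_junk else "regular"


print(final_label(0.95, user_says_junk=False))  # junk
print(final_label(0.35, user_says_junk=True))   # junk
print(final_label(0.35, user_says_junk=False))  # regular
```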

The fitness information processing unit 120 includes an update unit 124, a word registration unit 126, a word deletion unit 128, a counting unit 130, and a threshold setting unit 132.
The update unit 124 updates the fitness information. That is, according to the determination result for a new e-mail, it recalculates the spam word probability of each word in the fitness information using Paul Graham's formula. If any word contained in the e-mail is not yet registered in the fitness information, the word registration unit 126 newly registers it. From then on, the spam word probability of the newly registered word is calculated.

The word deletion unit 128 deletes from the fitness information any word for which the deletion condition holds. Specifically, once a word w detected in a certain e-mail M has been newly registered, the word w is deleted from the fitness information when the following deletion condition subsequently holds:
(r ≥ R) ∩ [{(s/r) < T} ∪ {0.5 − P_t ≤ p ≤ 0.5 + P_t}]
where
r: the number of e-mails acquired after e-mail M was acquired.
R: the first threshold. An integer of 100 or more, set by the user. The default value is 1000.
s: among the r e-mails acquired after e-mail M was acquired, the number of e-mails that contain the word w.
T: the second threshold. Set by the user in the range of 0.01 or more and less than 1.0. The default value is 0.1.
p: the spam word probability of the word w, calculated from the r e-mails acquired after e-mail M was acquired.
P_t: the third threshold. Set by the user in the range of 0 to 0.5 inclusive. The default value is 0.2.
Here "e-mail M" denotes the e-mail in which the word w was first found; it is distinct from the total number of junk mails M used in the Paul Graham formula.

The meaning of each term is as follows.
1. (s/r) < T: first deletion condition
A word w is a deletion candidate when its appearance frequency in the r e-mails acquired after e-mail M is low. If the word w rarely appears after it was newly registered in the fitness information from e-mail M, it is not considered an important factor in calculating the spam mail probability. In that case, the word w is treated as a deletion candidate.
2. 0.5 − P_t ≤ p ≤ 0.5 + P_t: second deletion condition
A word w is a deletion candidate when its spam word probability is near 0.5. A word w whose spam word probability is close to 0.5, that is, a word of neutral appropriateness, is not considered an important factor in calculating the spam mail probability. In that case, the word w is treated as a deletion candidate.
3. r ≥ R: third deletion condition
This condition ensures statistical stability when deciding whether the word w may be deleted. The word w can become a deletion candidate only if the third deletion condition holds.
In summary, once a certain number of e-mails have been acquired after the word w was newly registered, the word w is deleted from the fitness information if its appearance frequency is low or its spam word probability is neutral. This processing keeps the number of words subject to evaluation in the fitness information from growing excessively.
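The combined deletion condition (r ≥ R) and [(s/r) < T or 0.5 − P_t ≤ p ≤ 0.5 + P_t] can be expressed as a single predicate. The sketch below uses the default thresholds given in the text; the function name `should_delete` and the guard against r = 0 are assumptions added for illustration.

```python
def should_delete(r: int, s: int, p: float,
                  R: int = 1000, T: float = 0.1, P_t: float = 0.2) -> bool:
    """Return True when a newly registered word meets the deletion condition.

    r: e-mails acquired since the word was registered
    s: how many of those e-mails contain the word
    p: the word's spam word probability over those e-mails
    """
    # Third deletion condition: wait until enough e-mail has been observed.
    if r == 0 or r < R:
        return False
    low_frequency = (s / r) < T                   # first deletion condition
    neutral = (0.5 - P_t) <= p <= (0.5 + P_t)     # second deletion condition
    return low_frequency or neutral


print(should_delete(r=999, s=0, p=0.5))      # False: too few e-mails so far
print(should_delete(r=1000, s=50, p=0.95))   # True: appears in only 5% of e-mails
print(should_delete(r=1000, s=500, p=0.60))  # True: probability is near 0.5
print(should_delete(r=1000, s=500, p=0.95))  # False: frequent and informative
```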

The counting unit 130 counts the number r of e-mails received after a word is newly registered. This count is used when judging whether the deletion condition holds. The threshold setting unit 132 changes the first to third thresholds according to setting input from the user.
Next, the processing performed when the e-mail evaluation apparatus 100 newly receives an e-mail is described.

FIG. 3 is a flowchart showing the basic processing of the e-mail evaluation apparatus when an e-mail is received.
First, the mail acquisition unit 112 acquires an e-mail transmitted from an external device (S10). A mail evaluation process is executed by calculating the spam mail probability of this e-mail (S12), and, as needed, a word deletion determination process is executed to decide whether words should be deleted from the fitness information (S14).
The processing of S12 and S14 is described in detail below.

FIG. 4 is a flowchart showing in detail the mail evaluation process in S12 of FIG. 3.
The word extraction unit 134 extracts the words contained in the e-mail (S16). If any of them is not registered in the fitness information (Y in S18), the word registration unit 126 newly registers the unregistered word in the fitness information (S20). For this newly registered word, the counting unit 130 starts counting the number r of e-mails acquired from then on. The counting unit 130 keeps a separate count r for each newly registered word. If none of the words extracted from the e-mail is unregistered (N in S18), S20 is skipped. The spam mail probability is then calculated (S22).

If the spam mail probability is equal to or higher than a predetermined threshold (N in S24), the suitability determination unit 136 determines that the e-mail is junk mail (S29). In this embodiment the threshold is set to 90%, but it can be changed by setting input from the user. If the spam mail probability is below the threshold (Y in S24), the suitability determination unit 136 provisionally determines that the e-mail is regular mail, and the mail transfer unit 114 forwards the e-mail to the mail browser 90 (S26). If the user judges the forwarded e-mail to be junk mail (Y in S27), it is treated as junk mail (S29). If the user judges the forwarded e-mail to be regular mail (N in S27), it is treated as regular mail (S28).
The update unit 124 recalculates the spam word probability of each word in the fitness information according to the determination result for the e-mail (S30).
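The flow of FIG. 4 (S16 to S30) can be sketched end to end as below. This is a hedged illustration rather than the apparatus itself: the class `FitnessInfo`, the function `evaluate_mail`, and the `ask_user` callback are hypothetical names; words are assumed to be already tokenized; each word is counted at most once per e-mail; and a word with no counts yet is treated as neutral (0.5).

```python
from math import prod


class FitnessInfo:
    """Hypothetical in-memory stand-in for the fitness information."""

    def __init__(self):
        self.M = 0        # total junk mails seen
        self.N = 0        # total regular mails seen
        self.counts = {}  # word -> [m, n]: junk / regular mails containing it

    def word_probability(self, word):
        m, n = self.counts.get(word, (0, 0))
        if self.M == 0 or self.N == 0:
            return 0.5  # assumption: no basis for an estimate yet
        spam_rate, regular_rate = m / self.M, n / self.N
        denominator = 2 * regular_rate + spam_rate
        return spam_rate / denominator if denominator else 0.5

    def update(self, words, is_junk):
        """S30: fold the verdict for one e-mail into the counts."""
        if is_junk:
            self.M += 1
        else:
            self.N += 1
        index = 0 if is_junk else 1
        for word in set(words):
            self.counts.setdefault(word, [0, 0])[index] += 1


def evaluate_mail(info, words, ask_user, threshold=0.90):
    words = set(words)                                         # S16
    new_words = sorted(w for w in words if w not in info.counts)  # S18
    for word in new_words:
        info.counts.setdefault(word, [0, 0])                   # S20
    probs = [info.word_probability(w) for w in words]
    spam, regular = prod(probs), prod(1 - p for p in probs)
    score = spam / (spam + regular) if spam + regular else 0.5  # S22
    if score >= threshold:                                     # S24: N
        is_junk = True                                         # S29
    else:
        is_junk = bool(ask_user(words))                        # S26, S27
    info.update(words, is_junk)                                # S28/S29, S30
    return is_junk, score, new_words


info = FitnessInfo()
info.M, info.N = 100, 100
info.counts = {"sweepstakes": [98, 2]}
verdict, score, new_words = evaluate_mail(info, ["sweepstakes"], ask_user=lambda w: False)
print(verdict, round(score, 2), new_words)  # True 0.96 []
```

Storing raw counts rather than probabilities is a design choice of this sketch: it lets the update step recompute every word's probability from m, M, n, and N without replaying past e-mails.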

FIG. 5 is a flowchart showing in detail the word deletion determination process in S14 of FIG. 3.
The word deletion unit 128 determines whether, among the newly registered words, there is a word for which the number r of e-mails acquired since its registration is equal to or greater than the first threshold R, that is, a word for which the third deletion condition holds (S34). If there is none (N in S34), the process of S14 ends. If there is one (Y in S34), the word deletion unit 128 determines whether the appearance frequency of that word w in the r e-mails acquired after its registration is smaller than a predetermined threshold (the second threshold T), that is, whether the first deletion condition holds (S36). If the first deletion condition holds (Y in S36), the word deletion unit 128 deletes the word w from the fitness information (S40). If it does not hold (N in S36), the word deletion unit 128 evaluates the second deletion condition (S38). If that condition holds (Y in S38), the word deletion unit 128 deletes the word w from the fitness information (S40). If it does not (N in S38), the process of S14 ends.
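The per-word counting and the S34, S36, S38, S40 sequence of FIG. 5 can be sketched as follows. The names `NewWordStats`, `observe_mail`, and `deletion_pass` are hypothetical, and the spam word probabilities are assumed to be supplied in a plain dictionary. Words that survive a pass are simply left in place here; the text does not say whether they are re-examined later.

```python
from dataclasses import dataclass


@dataclass
class NewWordStats:
    r: int = 0  # e-mails acquired since the word was registered
    s: int = 0  # of those, e-mails that contain the word


def observe_mail(trackers, mail_words):
    """Role of the counting unit: advance r, and s when the word is present."""
    present = set(mail_words)
    for word, stats in trackers.items():
        stats.r += 1
        if word in present:
            stats.s += 1


def deletion_pass(trackers, probabilities, R=1000, T=0.1, P_t=0.2):
    """Apply the checks in flowchart order and return the deleted words."""
    deleted = []
    for word, stats in list(trackers.items()):
        if stats.r == 0 or stats.r < R:              # S34: not enough e-mail yet
            continue
        low_frequency = stats.s / stats.r < T         # S36: appearance frequency
        p = probabilities.get(word, 0.5)
        neutral = 0.5 - P_t <= p <= 0.5 + P_t         # S38: neutral probability
        if low_frequency or neutral:
            del trackers[word]                        # S40
            probabilities.pop(word, None)
            deleted.append(word)
    return deleted


trackers = {"xqzv": NewWordStats(), "sale": NewWordStats()}
probabilities = {"xqzv": 0.99, "sale": 0.95}
for i in range(100):
    observe_mail(trackers, ["sale"] if i % 2 == 0 else ["hello"])
removed = deletion_pass(trackers, probabilities, R=100)
print(removed)  # ['xqzv']
```

In the usage example, the nonsense word "xqzv" never reappears after registration and is removed, while "sale" appears in half of the later e-mails with a decisive probability and is kept.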

The present invention has been described above based on an embodiment.
With the e-mail evaluation apparatus 100 of this embodiment, when the spam mail probability is obtained by the Bayesian filter method, growth in the data volume of the fitness information on which the determination is based can be suppressed effectively.

Some junk mail strings together meaningless words in order to confuse Bayesian filters. With conventional e-mail filtering based on a Bayesian filter, receiving this type of junk mail causes the number of words in the fitness information to grow dramatically. Bloated fitness information also increases the load of updating it.
In contrast, the e-mail evaluation apparatus 100 of this embodiment deletes words registered in the fitness information as appropriate, so that the data volume of the fitness information does not grow without limit. Because words that are useful for calculating the spam mail probability are kept while less useful words are removed, the apparatus can follow changes in the suitability criteria and in the types of junk mail while still keeping the data volume in check. In this way, the e-mail evaluation apparatus 100 effectively resolves a problem that the word learning function can cause.
This embodiment has been described on the premise of a Bayesian filter, in particular the Paul Graham method, but the technique is not limited to this and is widely applicable to classification methods based on the appropriateness of each word. This embodiment has shown a mode in which e-mail is classified into regular mail and junk mail from words and their fitness. Beyond this, data may be classified by methods based on the appropriateness of various attributes, not only words. For example, when classifying document data, the appearance frequency of words in the document, the author, the creation date and time, and the like can be used as attributes. When classifying image data, color frequency, brightness frequency, and the like can be used as attributes.

The function of the threshold input unit recited in the claims is implemented in this embodiment by the user interface processing unit 110.
Those skilled in the art will also understand that the functions to be performed by the constituent elements recited in the claims are implemented by the individual functional blocks shown in this embodiment or by their cooperation.

The present invention has been described above based on an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications are possible in the combination of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention.

In this embodiment, s in the first deletion condition (s/r) < T has been described as the number of e-mails, among the r e-mails, in which the word w under examination appears. The relationship s ≤ r therefore holds.
As an alternative, s may instead be the number of occurrences of the word w across the r e-mails. In this case, s can be larger than r. Correspondingly, the range of the second threshold T may also extend to 1 or more, and T may be set to a value such as 1000.
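The difference between the two definitions of s can be seen in a short sketch. The function name `appearance_ratios` and the representation of each e-mail as a list of words are assumptions made for illustration.

```python
def appearance_ratios(mails, word):
    """Return (s/r by e-mail count, s/r by occurrence count) for one word.

    mails: list of e-mails, each given as a list of words.
    """
    r = len(mails)
    if r == 0:
        return 0.0, 0.0
    s_mails = sum(1 for words in mails if word in words)        # embodiment: s <= r
    s_occurrences = sum(words.count(word) for words in mails)   # alternative: s may exceed r
    return s_mails / r, s_occurrences / r


mails = [["win", "win", "win"], ["hello"]]
print(appearance_ratios(mails, "win"))  # (0.5, 1.5)
```

Because the occurrence-based ratio is not bounded by 1, a threshold T chosen for the e-mail-count definition cannot be reused unchanged for the alternative definition.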

In addition to this embodiment, a modification is also conceivable in which the importance of every attribute is calculated by the following formula and only the attributes with the top n importance values are retained.
First, importance is defined as
Importance (attribute) = f (appearance frequency of the attribute) × g (determination contribution of the attribute)
f: an arbitrary monotonically increasing function
g: an arbitrary monotonically increasing function
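As a rough illustration of retaining only the top n attributes, the sketch below picks one possible pair of monotonically increasing functions (a square root for f and the identity for g). These choices, the function names, and the sample values are all assumptions. The text leaves f and g arbitrary.

```python
import math


def importance(frequency, contribution, f=math.sqrt, g=lambda x: x):
    """Importance = f(appearance frequency) * g(determination contribution).

    f and g are placeholders; any monotonically increasing functions may be used.
    """
    return f(frequency) * g(contribution)


def keep_top_n(attributes, n):
    """Keep only the n attributes with the highest importance.

    `attributes` maps an attribute name to a (frequency, contribution) pair.
    """
    ranked = sorted(
        attributes,
        key=lambda name: importance(*attributes[name]),
        reverse=True,
    )
    return {name: attributes[name] for name in ranked[:n]}


# Hypothetical attributes: (appearance frequency, determination contribution).
attributes = {
    "free": (0.81, 0.45),
    "meeting": (0.25, 0.40),
    "the": (0.90, 0.05),
    "xyzzy": (0.01, 0.10),
}
kept = keep_top_n(attributes, 2)
```

In this sample, a frequent word with almost no contribution ("the") and a rare word ("xyzzy") both rank low, so only "free" and "meeting" are retained.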

Based on the above, the following formula is defined to determine the importance of each word.
Importance (w) = √{Max (n/N, m/M)} × abs (p - 0.5)
m: number of times the word w appeared in the spam mail group
M: total number of spam mails
n: number of times the word w appeared in the regular mail group
N: total number of regular mails
p: spam word probability of the word w (the farther from 0.5, the higher the determination contribution)
Here, Max (n/N, m/M) is a function that selects the larger of n/N and m/M, and abs (p - 0.5) denotes the absolute value of p - 0.5. With the method of this modification, the words to be removed can be selected with their importance taken into account. For example, even a word for which the deletion condition holds need not be deleted when its importance is greater than a predetermined threshold. By making words of high importance less likely to be deleted and words of low importance more likely to be deleted, the data amount of the fitness information can be reduced even more efficiently.
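The word importance formula and the exemption from deletion can be sketched as follows. The function names, the importance threshold, and the value p = 0.9 are assumptions for illustration. The counts (60 of 70 spam mails, 3 of 30 regular mails) are taken from the example in the background section.

```python
import math


def word_importance(m, M, n, N, p):
    """Importance(w) = sqrt(max(n / N, m / M)) * abs(p - 0.5).

    Assumes M > 0 and N > 0 (at least one mail in each group).
    """
    return math.sqrt(max(n / N, m / M)) * abs(p - 0.5)


def should_delete(deletion_condition_met, importance, importance_threshold):
    """Spare a word whose importance exceeds the threshold,
    even when the deletion condition holds for it."""
    return deletion_condition_met and importance <= importance_threshold


# Word A: seen in 60 of 70 spam mails and 3 of 30 regular mails.
# p = 0.9 is a hypothetical spam word probability.
importance_a = word_importance(m=60, M=70, n=3, N=30, p=0.9)
```

A word with p close to 0.5 gets an importance near zero regardless of how often it appears, so it remains an easy candidate for deletion.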

A schematic diagram showing the relationship between the e-mail evaluation apparatus and a mail browser.
A functional block diagram of the e-mail evaluation apparatus.
A flowchart showing the basic processing of the e-mail evaluation apparatus when an e-mail is received.
A flowchart showing in detail the mail evaluation process in S12 of FIG. 3.
A flowchart showing in detail the word deletion determination process in S14 of FIG. 3.

Explanation of symbols

80 client terminal, 90 mail browser, 100 e-mail evaluation apparatus, 110 user interface processing unit, 112 mail acquisition unit, 114 mail transfer unit, 116 data processing unit, 118 data storage unit, 120 fitness information processing unit, 122 mail evaluation unit, 124 update unit, 126 word registration unit, 128 word deletion unit, 130 counting unit, 132 threshold setting unit, 134 word extraction unit, 136 fitness determination unit, 138 fitness information holding unit.

Claims

An e-mail evaluation apparatus comprising:
a fitness information holding unit that holds, as fitness information, a fitness level that indexes the appropriateness of each word, in order to determine whether an e-mail transmitted from an external device has content appropriate for the receiving user;
an e-mail acquisition unit that acquires an e-mail to be evaluated;
a word extraction unit that extracts words contained in the e-mail;
a fitness determination unit that refers to the fitness information to detect the fitness level of each word contained in the e-mail, and determines from those fitness levels whether the e-mail has appropriate content;
a fitness level update unit that updates the fitness information by recalculating, in accordance with the determination result for the e-mail, the fitness level of each word contained in the e-mail subjected to the determination;
a word registration unit that, when a word not registered in the fitness information is extracted from an e-mail, newly registers the word in the fitness information;
a word deletion unit that, on condition that the number of e-mails further acquired after acquisition of the e-mail containing the newly registered word exceeds a predetermined first threshold, excludes the newly registered word from the fitness information when the appearance frequency of the newly registered word in the group of e-mails acquired after acquisition of the e-mail containing the newly registered word is smaller than a predetermined second threshold;
a threshold value input unit that detects an instruction input by the user for setting the first threshold and the second threshold; and
a threshold value setting unit that sets a value designated by the instruction input as the first threshold or the second threshold.

The e-mail evaluation apparatus according to claim 1, wherein the fitness level update unit recalculates the fitness level of each word contained in the e-mail based on a Bayesian filtering method.

The e-mail evaluation apparatus according to claim 1 or 2, wherein the word deletion unit excludes the newly registered word from the fitness information when the fitness level calculated by the fitness level update unit for the newly registered word is within a predetermined range.

The e-mail evaluation apparatus according to claim 3, wherein the word deletion unit excludes the newly registered word from the fitness information when the fitness level of the newly registered word falls within a predetermined range that includes the median of the range of values the fitness level can take.

The e-mail evaluation apparatus according to claim 3 or 4, wherein the threshold value input unit detects an instruction input by the user for setting the predetermined range, and the threshold value setting unit sets a value designated by the instruction input as the predetermined range.

An e-mail evaluation method comprising:
a step in which an e-mail acquisition unit provided in a computer acquires an e-mail to be evaluated;
a step in which a word extraction unit provided in the computer extracts words contained in the e-mail;
a step in which a fitness determination unit provided in the computer refers to fitness information indicating a fitness level that indexes the appropriateness of each word, detects the fitness level of each word contained in the acquired e-mail, and determines from those fitness levels whether the acquired e-mail has appropriate content;
a step in which a fitness level update unit provided in the computer updates the fitness information by recalculating, in accordance with the determination result for the e-mail, the fitness level of each word contained in the e-mail subjected to the determination;
a step in which a word registration unit provided in the computer, when a word not registered in the fitness information is extracted from an e-mail, newly registers the word in the fitness information;
a step in which a word deletion unit provided in the computer, on condition that the number of e-mails further acquired after acquisition of the e-mail containing the newly registered word exceeds a predetermined first threshold, excludes the newly registered word from the fitness information when the appearance frequency of the newly registered word in the group of e-mails acquired after acquisition of the e-mail containing the newly registered word is smaller than a predetermined second threshold;
a step in which a threshold value input unit provided in the computer detects an instruction input by the user for setting the first threshold and the second threshold; and
a step in which a threshold value setting unit provided in the computer sets a value designated by the instruction input as the first threshold or the second threshold.

An e-mail evaluation program that causes a computer to implement:
a function of holding, as fitness information, a fitness level that indexes the appropriateness of each word, in order to determine whether an e-mail transmitted from an external device has content appropriate for the receiving user;
e-mail acquisition means for acquiring an e-mail to be evaluated;
word extraction means for extracting words contained in the e-mail;
fitness determination means for referring to the fitness information to detect the fitness level of each word contained in the e-mail, and determining from those fitness levels whether the e-mail has appropriate content;
fitness level update means for updating the fitness information by recalculating, in accordance with the determination result for the e-mail, the fitness level of each word contained in the e-mail subjected to the determination;
word registration means for newly registering a word in the fitness information when a word not registered in the fitness information is extracted from an e-mail;
word deletion means for, on condition that the number of e-mails further acquired after acquisition of the e-mail containing the newly registered word exceeds a predetermined first threshold, excluding the newly registered word from the fitness information when the appearance frequency of the newly registered word in the group of e-mails acquired after acquisition of the e-mail containing the newly registered word is smaller than a predetermined second threshold;
threshold value input means for detecting an instruction input by the user for setting the first threshold and the second threshold; and
threshold value setting means for setting a value designated by the instruction input as the first threshold or the second threshold.