JP2005011326A - Obfuscation of spam filter

Info

Publication number: JP2005011326A
Application number: JP2004149663A
Authority: JP
Inventors: Joshua T Goodman; Robert L Rounthwaite; John C Platt
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2003-06-20
Filing date: 2004-05-19
Publication date: 2005-01-13
Anticipated expiration: 2024-05-19
Also published as: KR101143194B1; EP2498458A2; US20050015454A1; CN1573780A; KR20040110086A; CN1573780B; EP1489799A2; EP2498458A3; US7519668B2; JP4572087B2; EP1489799A3

Abstract

PROBLEM TO BE SOLVED: To provide a system and method for obfuscating a spam filtering system so as to hinder reverse engineering of a spam filter and/or to keep a spammer from finding a message that gets through the spam filter almost every time.

SOLUTION: The system includes a randomization component that randomizes a message score before the message is classified as spam or non-spam, in order to obscure the functionality of the spam filter. The randomization can be accomplished in part by adding a random or pseudo-random number to the message score before classification. The number added can vary depending on at least one of several types of input, such as time, user, message content, a hash of the message content, and a hash of particularly important features of the message. Alternatively, multiple spam filters can be deployed rather than a single best spam filter.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to systems and methods for reducing the transmission of spam and, more particularly, to hindering reverse engineering of a spam filter and/or hindering spammers from modeling and predicting spam filter performance.

The advent of global communications networks such as the Internet has presented commercial opportunities for reaching vast numbers of potential customers. Electronic messaging, and particularly electronic mail (e-mail), is becoming increasingly pervasive as a means of disseminating unwanted advertisements and promotions (also called "spam") to network users.

According to an estimate by the Radicati Group, Inc., a consulting and market research firm, as of August 2002 two billion junk e-mail messages were being sent each day. This number is expected to triple every two years. Individuals and entities (e.g., businesses and government agencies) are becoming increasingly inconvenienced, and often offended, by junk messages. Spam therefore threatens, or soon will threaten, trustworthy computing.

A common technique used to thwart spam is the use of filtering systems and methods. One proven filtering technique is based on a machine learning approach. Machine learning filters assign to an incoming message a probability that the message is spam. In this approach, features are typically extracted from two classes of example messages (e.g., spam and non-spam messages), and a learning filter is applied to discriminate probabilistically between the two classes. Because many message features relate to content (e.g., whole words and phrases in the subject and/or body of the message), such filters are commonly referred to as "content-based filters". These machine learning filters usually employ exact-match techniques to detect spam messages and distinguish them from good messages.

US Pat. No. 6,161,130; US Pat. No. 6,023,723

Unfortunately, spammers are constantly finding ways around conventional spam filters, including those that employ machine learning systems. For example, spammers may use mathematical processing and sequential e-mail modification to test and predict spam filter performance. In addition, much information is publicly available that explains how common spam filters operate. Some Internet services even offer to run messages through specific filters and return the respective verdicts of those filters. Spammers thus have the opportunity to run their spam through various known spam filters and/or to modify a message until it gets through. In view of the foregoing, such conventional filters provide limited protection against spam.

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

Conventional machine learning spam filters can be reverse engineered by spammers, which allows them to find messages that the filter does not catch. In addition, when a spam filter always catches the same messages regardless of the user, spammers can easily find, by trial and error, a message that gets through. After finding that message, they can exploit the filter by sending the message to potentially millions of people. Unless the spam filter is modified in some way, this form of spammer trickery can continue indefinitely.

The subject invention provides systems and methods that facilitate obfuscating a spam filter, thereby making it more difficult for spammers to reverse engineer the filter and/or to find a message that consistently gets through it. In essence, the invention provides a way to modify the behavior of a spam filter, which can be accomplished in part by adding an element of randomization to the spam filtering process.

Most conventional spam filters process a message and return some sort of score for it. The score may be the probability that the message is spam, an arbitrary score, the log of that probability, a degree of match between the current message and a non-spam message, or any other number. Messages scoring above a certain threshold are labeled as spam in some way. Such labeling includes, but is not limited to, deleting the message, moving it to a special folder, issuing a challenge, and/or marking the message. Thus, one approach to modifying the behavior of the spam filtering process involves randomizing the scores of messages. Randomization includes, but is not limited to, adding some number to the score and/or multiplying the score by some factor such as 1.1 or 0.9.

A second approach to randomization involves the use of time. More specifically, the random number added to the message score changes with, and/or depends on, the current time of day or the current time increment. For example, the randomization can be programmed to use a different random number every 15 minutes or at any other desired time increment. Alternatively, the random number can change as the time of day changes. As a result, when a message near the threshold (between being considered spam and non-spam) is first blocked and then gets through the filter after a small (e.g., minor) modification, the spammer will find it difficult to determine whether the outcome changed because of the modification or because of the random factor.

A third approach to randomizing the filter depends in part on the user and/or domain receiving the message. For example, when the random number depends on the user, a spammer may find a message that gets through only to the spammer's test user and not to other users. Testing messages therefore becomes more costly for the spammer.

Message content is another aspect of randomization according to the present invention. For example, a random number can be computed based at least in part on the contents of the message. A related technique is hashing. A hash of a message is a pseudo-random number generated deterministically from its contents, such that a slight change to the contents results in a large change to the hash. If a spammer attempts to reverse engineer the message, a small change in the message contents could therefore result in a relatively large change in the message score. Alternatively or in addition, particular features of the message whose contributions to the score exceed a threshold can be extracted and hashed. This hash can then be used as input to a random number generator, making it more difficult to discover the contributions of the most important features.
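
The feature-hashing variant described above can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the names `CONTRIBUTION_THRESHOLD`, `MAX_NOISE`, and `important_feature_offset`, the choice of SHA-256, and the numeric values are all assumptions made for the example.

```python
import hashlib
import random
from typing import Dict

CONTRIBUTION_THRESHOLD = 1.0  # hypothetical cut-off for an "important" feature
MAX_NOISE = 0.05              # hypothetical bound on the random offset


def important_feature_offset(contributions: Dict[str, float]) -> float:
    """Seed a generator from a hash of the high-contribution features only."""
    important = sorted(
        name for name, value in contributions.items()
        if value > CONTRIBUTION_THRESHOLD
    )
    digest = hashlib.sha256("\n".join(important).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.uniform(-MAX_NOISE, MAX_NOISE)


base = {"word:free": 2.0, "word:hello": 0.1}
print(important_feature_offset(base))
# Changing only a low-contribution feature leaves the offset unchanged.
print(important_feature_offset({"word:free": 2.0, "word:hi": 0.1}))
```

Because only the high-contribution features feed the hash, swapping an unimportant word does not move the offset, while adding or removing an important feature reseeds the generator.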

It should be further noted that although randomization can be added to the spam filtering process, it is important to do so in a controlled manner. In particular, if a spam filter occasionally lets through messages that are obviously spam, reasonable users will be upset. Conversely, if messages that are obviously good are occasionally tagged as spam, reasonable users will again be upset. Accordingly, the subject invention confines the effect of randomization to messages that are "near" the boundary between spam and non-spam. In other words, randomizing the filtering process has essentially no effect on messages that are clearly spam or clearly non-spam. Rather, it affects the filtering of messages at and/or near the threshold between non-spam and spam.
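
One way to keep the randomization controlled is to bound the size of the random offset, so that only scores close to the threshold can change sides. The sketch below illustrates this under assumed values: `SPAM_THRESHOLD`, `MAX_NOISE`, and the function names are hypothetical and do not come from the specification.

```python
import random

SPAM_THRESHOLD = 0.5   # hypothetical decision threshold
MAX_NOISE = 0.05       # hypothetical bound on the random perturbation


def randomized_score(score: float, rng: random.Random) -> float:
    """Add a small bounded random offset to a filter score."""
    return score + rng.uniform(-MAX_NOISE, MAX_NOISE)


def classify(score: float, rng: random.Random) -> str:
    """Label a message after perturbing its score."""
    return "spam" if randomized_score(score, rng) > SPAM_THRESHOLD else "non-spam"


rng = random.Random(0)
# Scores far from the threshold keep their label regardless of the noise.
print(classify(0.95, rng), classify(0.05, rng))
# A borderline score can land on either side from one call to the next.
print({classify(0.50, rng) for _ in range(100)})
```

With these values, any score above 0.55 is always labeled spam and any score below 0.45 is always labeled non-spam, so only the borderline band is affected.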

Finally, instead of using a single best spam filter, multiple spam filters can be employed to hinder spammers from modeling and predicting spam filter performance. Using multiple spam filters forces different aspects of a message to be examined before the message is classified as spam or non-spam. Thus, a spammer who reverse engineers one filter, or finds a message that gets through one filter, will not necessarily get through a different one. Furthermore, the selection of which filter is used to process and classify a message can involve any one or a combination of the randomization techniques described above.
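
As a rough illustration of the multiple-filter idea, the sketch below picks one of several filters at random for each message. The two toy filters and the helper `pick_filter` are invented for this example, and uniform random choice is just one of the selection strategies the text allows.

```python
import random
from typing import Callable, Sequence

Filter = Callable[[str], float]  # maps message text to a spam score


def keyword_filter(message: str) -> float:
    """Toy filter: scores on a single keyword (illustrative only)."""
    return 0.9 if "free" in message.lower() else 0.1


def punctuation_filter(message: str) -> float:
    """Toy filter: scores on the share of exclamation marks (illustrative only)."""
    return min(1.0, 10 * message.count("!") / max(len(message), 1))


def pick_filter(filters: Sequence[Filter], rng: random.Random) -> Filter:
    """Choose which filter handles the next message."""
    return rng.choice(filters)


filters = [keyword_filter, punctuation_filter]
rng = random.Random()
chosen = pick_filter(filters, rng)
print(chosen.__name__, chosen("FREE offer!!!"))
```

A message tuned to slip past `keyword_filter` gives the spammer no assurance about `punctuation_filter`, which examines a different aspect of the message.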

To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the description and the accompanying drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed, and the invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

The present invention is now described with reference to the drawings, wherein like reference numerals refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, that the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.

As used in this application, the terms "component" and "system" refer to a computer-related entity, whether hardware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server itself can be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.

The subject invention can incorporate various inference schemes and/or techniques in connection with generating training data for machine-learned spam filtering. As used herein, the term "inference" refers generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations captured via events and/or data. Inference can be employed to identify specific content or a specific action, for example, or to generate a probability distribution over states. Inference can be probabilistic, that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data. Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources.

Although the term "message" is used extensively throughout the specification, it is not limited to electronic mail per se, but can be suitably adapted to include electronic messaging of any form that can be distributed over any suitable communication architecture. For example, conferencing applications that facilitate a conference between two or more people (e.g., interactive chat programs and instant messaging programs) can also utilize the filtering benefits disclosed herein, since unwanted text can be electronically interspersed into normal chat messages as users exchange messages and/or inserted as a lead-off message, a closing message, or all of the above. In this particular application, a filter can be trained to automatically filter particular message content (text and images) in order to capture undesirable content (e.g., commercials, promotions, or advertisements) and tag it as junk. Another exemplary application is SMS messages on cellular phones or similar devices.

One of the many purposes of obscuring the inner workings of a spam filter is to prevent a spammer from finding, without any knowledge of how the filter functions, a message that is almost always guaranteed to get through. Another purpose is to keep the spammer from understanding how the spam filter works, thereby hindering any attempt to reverse engineer it. This is particularly applicable to messages near the edge of spam, where a slight change to the message (e.g., adding or removing certain words or features) affects whether the filter "sees" the message as spam. For example, if a spammer learns that messages containing a certain word, such as "Viagra®", are always classified as spam, the spammer can simply craft messages without that word. Hence, it would be advantageous to construct a spam filter or spam filtering system that essentially precludes reverse engineering attempts.

Many spam filters use linear models. In a linear model, features of a message are extracted, such as the words in the message as well as special features such as whether the message was sent in the middle of the night. Each feature has an associated weight or score. The weights of all features found in the message are summed to yield a total weight (e.g., a total score). If the total weight exceeds some threshold, the message does not get through and is blocked from delivery. Conversely, if the total weight falls below the threshold, the message can get through to the recipient.
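
The linear model described above amounts to a weighted sum followed by a threshold test. A minimal sketch follows. The feature names, weights, and threshold are made up for illustration, whereas a real filter would learn its weights from training data.

```python
from typing import Dict, Iterable

# Hypothetical feature weights; a trained filter would learn these.
WEIGHTS: Dict[str, float] = {
    "word:free": 2.0,
    "word:meeting": -1.5,
    "sent_at_midnight": 0.8,
}
THRESHOLD = 1.0  # hypothetical decision threshold


def total_score(features: Iterable[str]) -> float:
    """Sum the weights of the features present in a message."""
    return sum(WEIGHTS.get(f, 0.0) for f in features)


def is_blocked(features: Iterable[str]) -> bool:
    """A message is blocked when its total score exceeds the threshold."""
    return total_score(features) > THRESHOLD


print(total_score(["word:free", "sent_at_midnight"]))
print(is_blocked(["word:meeting"]))
```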

Other types of models can also be used in spam filters, such as passing the score through a sigmoid function of the following form:

    final score = 1 / (1 + e^(-total score))

This converts the score into a number between 0 and 1 (referred to, for example, as the final score). This number can be further translated into a probability that helps determine whether the message is spam.
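
The conversion to a number between 0 and 1 can be sketched as follows, assuming the standard logistic sigmoid 1 / (1 + e^(-x)). The function name `final_score` is chosen for this example only.

```python
import math


def final_score(total_score: float) -> float:
    """Map a total score to the open interval (0, 1) with the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-total_score))


for s in (-4.0, 0.0, 4.0):
    print(s, round(final_score(s), 3))
```

A total score of zero maps to 0.5, large positive totals approach 1, and large negative totals approach 0.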

However, regardless of the model or strategy employed in the spam filter, a spammer may attempt to determine the scoring scheme associated with at least one feature extracted from a message. The spammer can do so by creating a large number of messages with different features and observing which messages are classified as spam (e.g., blocked from delivery) and which are not (e.g., delivered to recipients). Finally, the spammer tries to deduce which feature scores would lead to such classifications.

One approach to mitigating this type of spammer behavior involves modifying at least one of the various scores associated with a message in some small way so as to effectively add noise to the real score of the message. Modifying the score can be accomplished in part by randomizing the total score, the final score, or both. For example, in a typical spam filter, the final score of a message is compared to some (probability) threshold to determine whether the message is spam or is more spam-like than not. Modifying the final score by adding a random or pseudo-random number to it, or multiplying it by such a number, effectively increases or decreases the final score by some small amount, so that a score that was previously within the threshold may now exceed it. The message can then be tagged as spam, or at least as potential spam, because of the slight change in its overall score.

Alternatively, a similar form of modification can be applied to the total score, since its value determines whether the final score exceeds the threshold. Thus, once either the total score threshold or the final score threshold is reached, the message is unlikely to get through. Moreover, adding noise to the scores of messages near the threshold between spam and non-spam makes it harder for a spammer to determine whether the current status of a message (e.g., spam or non-spam, blocked or delivered) is due to some randomization feature or to some change in the message content.

Referring now to FIG. 1, there is illustrated a general block diagram of a spam filtering system 100 that facilitates obscuring the functionality of a spam filter in accordance with an aspect of the present invention. The system 100 comprises a spam filter 110 that processes a message 120 to yield a score 130 of the message, which ultimately determines whether the message is classified as spam (or spam-like) or non-spam (or non-spam-like).

More specifically, the spam filter 110 comprises a filter scoring component 140 and a randomization component 150 operatively coupled thereto. The filter scoring component 140 can employ a machine learning system that assesses the probability that the message 120 is spam. The filter can examine particular features of the message in order to make its assessment. For example, features relating to any origination information, as well as features relating to particular content of the message (e.g., embedded images, URLs, and words and/or phrases characteristic of spam), can be extracted and analyzed. The resulting score can then be altered, at least in part, by the randomization component 150.

The randomization component 150 comprises a random number generator 160 that can receive input from one or more input components 170 (e.g., INPUT COMPONENT 1 172, INPUT COMPONENT 2 174, up to INPUT COMPONENT N 176, where N is an integer greater than or equal to one) in order to cause a small or slight increase or decrease in the value of the resulting score (e.g., the total score and/or, if a sigmoid function is used, the final score).

Input from the input components 170 can take the form of some random or pseudo-random number that is added to each score before the message is classified as spam or non-spam. In this way the score of a message is changed, and a spammer who finds a message that gets through the filter may have found only a message that gets through at that one time because of a favorable random number. For example, imagine that the random number added to a particular spam message is 0.7. In the case of this particular message, adding 0.7 has little effect on its classification, and the message is allowed to get through. The spammer may then model future spam on this message. However, because the random number added can change at any time, these future spam messages may fail to get through, unbeknownst to the spammer. Moreover, the spammer will have difficulty determining why the earlier message got through while the more recent spam messages do not.

On the other hand, imagine that the random number is 1. This random number is high enough to work against the particular spam message. In other words, by adding the number 1 to the score of the spam message, the total or overall score of the message may now exceed some threshold. As a result, the message is classified as spam and is not allowed to get through the filter. Adding a random or pseudo-random number thus makes the filter harder to reverse engineer, because the score of a message, and whether it is classified as spam, may or may not change coincidentally with minor modifications to the message. The sender is therefore left uncertain whether the message got through this time because of a minor modification to it or because of a favorable random number.

Another form of input can involve the use of time. For example, when the computed random number depends on the day or the hour, a spammer must carry out classifications over a longer period of time in order to reverse engineer the filter. In some cases, the filter can be updated automatically on a regular basis, such as every day, so a filter whose randomization component 150 changes, for example, every four hours may itself change before the spammer can reverse engineer it. That is, the random number generator 160 can be programmed to use a different random number at various time increments, such as increments of 5 minutes, 10 minutes, 1 hour, and/or 4 hours.
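
A time-dependent random number of this kind can be sketched by hashing the index of the current time increment together with a secret. Everything named below (`SECRET`, `BUCKET_SECONDS`, `MAX_NOISE`, `time_offset`) is an assumption made for illustration, and the specification does not prescribe SHA-256 or any particular derivation.

```python
import hashlib
import time
from typing import Optional

SECRET = b"hypothetical-server-secret"  # assumed; keeps offsets unpredictable
BUCKET_SECONDS = 4 * 60 * 60            # four-hour increment, as in the text
MAX_NOISE = 0.05                        # hypothetical bound on the offset


def time_offset(now: Optional[float] = None) -> float:
    """Return an offset in [-MAX_NOISE, MAX_NOISE] that is fixed within a time bucket."""
    if now is None:
        now = time.time()
    bucket = int(now // BUCKET_SECONDS)
    digest = hashlib.sha256(SECRET + str(bucket).encode()).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return (2 * unit - 1) * MAX_NOISE


# The first two timestamps share a bucket; the third falls in the next one.
print(time_offset(0), time_offset(BUCKET_SECONDS - 1), time_offset(BUCKET_SECONDS))
```

All messages scored within the same increment receive the same offset, and the offset changes when the next increment begins.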

Furthermore, a spammer may find that a message is currently getting through during a first time increment. Immediately thereafter, the spammer could send out a few more copies of that message to further "test" the filter. Upon seeing that those messages get through, the spammer may then send out millions of copies. By that time, however, the randomization component 150 has moved on to another input component 170 and thus to another time increment (e.g., a second time increment). A different random number is therefore added during the second time increment, which adversely affects messages near the edge of spam, or indeed messages that were classified as non-spam only because of the previous random number. As a result, a spammer who succeeds in getting only a small percentage of the messages through the filter cannot readily determine whether a slight change to the message got it through or whether the random number changed.

Yet another type of input that can influence the random number generated by the randomization component 150 involves the user and/or domain receiving the message and/or the domain in which the spam filter is running. In particular, the random number generated can depend at least in part on the recipient of the message. For example, a spammer's test user can be recognized by at least a portion of its identifying information, such as its e-mail address, its display name, and/or its domain. The random number generated for the spammer's test user may therefore be low enough that spam messages get through to the test user almost every time.

By contrast, for other domain names and/or other users designated to receive messages, the generated random number can be high enough to keep the spammer's messages from getting through. The spammer may thus find a message that reaches the test user but no other users. A spammer who is unaware that only the test user is receiving the spam is tricked into modeling future spam messages on messages that reach only the test user, so the amount of spam delivered to other users is reduced. More generally, making the generated random number depend at least in part on some aspect of the message recipient makes it more expensive for spammers to test spam filters.

Alternatively, or in addition, the input can be based at least in part on the message content. This can be useful for mitigating reverse engineering of the spam filter's inner workings. More specifically, the random number is computed from the content of the message; that is, a hash of the message content is obtained. Hashing is the transformation of a string of characters into a usually shorter, fixed-length value or key that represents the original string. In this example, the hash value computed for each message is the random number.
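The content-based input described above can be sketched as follows. This is a minimal illustration only: the function name `content_offset`, the choice of SHA-256, and the scaling of the hash into a small symmetric range are assumptions made for the example and are not specified in the source.

```python
import hashlib


def content_offset(body: str, noise_range: float = 0.05) -> float:
    """Derive a repeatable offset in [-noise_range, noise_range] from the message body."""
    digest = hashlib.sha256(body.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an integer and scale it to [0, 1].
    unit = int.from_bytes(digest[:8], "big") / 2**64
    # Shift and scale to a symmetric range around zero (assumed, not from the source).
    return (2.0 * unit - 1.0) * noise_range
```

Identical bodies always map to the same offset, while even a one-word edit yields an unrelated offset, which is the property the passage relies on.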

Spammers often try to slightly modify the content of their messages to get around spam filters. When a spammer tries to reverse engineer the filter in this way, a small change to the message content can therefore cause a relatively large change in the message score. For example, suppose message "X" is classified as spam. The spammer adds a word such as "FREE!!!" that in fact makes the message more spam-like. Because of the randomization aspect of the present invention, however, the message may now be classified as non-spam, and the spammer mistakenly concludes that the word "FREE!!!" makes the message less spam-like when the reverse is true.

To counter the potentially adverse treatment of their messages under content-based randomization, spammers may attempt to add arbitrary words that they believe are unlikely to affect the message, such as "the" or "on". A spammer could then have many messages classified after changing only these words, and average the results to determine which types of modification are most successful at getting through the filter.

In anticipation of such spammer behavior, a hash of the features that substantially contribute to the message score can be computed instead. More specifically, recall that features can be extracted from a message. From the many features extracted, those whose individual contribution exceeds a given threshold (e.g., a threshold of 0.01) can be selected. A hash of the selected features is then computed and used as input to the random number generator 160. Because it is relatively difficult for a spammer to find out which features of a message contribute most to its score, reverse engineering the operation of this type of spam filter is very difficult.
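A minimal sketch of hashing only the substantially contributing features is given below. The function name, the use of the absolute contribution, the SHA-256 hash, and the output range are all assumptions for illustration; the source states only that features above a threshold such as 0.01 are selected and hashed.

```python
import hashlib
from typing import Mapping


def salient_feature_offset(
    contributions: Mapping[str, float],
    threshold: float = 0.01,
    noise_range: float = 0.05,
) -> float:
    """Hash only the features whose contribution exceeds the threshold."""
    # Sorting makes the result independent of the order in which features were extracted.
    salient = sorted(name for name, c in contributions.items() if abs(c) > threshold)
    digest = hashlib.sha256("\n".join(salient).encode("utf-8")).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64
    return (2.0 * unit - 1.0) * noise_range
```

Under these assumptions, padding a message with low-weight words such as "the" or "on" leaves the offset unchanged, so averaging over many padded copies tells the spammer nothing new.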

Alternatively, or in addition, a hash of the sender's claimed IP address can be computed to determine which random number is generated for the message. Here again, it is especially difficult for the spammer to determine which feature of the message is used to compute the hash and, in turn, which random number corresponds to that hash.

Once the randomization component 150 outputs a random number for a particular message, the random number can be added, for example, to the score or weight assessed by the filter scoring component 140. The total or final score 130 of the message is then obtained to facilitate classifying the message as spam or non-spam.

Rather than adding a randomizing function to the message score to obscure the operation of a spam filter, multiple spam filters can be deployed across multiple domains and/or for multiple users. In particular, a user can select one or more spam filters, randomly or non-randomly, to use for classifying messages. The filters themselves can be different types of spam filters and/or can be trained on different sets of training data. A spammer would thus have great difficulty working out which filter a particular recipient of its spam messages uses. Moreover, more than one filter at a time can take part in classifying a message, which makes it nearly impossible to find a message that gets through the filters nearly every time.

FIG. 2 is a block diagram of an example multi-filter spam filtering system 200 in accordance with an aspect of the present invention. The system 200 comprises a plurality of users 210 (e.g., user_1 212, user_2 214, up to user_Y 216, where Y is an integer greater than or equal to one). The users 210 are generally the recipients of any incoming messages, including spam messages. The system 200 also comprises a plurality of spam filters 220 (e.g., spam filter_1 222, spam filter_2 224, up to spam filter_W 226, where W is an integer greater than or equal to one).

Each spam filter 220 can be trained based at least in part on a different set of training data. More specifically, the first filter 222 can be trained via a machine learning system using a first subset of training data, and the second filter 224 can be trained in the same manner using a second subset of training data, which may or may not partially overlap with the first subset. For example, the first filter 222 covers common terms and the second filter 224 covers uncommon terms. Employing both filters means that different criteria, features, or content in the message are examined before the message is classified as spam or non-spam.

In a similar manner, certain data can be excluded from the training of one or more filters 220 as desired by a user. The excluded data can be chosen according to a random number generator. In addition, particular values can be assigned to some of the message features that are extracted and used to create the training data. The spam filters 220 can thus be user-specific or personalized, customized to varying degrees depending in part on user preferences and instructions.

Thereafter, a filter selection component 230, operatively coupled to the plurality of users 210 and to the plurality of spam filters 220, can communicate with the users 210 in order to select one or more filters 220 based at least in part on the particular user and/or on the user's selection. Alternatively, filter selection can be random, or can be based at least in part on a hash of the message content or on the size of the message.

As shown in the figure, filter selection can also be based in part on input received from a time input component 240. That is, different filters can be operational at different times of day. For example, if a message is sent at 2 PM, the plurality of filters 220 is available for use, whereas if the message is sent at 3 AM, only a subset of the filters 220 is available, such as the first, second, fourth, and sixth filters. Alternatively, only a single filter is used, and which filter that is depends on the time of day.
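Time-dependent filter selection of this kind might be sketched as below. The filter numbering follows the 2 PM and 3 AM example in the text; the total of six filters, the midnight-to-6-AM night window, and the names `ALL_FILTERS`, `NIGHT_FILTERS`, and `active_filters` are assumptions chosen for illustration.

```python
from datetime import time
from typing import List

ALL_FILTERS: List[int] = [1, 2, 3, 4, 5, 6]   # assumed total of six filters
NIGHT_FILTERS: List[int] = [1, 2, 4, 6]       # subset named in the example above

NIGHT_START = time(0, 0)  # assumed boundaries of the night window
NIGHT_END = time(6, 0)


def active_filters(sent_at: time) -> List[int]:
    """Return the filter numbers that are operational at the given time of day."""
    if NIGHT_START <= sent_at < NIGHT_END:
        return NIGHT_FILTERS
    return ALL_FILTERS
```

A message sent at 14:00 would be checked against all six filters, while one sent at 03:00 would see only filters 1, 2, 4, and 6.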

In addition to the above, the users 210 can be clustered into sub-groups by a clustering component 250 based on some similar qualities, characteristics, or type. Training data can be clustered in the same manner, resulting in filters trained on at least one cluster or type of data. The filter selection component 230 can thus select one or more spam filters 220 corresponding to a particular cluster of users. Employing multiple filters in a random or non-random manner as described herein can be more beneficial to spam filtering in general than relying on a single best spam filter. Reverse engineering, predicting spam filter performance, and finding a single message that gets through every time become more difficult for the spammer, since a message that gets through now will not necessarily get through the next time if a different filter is selected, randomly or non-randomly. Moreover, it is quite difficult, if not impossible, for the spammer to determine why a message does not get through every time or the next time it is sent, because the inner workings of the filters cannot readily be reverse engineered and/or predicted. Furthermore, although a small number of messages near the spam boundary may get through, delivery of the majority of near-"spam" messages can be effectively blocked by obfuscating the spam filtering process.

Various methodologies in accordance with the subject invention will now be described via a series of acts, as shown in FIGS. 3-8. It is to be understood and appreciated that the present invention is not limited by the order of acts, as some acts may, in accordance with the present invention, occur in different orders and/or concurrently with other acts from those shown and described herein. For example, those skilled in the art will understand and appreciate that a methodology could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all illustrated acts may be required to implement a methodology in accordance with the present invention.

Turning now to FIG. 3, there is illustrated a flow diagram of an example process 300 for randomizing the message score generated by a spam filter in accordance with an aspect of the present invention. The process 300 begins at 310, where a message is run through a spam filter. At 320, the spam filter assigns a score to the message. The score can be based on conventional spam filtering systems and methods, such as extracting one or more features of the message, each feature having a weight associated with it. The sum of the weights is computed to yield the message score. Before the message is classified as spam or non-spam, however, a random or pseudo-random number can be added to the score at 330 to mitigate reverse engineering of the spam filtering process.

A final score for the message is obtained at 340, and the message is then classified as spam or non-spam at 350. The random or pseudo-random number added to the original score given by the spam filter effectively adds noise to that score, which keeps spammers from reverse engineering the spam filter and/or from finding a message that consistently gets through it. In either case, a spammer who knew how the spam filter operated, or who could predict its response, could easily construct messages that get through the filter substantially every time. With a randomization component incorporated into the spam filter, however, the spammer has difficulty establishing whether a seemingly minor change to the message or some feature of the filter caused the message to change from "spam" to "non-spam" status (or vice versa), which makes reverse engineering the spam filter nearly impossible.
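The acts 320 through 350 of process 300 can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented implementation: the threshold of 0.5, the uniform noise in a range of plus or minus 0.05, and the names `base_score` and `classify` are hypothetical.

```python
import random
from typing import Iterable, Mapping

SPAM_THRESHOLD = 0.5  # assumed decision threshold


def base_score(features: Iterable[str], weights: Mapping[str, float]) -> float:
    """Act 320: sum the weights of the features found in the message."""
    return sum(weights.get(f, 0.0) for f in features)


def classify(features: Iterable[str], weights: Mapping[str, float],
             rng: random.Random, noise_range: float = 0.05) -> str:
    score = base_score(features, weights)            # act 320: original score
    noise = rng.uniform(-noise_range, noise_range)   # act 330: random number
    final_score = score + noise                      # act 340: final score
    return "spam" if final_score >= SPAM_THRESHOLD else "non-spam"  # act 350
```

Here the noise is drawn from a generic pseudo-random generator for brevity; the text goes on to tie the number to inputs such as time, user, or message content.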

The random number or factor alters the message score just enough to affect messages near the spam boundary. That is, messages that lie along the line between spam and non-spam are the ones most affected by this randomization approach. Other messages that are clearly spam (a very high score or probability) or clearly not spam (a very low score or probability) are substantially unaffected by randomizing the score. Furthermore, adding a purely random number to the message score each time is not as effective as the present invention, because the spammer could eventually ascertain the probability, or average probability, of a message getting through the filter, and could thus reverse engineer the filter, find a message that always gets through it, or both.
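The statement that only borderline messages are affected can be made precise under a simple assumption that the source does not state explicitly: the added number r is bounded in magnitude by some ε, and a message with original score s is classified as spam when its final score s' reaches a threshold t. The symbols s, s', r, t, and ε are introduced here for illustration only.

```latex
\[
s' = s + r, \qquad |r| \le \varepsilon
\]
\[
s \ge t + \varepsilon \;\Rightarrow\; s' \ge t \quad \text{(still spam)}, \qquad
s < t - \varepsilon \;\Rightarrow\; s' < t \quad \text{(still non-spam)}
\]
\[
\text{so the classification can change only when } |s - t| \le \varepsilon .
\]
```

With ε chosen small relative to the spread of scores, clearly spam and clearly non-spam messages keep their classification, and only messages within ε of the threshold can flip.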

The randomization of the score can depend on one or more types of input, as indicated at 360 of FIG. 3 and detailed in FIG. 4. FIG. 4 illustrates a flow diagram of an example process 400 for determining which random number is used. At 410, the process 400 involves selecting at least one of the following types of input on which the random number depends: time 420, user 430, and/or message content 440.

Time 420 refers to a time increment or a time of day. More specifically, the random number employed can change depending on the time increment used, such as 5 minutes, 10 minutes, 30 minutes, or 2 hours, or on the time of day. For instance, the value of the random number may change at midnight, then again at 5:00 AM, 7:30 AM, 11:00 AM, 4:13 PM, and so on.
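A random number that stays fixed within a time increment and changes from one increment to the next can be sketched as follows. The keyed hash (HMAC), the secret key, the 30-minute default, and the name `time_offset` are assumptions for this example; the source does not say how the number is derived from the time.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret"  # assumption: known only to the filter operator


def time_offset(epoch_seconds: int, increment_seconds: int = 1800,
                noise_range: float = 0.05) -> float:
    """Return an offset that is constant within each time increment."""
    # All timestamps in the same increment share one bucket number.
    bucket = epoch_seconds // increment_seconds
    digest = hmac.new(SECRET_KEY, str(bucket).encode("ascii"), hashlib.sha256).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64
    return (2.0 * unit - 1.0) * noise_range
```

Because the key is secret, an outside observer cannot predict the offset for the next increment even knowing the increment length, which matches the scenario in which a message tested in one increment behaves differently in the next.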

The identity of the user 430 (e.g., display name or email address), the user's domain, and/or the domain receiving or sending the message can also be used to influence which random number is used. When this tactic is implemented, spammers have greater difficulty testing spam filters to determine which messages get through to which users. Finally, the message content 440, or at least a portion of it, can determine which random number is added to the original (base) score.

Referring now to FIG. 5, there is illustrated a flow diagram of an example process 500 that uses the message content to determine the random number added to the base score of the message in accordance with the present invention. In particular, the process 500 begins at 510 by computing a hash of at least a portion of the message. For example, the random number can be calculated from the message body, so that if another message identical to this one appears, it is assigned the same random number or hash value. A slight change to the message body, however, can produce a significant change in the message score. For example, spammers attempt to add or delete seemingly insignificant words to make their spam messages look less spam-like. This may work for a relatively small fraction of their spam messages. For the majority of spam messages, however, delivery is blocked, because the spammer does not know which kinds of words increase or decrease the random number and/or the overall score of the message.

One alternative to hashing the message content is to compute a hash of certain features extracted from the message that substantially contribute to its score. The features that substantially contribute to the score can themselves be changed randomly or non-randomly. In this way, a spammer cannot run a large number of messages through the filter to find an average and thereby find a message that gets through regardless of which of its features are hashed. In addition, a hash can be computed on the sender's IP address, so that the classification of the message depends at least in part directly on the sender's origin information.

At 520, the random number is added to the original or base score previously determined by the spam filter independently of the randomization. The total score of the message is then obtained at 530, and at 540 the message can be classified as spam or non-spam.

The randomization approach described above in FIGS. 3-5 is merely one strategy that can be employed to hinder reverse engineering of spam filters and/or to hinder modeling of spam filter performance by spammers. Another strategy involves deploying multiple filters across multiple users and/or domains. Initially, the multiple filters can be individually trained using various subsets of training data that may or may not overlap in some way. When multiple filters examine and analyze a message, the filtering system looks at different criteria in the message at essentially the same time rather than focusing on just one particular aspect of the message. Using multiple filters therefore helps provide a more accurate classification of the message and mitigates reverse engineering of the filtering system, since it is difficult to determine which filters were used and which aspects of the message were factored into the classification.
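One way several filters might take part in a single classification is sketched below. The source does not say how the outputs of multiple filters are combined; the random choice of k filters and the strict majority vote used here are assumptions, as are the names `Filter` and `classify_with_random_filters`.

```python
import random
from typing import Callable, Sequence

Filter = Callable[[str], bool]  # returns True when the filter considers the text spam


def classify_with_random_filters(text: str, filters: Sequence[Filter],
                                 rng: random.Random, k: int = 2) -> str:
    """Pick k filters at random and classify by strict majority vote (assumed rule)."""
    chosen = rng.sample(list(filters), k)
    spam_votes = sum(1 for f in chosen if f(text))
    # A tie counts as non-spam in this sketch.
    return "spam" if 2 * spam_votes > k else "non-spam"
```

Since the subset of filters changes between calls, a message that slips past one pair of filters is not guaranteed to slip past the next pair.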

FIG. 6 illustrates a flow diagram of an example process 600 for training and employing multiple spam filters in a customized fashion based on clusters of user types. The process 600 can begin at 610 by clustering users into one or more groups according to user type, for example. At 620, the training data can be clustered in a similar manner to correspond to the clusters of user types. At 630, a plurality of filters can be individually trained, one for each cluster of training data. The filters are then ready to be employed at 640, whereby the filter that corresponds to a particular cluster of user types is used to classify messages for that cluster. To illustrate, suppose that filter R is trained with cluster R training data. Users in the cluster of user type R can then use filter R to classify their messages. It should be appreciated that the training data are clustered in the same manner in which the users are clustered.
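Process 600 can be illustrated with a toy sketch. The "training" step below is a deliberately simple stand-in (it flags words seen only in spam examples) and is not the machine learning system the source refers to; the cluster labels are assumed to be given, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, bool]      # (message text, is_spam)
Filter = Callable[[str], bool]  # returns True for spam


def train_filter(examples: List[Example]) -> Filter:
    """Toy stand-in for a learner: flag words that appear only in spam examples."""
    spam_words, ham_words = set(), set()
    for text, is_spam in examples:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    markers = spam_words - ham_words
    return lambda text: any(word in markers for word in text.lower().split())


def train_per_cluster(data_by_cluster: Dict[str, List[Example]]) -> Dict[str, Filter]:
    """Act 630: train one filter per cluster of training data."""
    return {cluster: train_filter(examples) for cluster, examples in data_by_cluster.items()}


def classify_for_user(user: str, text: str, user_cluster: Dict[str, str],
                      filters: Dict[str, Filter]) -> str:
    """Act 640: use the filter that corresponds to the user's cluster."""
    chosen = filters[user_cluster[user]]
    return "spam" if chosen(text) else "non-spam"
```

The same message can be classified differently for users in different clusters, because each cluster's filter was trained on different data.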

Alternatively, as shown in the example process 700 of FIG. 7, a plurality of filters can be trained at 710 using various subsets of training data. Optionally, at 720, one or more features or related data can be extracted from one or more of the subsets of training data. Although not shown in the figure, certain features extracted from the messages can be forced to take certain values or weights. At 730, one or more spam filters are trained using the respective subsets of training data, and at 740 they are employed to process messages. At 750, the messages can be classified as spam or non-spam as described hereinabove. Although not shown, time can also be a factor in determining which spam filters are used to classify messages; in other words, particular filters may be available only during certain times of the day. Filter selection can thus be random or non-random, based in part on the user receiving the message and/or on the time of day.

In order to provide additional context for various aspects of the present invention, FIG. 8 and the following discussion are intended to provide a brief, general description of a suitable operating environment 810 in which various aspects of the present invention may be implemented. While the invention is described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices, those skilled in the art will recognize that the invention can also be implemented in combination with other program modules and/or as a combination of hardware and software.

Generally, however, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular data types. The operating environment 810 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Other well-known computer systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

With reference to FIG. 8, an example environment 810 for implementing various aspects of the invention includes a computer 812. The computer 812 includes a processing unit 814, a system memory 816, and a system bus 818. The system bus 818 couples system components including, but not limited to, the system memory 816 to the processing unit 814. The processing unit 814 can be any of various available processors. Dual microprocessors and other multiprocessor architectures can also be employed as the processing unit 814.

The system bus 818 can be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus or external bus, and/or a local bus, using any of a variety of available bus architectures. These architectures include, but are not limited to, an 8-bit bus, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Integrated Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Accelerated Graphics Port (AGP), the Personal Computer Memory Card International Association (PCMCIA) bus, and Small Computer System Interface (SCSI).

システムメモリ８１６は、揮発性メモリ８２０および不揮発性メモリ８２２を含む。基本入出力システム（ＢＩＯＳ）は、例えば起動中など、コンピュータ８１２内の要素間で情報を転送する基本ルーチンを含み、不揮発性メモリ８２２に格納されている。不揮発性メモリ８２２には、それだけには限定されないが一例として、読取り専用メモリ（ＲＯＭ）、プログラマブルＲＯＭ（ＰＲＯＭ）、電気的プログラマブルＲＯＭ（ＥＰＲＯＭ）、電気的消去可能ＲＯＭ（ＥＥＰＲＯＭ）、フラッシュメモリなどがある。揮発性メモリ８２０には、ランダムアクセスメモリ（ＲＡＭ）などがあり、これは外部キャッシュメモリとして働く。ＲＡＭは、それだけには限定されないが一例として、シンクロナスＲＡＭ（ＳＲＡＭ）、ダイナミックＲＡＭ（ＤＲＡＭ）、シンクロナスＤＲＡＭ（ＳＤＲＡＭ）、ダブルデータレートＳＤＲＡＭ（ＤＤＲＳＤＲＡＭ）、拡張ＳＤＲＡＭ（ＥＳＤＲＡＭ）、シンクリンクＤＲＡＭ（ＳＬＤＲＡＭ）、およびＤＲＤＲＡＭ（ＤｉｒｅｃｔＲａｍｂｕｓＤＲＡＭ）など多くの形態で使用可能である。 The system memory 816 includes volatile memory 820 and nonvolatile memory 822. The basic input / output system (BIOS) includes a basic routine for transferring information between elements in the computer 812, such as during startup, and is stored in the non-volatile memory 822. Non-volatile memory 822 includes, but is not limited to, read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), flash memory, and the like. . Volatile memory 820 includes random access memory (RAM), which acts as external cache memory. Examples of RAM include, but are not limited to, synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), extended SDRAM (ESDRAM), and sync link DRAM ( It can be used in many forms such as SLDRAM) and DRDRAM (Direct Rambus DRAM).

The computer 812 also includes removable/non-removable, volatile/non-volatile computer storage media. FIG. 8 illustrates, for example, disk storage 824. Disk storage 824 includes, but is not limited to, devices such as a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 824 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive), or digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage 824 to the system bus 818, a removable or non-removable interface, such as interface 826, is typically used.

It should be understood that FIG. 8 describes software that acts as an intermediary between users and the basic computer resources described in the suitable operating environment 810. Such software includes an operating system 828. Operating system 828, which can be stored on disk storage 824, acts to control and allocate the resources of the computer system 812. System applications 830 take advantage of the management of resources by operating system 828 through program modules 832 and program data 834 stored either in system memory 816 or on disk storage 824. It should be understood that the present invention can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 812 through input devices 836. Input devices 836 include, but are not limited to, a pointing device such as a mouse, a trackball, a stylus, a touch pad, a keyboard, a microphone, a joystick, a game pad, a satellite dish, a scanner, a TV tuner card, a digital camera, a digital video camera, and a web camera. These and other input devices connect to the processing unit 814 through the system bus 818 via interface ports 838. Interface ports 838 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output devices 840 use some of the same types of ports as input devices 836. Thus, for example, a USB port may be used to provide input to computer 812 and to output information from computer 812 to an output device 840. Output adapter 842 is provided to illustrate that some output devices 840, such as monitors, speakers, and printers, require special adapters. By way of illustration and not limitation, output adapters 842 include video cards and sound cards that provide a means of connection between the output device 840 and the system bus 818. It should be noted that other devices and/or systems of devices, such as remote computer 844, provide both input and output capabilities.

Computer 812 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer 844. The remote computer 844 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor-based appliance, a peer device, or another common network node, and typically includes many or all of the elements described relative to computer 812. For purposes of brevity, only a memory storage device 846 is illustrated with remote computer 844. Remote computer 844 is logically connected to computer 812 through a network interface 848 and then physically connected via communication connection 850. Network interface 848 encompasses communication networks such as local area networks (LAN) and wide area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet (registered trademark)/IEEE 802.3, and Token Ring/IEEE 802.5. WAN technologies include, but are not limited to, point-to-point links, circuit-switched networks such as Integrated Services Digital Networks (ISDN) and variations thereof, packet-switched networks, and Digital Subscriber Lines (DSL).

Communication connection 850 refers to the hardware/software employed to connect the network interface 848 to the bus 818. While communication connection 850 is shown inside computer 812 for illustrative clarity, it can also be external to computer 812. The hardware/software necessary for connection to the network interface 848 includes, for exemplary purposes only, internal and external technologies such as modems (including regular telephone-grade modems, cable modems, and DSL modems), ISDN adapters, and Ethernet (registered trademark) cards.

What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methods for purposes of describing the present invention, but one of ordinary skill in the art will recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim.

FIG. 1 is a schematic block diagram illustrating a system that helps obfuscate a spam filtering process in accordance with an aspect of the present invention.
FIG. 2 is a schematic block diagram illustrating a system that uses multiple filters to help obfuscate a spam filtering process in accordance with an aspect of the present invention.
FIG. 3 is a flow diagram illustrating an example method that helps obfuscate a spam filtering process in accordance with an aspect of the present invention.
FIG. 4 is a flow diagram illustrating an example method that assists in determining a random or pseudo-random number in accordance with an aspect of the present invention.
FIG. 5 is a flow diagram illustrating an example method that assists in performing randomization based in part on message content in accordance with an aspect of the present invention.
FIG. 6 is a flow diagram illustrating an example method that helps obfuscate a spam filtering process by training and using multiple spam filters in accordance with an aspect of the present invention.
FIG. 7 is a flow diagram illustrating an example method that helps obfuscate a spam filtering process by training and using multiple spam filters in accordance with an aspect of the present invention.
FIG. 8 is a schematic block diagram illustrating an example communication environment in accordance with the present invention.

Explanation of symbols

100 Spam filtering system
110 Spam filter
120 Message
130 Total score of message
140 Filter scoring component
150 Randomization component
160 Random number generator
172 Input component 1
174 Input component 2
176 Input component N
200 Multi-filter spam filtering system
210 Plurality of users
212 User 1
214 User 2
216 User Y
222 Spam filter 1
224 Spam filter 2
226 Spam filter W
230 Filter selection component
240 Time input component
250 Clustering component

Claims

A spam filtering system comprising:
one or more spam filters; and
a randomization component that obfuscates the functionality of the one or more spam filters so as to hinder reverse engineering of the one or more spam filters.

The system of claim 1, wherein the randomization component randomizes the scores of the filter so that, when a message near the threshold changes from one of blocked or delivered to the other, it is difficult for a spammer to determine whether the change was caused by a modification to the message or by the randomization component.

The system of claim 1, wherein the randomization component includes a random number generator that generates at least one of a random number and a pseudo-random number.

The system of claim 3, wherein the randomization component includes one or more input components, and the one or more input components provide input to the random number generator to assist in determining which random number to generate for a particular message.

The system of claim 1, wherein the randomization component generates a random number based at least in part on input received from one or more input components.

The system of claim 5, wherein the input from the one or more input components is based at least in part on time.

The system of claim 6, wherein the generated random number depends on at least one of a time and a time increment, such that the generated number changes according to either the time or the current time increment.

The system of claim 5, wherein the input from the one or more input components is based at least in part on at least one of a user, a recipient, and a domain receiving the message.

The system of claim 8, wherein the generated random number depends on at least one of the user, the recipient, and the domain receiving the message, such that the generated number changes according to any one of the identity of the user, the identity of the recipient of the message, and the domain receiving the message.

The system of claim 9, wherein the identity of either the user or the recipient includes at least one of a display name and an email address.
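
As an illustration of the time-dependent and recipient-dependent inputs recited in the preceding claims, the following Python sketch derives a pseudo-random score offset from a recipient address and the current time increment. The function name `score_offset`, the one-hour increment, the offset width, and the use of SHA-256 are assumptions made for this example and are not taken from the specification.

```python
import hashlib


def score_offset(recipient: str, epoch_seconds: int,
                 increment_seconds: int = 3600, width: float = 0.05) -> float:
    """Return a pseudo-random offset in roughly [-width, width).

    The offset stays fixed for a given recipient within one time increment
    and changes when the recipient or the time increment changes.
    """
    # Index of the current time increment (hypothetical one-hour buckets).
    bucket = epoch_seconds // increment_seconds
    key = f"{recipient.lower()}|{bucket}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return (2.0 * fraction - 1.0) * width
```

Under these assumptions, a spammer who probes the filter from a single test account during one hour sees a different offset than another recipient sees, or than the same account sees an hour later, so an observed verdict does not transfer reliably.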

The system of claim 5, wherein the input from the one or more input components is based at least in part on the content of the message.

The system of claim 11, wherein the generated random number varies according to at least a portion of the content of the message.

The system of claim 11, wherein a hash of the message content is calculated and the hash value is used as the random number, so that even a slight change to the message content results in a substantially large change to the generated random number.

The system of claim 11, wherein a hash of at least a portion of the features extracted from a message is calculated to help randomize the message score and thereby obscure the functionality of the spam filter.

The system of claim 14, wherein the features used to calculate the hash have respective individual weights that are greater than some threshold.
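
The feature-hash variant recited above can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical `weights` dictionary that maps feature names to learned weights, a hypothetical threshold `min_weight`, and SHA-256 as the hash; none of these specifics come from the specification.

```python
import hashlib


def feature_hash_offset(features, weights, min_weight=1.0, width=0.05):
    """Return an offset derived only from the heavily weighted features."""
    # Keep only features whose individual weight exceeds the threshold, so
    # edits to unimportant words leave the offset unchanged.
    important = sorted({f for f in features if weights.get(f, 0.0) > min_weight})
    # Join with a separator that does not occur in ordinary feature names.
    digest = hashlib.sha256("\x1f".join(important).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return (2.0 * fraction - 1.0) * width
```

Hashing only the high-weight features means that padding a message with random low-weight words does not shift the offset, whereas a hash of the full message content changes with any edit.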

The system of claim 11, wherein a hash of the sender's IP address is calculated to help randomize message scores and thereby obscure the functionality of the spam filter.

The system of claim 1, wherein the randomization component substantially affects messages that lie on the boundary between spam and non-spam, so that a message that is borderline spam is classified as spam at least once as a result of randomizing the score of the message.

The system of claim 1, wherein the randomization component prevents a spammer from finding at least one message that passes through the spam filter substantially every time it is sent.

The system of claim 1, wherein the spam filtering system uses a sigmoid function having the following formula:
[formula not reproduced in the source text]
and wherein at least one of the total score value and the final score value is randomized in order to effectively modify spammer behavior and hinder reverse engineering of the filtering system.

A multi-filter spam filtering system that hinders reverse engineering of spam filters and substantially prevents finding a single message that always passes through the spam filters, the system comprising:
a plurality of spam filters including at least a first spam filter and a second spam filter for processing and classifying messages;
a plurality of users including at least a first user and a second user; and
a filter selection component that selects one or more filters to be deployed for use by at least one of the plurality of users.

The system of claim 20, further comprising a time input component that communicates with the filter selection component such that one or more of the plurality of filters are selected and deployed for each user based at least in part on either a time or a time increment.

The system of claim 21, wherein the time increment is any number of seconds, minutes, hours, days, weeks, months, and years.

The system of claim 20, wherein the filter selection component randomly selects the one or more filters.

The system of claim 20, wherein the filter selection component selects the one or more filters non-randomly.

The system of claim 20, wherein the filter selection component selects the one or more filters to be deployed for each user based at least in part on at least one of the respective user, the sender's domain, the domain running the filtering system, and the domain receiving the message.
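
A filter selection component of the kind recited in the preceding claims might be sketched as below. The function name `select_filter`, the one-day time increment, and the hash-based choice are assumptions for illustration; the claims leave the selection rule open, including both random and non-random selection.

```python
import hashlib


def select_filter(filters, user: str, epoch_seconds: int,
                  increment_seconds: int = 86400):
    """Pick one filter for a user; the choice is stable within a time increment."""
    if not filters:
        raise ValueError("at least one filter is required")
    # Index of the current time increment (hypothetical one-day buckets).
    bucket = epoch_seconds // increment_seconds
    key = f"{user.lower()}|{bucket}".encode("utf-8")
    # Deterministic but hard-to-predict index derived from user and bucket.
    index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(filters)
    return filters[index]
```

Because different users are likely to be served by different filters on a given day, a message tuned against one test account is not guaranteed to pass the filter deployed for another recipient.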

The system of claim 20, wherein the user is a recipient of the message.

The system of claim 20, wherein at least some of the plurality of spam filters are trained using one or more sets of training data via a machine learning system.

The system of claim 27, wherein the training data corresponds to features extracted from messages.

The system of claim 28, wherein at least some of the features extracted from the message are forced to have a specific value.

The system of claim 28, wherein at least some of the features extracted from the message are excluded from the training data.

The system of claim 28, wherein at least some of the features extracted from the message are clustered by feature type, and each cluster of data is used to train a separate filter.

The described system, wherein at least some of the plurality of users are clustered by a user type associated with the clusters of feature types, and the spam filter corresponding to a user's type is used for that user.

The system of claim 20, wherein the first filter is trained using at least a first subset of training data and the second filter is trained using at least a second subset of training data, and at least a portion of the second subset does not overlap with at least a portion of the first subset.

The system of claim 33, wherein the first filter and the second filter are deployed for use together so that different criteria and/or message characteristics are examined before the message is classified as spam or non-spam.

A method that helps obfuscate a spam filter, comprising:
passing a message through a spam filter;
calculating at least one score associated with the message;
randomizing the score of the message before classifying the message as spam or non-spam; and
classifying the message as spam or non-spam.

The method of claim 35, wherein the at least one score associated with the message includes a final score and a total score.

The method of claim 36, wherein the total score is the sum of all scores associated with individual features extracted from the message.

The method of claim 36, wherein the final score is a sigmoid function of the total score and corresponds to a value from 0 to 1 indicating a probability that the message is spam or non-spam.

The method of claim 35, wherein randomizing the score of the message comprises adding at least one of a random number and a pseudo-random number to the score of the message.
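
The method steps recited above (compute a total score, add a random number, map to a final score, classify) can be sketched in Python as follows. The standard logistic function is used for the sigmoid, and the uniform noise distribution, the noise width, and the 0.5 threshold are assumptions for this example; the source text does not reproduce the exact formula or parameters.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(features, weights, rng, noise_width=0.5, threshold=0.5):
    """Return (final_score, is_spam) for a message given its features."""
    # Total score: sum of the weights of the features found in the message.
    total = sum(weights.get(f, 0.0) for f in features)
    # Randomize the score before classification (assumed uniform noise).
    noisy_total = total + rng.uniform(-noise_width, noise_width)
    # Final score: sigmoid of the randomized total score.
    final = sigmoid(noisy_total)
    return final, final >= threshold


# Hypothetical feature weights for illustration.
weights = {"free": 3.0, "winner": 2.0, "meeting": -5.0}
rng = random.Random(0)
print(classify(["free", "winner"], weights, rng))             # far from the boundary
print(classify(["free", "winner", "meeting"], weights, rng))  # on the boundary
```

With these numbers, a message whose total score is far from the decision boundary is classified the same way on every attempt, while a borderline message is sometimes blocked and sometimes delivered, which is the behavior the randomization is meant to produce.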

The method of claim 39, wherein the number added to the score of the message depends at least in part on at least one of:
a time; and
a time increment.

The method of claim 39, wherein the number added to the score of the message depends at least in part on at least one of:
a user;
a recipient of the message;
a domain receiving the message;
the sender's domain; and
a name of a machine running the filter.

The method of claim 39, wherein the number added to the score of the message depends at least in part on at least one of:
a hash of the message content; and
a hash of at least some of the features extracted from the message.

The method of claim 42, wherein the features used to calculate the hash each have a weight greater than zero.

The method of claim 42, wherein the features used to calculate the hash can vary randomly or non-randomly depending on at least one of a time and a time increment.

The method of claim 39, wherein the number added to the score of the message depends at least in part on a hash of the sender's IP address.

The method of claim 39, wherein the number added to the score of the message depends on input from one or more input components.

A method that hinders reverse engineering of a spam filter and prevents spammers from finding a specific message that always passes through the filter, comprising deploying a plurality of spam filters across a plurality of users.

The method of claim 47, wherein deploying at least some of the plurality of spam filters depends on at least one of a time and a time increment.

The method of claim 47, wherein deploying at least some of the plurality of spam filters depends at least on the one or more users using the spam filters.

The method of claim 47, wherein deploying at least some of the plurality of spam filters depends on at least one of a hash of message content and a size of the message.

The method of claim 47, further comprising selecting at least some of the plurality of spam filters and deploying them randomly.

The method of claim 47, further comprising selecting at least some of the plurality of spam filters and deploying them non-randomly.

The method of claim 47, wherein the plurality of spam filters are trained on a set of training data via a machine learning process.

The method of claim 53, wherein training the spam filters comprises:
creating a set of training data;
training at least a first spam filter using at least a first subset of the training data; and
training at least a second spam filter using at least a second subset of the training data, such that the second subset is not equivalent to the first subset.
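
One way to obtain non-equivalent training subsets, as recited in the claim above, is sketched below. The helper name `training_subsets` and the leave-one-slice-out scheme are assumptions for illustration only; the claim requires only that the subsets differ, and the training of the filters themselves is outside the scope of this sketch.

```python
import random


def training_subsets(training_data, n_filters=2, seed=0):
    """Return one training subset per filter; each omits a different slice."""
    if n_filters < 2 or len(training_data) < n_filters:
        raise ValueError("need at least two filters and one example per filter")
    shuffled = list(training_data)
    random.Random(seed).shuffle(shuffled)
    subsets = []
    for i in range(n_filters):
        # Subset i leaves out every n_filters-th example starting at position i,
        # so no two subsets contain exactly the same examples.
        subsets.append([x for j, x in enumerate(shuffled) if j % n_filters != i])
    return subsets
```

Each filter trained on its own subset would learn somewhat different weights, so a message crafted to pass one filter is less likely to pass all of them.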

The method of claim 53, wherein training the spam filters comprises:
clustering training data by type to correspond to clusters of user types;
training at least a first filter with a first cluster of data; and
training at least a second filter with a second cluster of data.

The method of claim 55, wherein the first filter is deployed for users belonging to the cluster of the related type.

A computer-readable medium comprising the method of claim 35.

A computer-readable medium comprising the method of claim 47.

A computer-readable medium storing computer-executable components, including a randomization component that obfuscates the functionality of one or more spam filters so as to hinder reverse engineering of the one or more spam filters.

The computer-readable medium of claim 59, wherein the randomization component randomizes the score of the filter.

The computer-readable medium of claim 59, wherein the randomization component includes a random number generator that generates at least one of a random number and a pseudo-random number.

A system comprising:
means for passing a message through a spam filter;
means for calculating at least one score associated with the message;
means for randomizing the score of the message before classifying the message as spam or non-spam; and
means for classifying the message as spam or non-spam.

A system that hinders reverse engineering of a spam filter and prevents spammers from finding a specific message that always passes through the filter, comprising means for deploying a plurality of spam filters across a plurality of users.