TWI287720B - Junk mail filtering systems and methods based on abnormal features in e-mails - Google Patents
Junk mail filtering systems and methods based on abnormal features in e-mails
- Publication number
- TWI287720B
- Authority
- TW
- Taiwan
- Prior art keywords
- string
- abnormal
- abnormal feature
Description
I287720
V. Description of the Invention
0213-A40531TW(N2); ACI93003TW; SNOWBALL.ptd

[Technical Field]
The present invention relates to spam filtering technology, and in particular to a spam filtering system and method based on abnormal features.

[Prior Art]
As network infrastructure has developed, many convenient network services have emerged, and with them many problems. One of these is the large volume of spam (unsolicited bulk e-mail) that sellers generate for marketing, which is a considerable nuisance to users. Because marketing by e-mail is much cheaper than marketing through conventional media, sellers use the e-mail delivery infrastructure to reach consumers, and a single mailing can comprise as many as a million advertising e-mails.

In view of the trouble caused by spam, many solutions for handling it already exist. They can be broadly divided into server-side and user-side mail filters. Conventional filters rely on IP filtering or on illegal-word filtering. A filter that uses IP filtering first builds a large IP blacklist, adding IP addresses that experience shows are likely to send spam. This approach has several drawbacks. First, advertisers often send mail from forged (pretended) IP addresses, which are therefore hard to blacklist. In addition, blocking mail by IP address also blocks legitimate e-mail sent from blacklisted addresses, so that e-mail which should have been received is kept out.

To avoid these drawbacks, another kind of filter uses illegal-word filtering, which does not look at IP addresses but filters mail according to its content. First, a large set of training e-mails is built, each classified as normal or spam. Then, for each incoming e-mail, a decision method or rule such as Bayesian classification determines whether the e-mail is spam according to how similar it is to the normal and spam e-mails in the training set. However, many spam e-mails now insert odd strings in places other than the visible body text in order to skew the decision method or rule and avoid being classified as spam.

A spam filtering system and method based on abnormal features is therefore needed to improve the accuracy of mail filtering.
[Summary of the Invention]
In view of the above, an object of the present invention is to provide a spam filtering system and method based on abnormal features, so as to improve the accuracy of mail filtering.

To this end, an embodiment of the invention discloses a spam filtering method based on abnormal features, comprising: receiving an e-mail; extracting a plurality of first abnormal feature strings from the e-mail using mail abnormal feature extraction rules; finding, according to the first abnormal feature strings, a plurality of groups associated with the first abnormal feature strings; and determining whether the e-mail is spam according to the similarity between the first abnormal feature strings and the second abnormal feature strings associated with each of the groups.

In some cases, the step of determining whether the e-mail is spam may comprise: obtaining the second abnormal feature strings associated with each group; calculating a similarity value for each group according to the similarity between the first abnormal feature strings and that group's second abnormal feature strings; determining whether any similarity value exceeds a first initial setting value; and, when a similarity value exceeds the first initial setting value, determining that the received e-mail is spam. When a similarity value exceeds the first initial setting value, the method may further comprise: obtaining the cumulative count associated with the group whose similarity value exceeds the first initial setting value, where the cumulative count represents the number of e-mails associated with that group; incrementing the cumulative count by one; determining whether the cumulative count exceeds a second initial setting value; when the cumulative count exceeds the second initial setting value, executing a spam handling procedure so that the user does not receive the e-mail; and, when the cumulative count does not exceed the second initial setting value, executing a normal mail handling procedure so that the user can receive the e-mail.

In some cases, a hash table, a hash function, and a collision handling method may be used together to find the groups associated with the first abnormal feature strings.

An embodiment of the invention also discloses a computer-readable storage medium storing a computer program which, when loaded into a computer system, causes the computer system to execute the above spam filtering method based on abnormal features.

An embodiment of the invention further discloses a spam filtering system based on abnormal features, comprising a communication device and a processing unit. The processing unit is coupled to the communication device, receives an e-mail through the communication device, and extracts a plurality of first abnormal feature strings from the e-mail using mail abnormal feature extraction rules. The processing unit finds, according to the first abnormal feature strings, a plurality of groups associated with the first abnormal feature strings, and determines whether the e-mail is spam according to the similarity between the first abnormal feature strings and the second abnormal feature strings associated with each of the groups. The system further comprises a storage device storing the groups. Each group is associated with a plurality of previously received e-mails, and the second abnormal feature strings of the e-mails within a group are more similar to one another than to the abnormal feature strings of other e-mails. The processing unit obtains the second abnormal feature strings associated with each group, calculates a similarity value according to the similarity between the first abnormal feature strings and each group's second abnormal feature strings, determines whether any similarity value exceeds a first initial setting value, and, if so, determines that the received e-mail is spam. When a similarity value exceeds the first initial setting value, the processing unit further obtains the cumulative count associated with that group, increments it by one, and determines whether it exceeds a second initial setting value. When the cumulative count exceeds the second initial setting value, a spam handling procedure is executed so that the user does not receive the e-mail. When the similarity value exceeds the first initial setting value but the cumulative count does not exceed the second initial setting value, a normal mail handling procedure is executed so that the user can receive the e-mail.

The e-mail may contain a sending server name, a sender e-mail address, recipient e-mail addresses, carbon-copy (CC) recipient e-mail addresses, blind-carbon-copy (BCC) recipient e-mail addresses, and a mail body.

The mail abnormal feature extraction rules may be at least one of the following:
(1) extracting the string corresponding to the sender e-mail address as a first abnormal feature string;
(2) extracting the strings corresponding to the recipient, CC recipient, or BCC recipient e-mail addresses as first abnormal feature strings;
(3) extracting the string corresponding to the sending server name as a first abnormal feature string;
(4) extracting strings in the mail body that correspond to hyperlinks as first abnormal feature strings;
(5) extracting strings in the mail body whose foreground color is the same as the background color as first abnormal feature strings;
(6) extracting strings in the mail body that cannot be found in a dictionary as first abnormal feature strings;
(7) extracting strings in the mail body that are neither Chinese nor English as first abnormal feature strings;
(8) extracting strings in the mail body that correspond to text attribute values inside image-display tags as first abnormal feature strings;
(9) extracting strings in the mail body that have special text effects as first abnormal feature strings; and
(10) extracting strings in the mail body that correspond to spam semantics as first abnormal feature strings.

Each group may be associated with a plurality of previously received e-mails, and the second abnormal feature strings of the e-mails within a group are more similar to one another than to the abnormal feature strings of other e-mails.
[Embodiments]
FIG. 1 is a hardware architecture diagram of a spam filtering system 10 based on abnormal features according to an embodiment of the invention. The system comprises a processing unit 11, a memory 12, a storage device 13, an output device 14, an input device 15, and a communication device 16, connected together by a bus 17. Those skilled in the art may also implement the system on other computer system configurations, for example hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, mainframes, and the like. The processing unit 11 may comprise a single central processing unit (CPU) or multiple parallel processing units associated with a parallel processing environment. The memory 12 comprises read-only memory (ROM), flash memory (flash ROM), and/or random access memory (RAM), and stores program modules and data executable by the processing unit 11. In general, program modules include routines, programs, objects, components, and the like, which perform the spam filtering functions based on mail features.
The invention may also be practiced in a distributed computing environment, in which computing tasks are performed by remote processing devices linked through a communication network. In a distributed environment, the functions of the spam filtering system 10 may be carried out jointly by a local computer system and multiple remote computer systems. The storage device 13 comprises a hard disk drive, floppy disk drive, optical disc drive, or flash drive, used to read program modules and/or data stored on hard disks, floppy disks, optical discs, or flash drives. The communication device 16 may be a wired network card or a wireless network card compliant with GPRS, 802-series, or similar specifications.

An e-mail contains a sending server name, a sender e-mail address, recipient e-mail addresses, carbon-copy (CC) recipient e-mail addresses, a subject, a body, attachments, and similar content. The body may contain various instructions conforming to Hypertext Markup Language (HTML) that display text content, provide a background or hyperlinks, add sound, and so on.

FIG. 2 is a flowchart of a spam filtering method based on abnormal features according to an embodiment of the invention.
First, in step S21, an e-mail is obtained through the communication device 16. In step S23, mail abnormal feature extraction rules are used to extract a plurality of first abnormal feature strings from the e-mail; the detailed rules are described in the paragraphs that follow. In step S25, a plurality of groups associated with the first abnormal feature strings are found according to the first abnormal feature strings. In step S27, whether the e-mail is spam is determined according to the similarity between the first abnormal feature strings and the second abnormal feature strings associated with each of the groups.

FIG. 3 is a flowchart of a spam filtering method based on abnormal features according to an embodiment of the invention.

First, in step S311, an e-mail is obtained through the communication device 16. In step S313, all strings in the e-mail that match the mail abnormal feature extraction rules are extracted (such strings are referred to below as abnormal feature strings). The mail abnormal feature extraction rules may use at least one of the following rules to extract abnormal feature strings:

Rule 1: extract the string corresponding to the sender e-mail address as an abnormal feature string;
Rule 2: extract the strings corresponding to the recipient, CC recipient, or BCC recipient e-mail addresses as abnormal feature strings;
Rule 3: extract the string corresponding to the sending server as an abnormal feature string;
Rule 4: extract strings in the mail body that correspond to hyperlinks as abnormal feature strings;
Rule 5: extract strings in the mail body whose foreground color is the same as the background color as abnormal feature strings; such strings are also called invisible ink;
Rule 6: extract strings in the mail body that cannot be found in a dictionary as abnormal feature strings; such strings are also called word salad;
Rule 7: extract strings in the mail body that are neither Chinese nor English as abnormal feature strings;
Rule 8: extract strings in the mail body that correspond to text attribute values inside HTML tags that display images as abnormal feature strings; for example, if an image-display HTML tag is "<img scr="image1.gif" text="advertisement"/>", the string corresponding to the text attribute value is "advertisement";
Rule 9: extract strings in the mail body that have special text effects, for example enlarged fonts or blinking, as abnormal feature strings; and
Rule 10: extract strings in the mail body that correspond to spam semantics as abnormal feature strings.

Steps S321 to S325 form a loop that is executed repeatedly to obtain the groups corresponding to all of the extracted abnormal feature strings. The storage device 13 stores a plurality of groups. Each group contains a plurality of abnormal feature strings obtained from at least one similar e-mail, so that e-mails belonging to the same group have more similar sets of abnormal feature strings than e-mails in other groups. In addition, each group has a cumulative count value representing the number of e-mails received so far that are similar to the abnormal feature strings in that group.
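A few of the extraction rules above can be sketched in code. The following is only an illustrative sketch covering rules 4, 5, and 8 for an HTML body, using Python's standard `html.parser`. The class name `AbnormalFeatureExtractor`, the function `extract_features`, the assumed white background, the restriction to `<font color=...>` for rule 5, and the handling of an `alt` attribute alongside the `text` attribute of the patent's example are all assumptions made for illustration, not part of the patent.

```python
from html.parser import HTMLParser


class AbnormalFeatureExtractor(HTMLParser):
    """Collects strings matching rules 4, 5 and 8 from an HTML mail body."""

    def __init__(self, background="#ffffff"):
        super().__init__()
        self.background = background.lower()  # assumed page background colour
        self.features = []                    # extracted abnormal feature strings
        self._font_colors = []                # stack of enclosing <font color=...> values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Rule 4: strings corresponding to hyperlinks.
            self.features.append(attrs["href"])
        elif tag == "img":
            # Rule 8: text attribute values inside image tags. `text` follows
            # the patent's example; `alt` is an added assumption.
            for name in ("text", "alt"):
                if attrs.get(name):
                    self.features.append(attrs[name])
        elif tag == "font":
            self._font_colors.append((attrs.get("color") or "").lower())

    def handle_endtag(self, tag):
        if tag == "font" and self._font_colors:
            self._font_colors.pop()

    def handle_data(self, data):
        text = data.strip()
        # Rule 5: text whose foreground colour equals the background ("invisible ink").
        if text and self._font_colors and self._font_colors[-1] == self.background:
            self.features.append(text)


def extract_features(html_body, background="#ffffff"):
    """Return the abnormal feature strings found in an HTML mail body."""
    parser = AbnormalFeatureExtractor(background)
    parser.feed(html_body)
    parser.close()
    return parser.features
```

A real filter would also need to resolve colors set through CSS and to apply the header-based rules 1 to 3; those are left out here to keep the sketch short.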
The loop is described in detail as follows. In step S321, the next extracted abnormal feature string is obtained. In step S323, all groups related to this abnormal feature string are retrieved. To speed up processing, a hash table and a hash function may be used for the retrieval. The hash table is stored in the storage device 13 and comprises a hash area and a collision area holding multiple records; each record stores one abnormal feature string and the groups associated with it. The actual storage address of a given abnormal feature string in the hash area is computed by the hash function from the string's character encoding (for example ASCII, BIG5, or GB2312). Specifically, step S323 first uses the hash function to compute a storage address from the obtained abnormal feature string and checks whether the string stored at that address matches. If it matches, the groups associated with it are obtained from that storage address; if not, collision processing is performed, searching the storage space of the collision area for a matching abnormal feature string and obtaining its associated groups.
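The lookup in step S323 can be illustrated with a small table that has a hash area and a separate collision area. This is a minimal sketch under stated assumptions: the class name `FeatureIndex`, the byte-sum hash function, the default table size, and the default UTF-8 encoding are hypothetical choices for illustration (the text above mentions ASCII, BIG5, and GB2312 and does not specify a hash function).

```python
class FeatureIndex:
    """Hash area plus collision area mapping feature strings to group ids."""

    def __init__(self, size=1024, encoding="utf-8"):
        self.size = size
        self.encoding = encoding          # e.g. "ascii", "big5" or "gb2312"
        self.hash_area = [None] * size    # each slot: (feature_string, set_of_group_ids)
        self.collision_area = []          # records whose home slot was already taken

    def _address(self, feature):
        # Hypothetical hash function: sum of the encoded bytes modulo the table size.
        return sum(feature.encode(self.encoding)) % self.size

    def add(self, feature, group_id):
        addr = self._address(feature)
        slot = self.hash_area[addr]
        if slot is None:
            self.hash_area[addr] = (feature, {group_id})
        elif slot[0] == feature:
            slot[1].add(group_id)
        else:
            # Home slot holds a different string: fall back to the collision area.
            for stored, groups in self.collision_area:
                if stored == feature:
                    groups.add(group_id)
                    return
            self.collision_area.append((feature, {group_id}))

    def lookup(self, feature):
        """Return the set of group ids associated with a feature string."""
        slot = self.hash_area[self._address(feature)]
        if slot is not None and slot[0] == feature:
            return slot[1]
        for stored, groups in self.collision_area:  # collision processing
            if stored == feature:
                return groups
        return set()
```

The linear scan of the collision area is the simplest possible collision handling; chaining or open addressing would serve equally well.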
Those skilled in the art will appreciate that the hash table, hash function, and collision processing may all be suitably modified and applied to the invention. In step S325, it is determined whether all extracted abnormal feature strings have been processed; if so, step S331 is performed, otherwise the process returns to step S321.

In step S331, the abnormal feature strings of the e-mail are compared for similarity with the feature strings in each retrieved group, yielding a similarity value between the e-mail and each retrieved group. In step S341, it is determined whether the abnormal feature strings of the e-mail are similar to those of one of the retrieved groups; if so, step S351 is performed, otherwise step S343 is performed. In this embodiment, Bayesian classification may be used for steps S331 and S341. The Bayesian classification method takes the abnormal feature strings extracted from the e-mail as input, computes the conditional probability that they belong to each group, and from these derives a similarity value between the e-mail and each retrieved group.
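One way to turn per-group conditional probabilities into a similarity value is a naive Bayes posterior. The sketch below is an assumption about how step S331 might be realized, not the patent's stated formula: the function name `bayes_similarity`, the per-group occurrence counts, the Laplace smoothing, and the prior based on each group's share of stored feature occurrences are all illustrative choices.

```python
import math


def bayes_similarity(features, groups):
    """Return {group_id: posterior probability} for the extracted feature strings.

    `groups` maps a group id to a dict {feature_string: occurrence_count}.
    """
    if not groups:
        return {}
    vocabulary = {f for counts in groups.values() for f in counts} | set(features)
    total_all = sum(sum(counts.values()) for counts in groups.values())
    log_scores = {}
    for gid, counts in groups.items():
        total = sum(counts.values())
        # Assumed prior: this group's smoothed share of all stored occurrences.
        score = math.log((total + 1) / (total_all + len(groups)))
        for f in features:
            # Laplace-smoothed conditional probability P(feature | group).
            score += math.log((counts.get(f, 0) + 1) / (total + len(vocabulary)))
        log_scores[gid] = score
    # Normalise in log space so the returned values sum to 1.
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    return {gid: math.exp(s - peak) / norm for gid, s in log_scores.items()}
```

Because the posterior is normalized over the retrieved groups only, a single retrieved group always scores 1.0; a practical design would probably add a background class for "none of the groups" or use an unnormalized score instead.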
Next, the similarity values of the groups are compared, and the group that satisfies the initial setting condition and has the greatest similarity is taken as the similar group. If no group's similarity value satisfies the initial setting condition, the e-mail is not similar to any group. Those skilled in the art will appreciate that, besides Bayesian classification, various other similarity analysis methods combined with criteria for judging similarity may be used to implement steps S331 and S341.

In step S343, a normal e-mail handling procedure is executed. In this step, the e-mail may be left on the receiving server, or it may be delivered directly to the user through various channels (for example short messages or instant messages). In step S351, the cumulative count of the group similar to the e-mail is incremented by one. In step S353, it is determined whether the cumulative count of the similar group exceeds an initial setting value; if so, step S355 is performed, otherwise step S343 is performed. In step S355, a spam handling procedure is executed, so that the user does not receive the e-mail.

FIG. 4 is a schematic diagram of a computer-readable storage medium for spam filtering based on abnormal features according to an embodiment of the invention. The storage medium 40 stores a computer program 420 that implements the spam filtering method based on abnormal features described above. The methods and systems of the invention, or certain aspects or portions thereof, may take the form of program code embodied in tangible media such as floppy disks, optical discs, hard disks, or any other machine-readable (for example computer-readable) storage medium, wherein, when the program code is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
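Returning to the flow of FIG. 3, steps S341 to S355 amount to a two-threshold decision: first on similarity, then on the cumulative count of the most similar group. A minimal sketch follows; the function name `classify` and the default threshold values are hypothetical, since the text only speaks of "initial setting values".

```python
def classify(similarities, counts, sim_threshold=0.8, count_threshold=3):
    """Apply the two thresholds of steps S341 to S355.

    similarities: {group_id: similarity value} from step S331.
    counts: {group_id: cumulative e-mail count}; updated in place.
    Returns "spam" or "normal".
    """
    if not similarities:
        return "normal"                               # nothing to compare against
    best = max(similarities, key=similarities.get)    # most similar group
    if similarities[best] <= sim_threshold:           # S341: no similar group
        return "normal"                               # S343
    counts[best] = counts.get(best, 0) + 1            # S351
    if counts[best] > count_threshold:                # S353
        return "spam"                                 # S355
    return "normal"                                   # S343
```

With this logic an e-mail is treated as spam only after enough similar e-mails have accumulated in the same group, which reflects the bulk nature of spam described in the prior-art section.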
The methods and apparatus of the invention may also be transmitted as program code over a transmission medium such as electrical wire or cable, optical fiber, or any other form of transmission, wherein, when the program code is received, loaded, and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processing unit, the program code combines with the processor to provide a unique apparatus that operates analogously to application-specific logic circuits.

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection is therefore defined by the appended claims.
[Brief Description of the Drawings]
FIG. 1 is a hardware architecture diagram of a spam filtering system based on abnormal features according to an embodiment of the invention;
FIGS. 2 and 3 are flowcharts of a spam filtering method based on abnormal features according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a computer-readable storage medium for spam filtering based on abnormal features according to an embodiment of the invention.

[Description of Main Reference Numerals]
10 ~ spam filtering system based on abnormal features;
11 ~ processing unit;
12 ~ memory;
13 ~ storage device;
14 ~ output device;
15 ~ input device;
16 ~ communication device;
17 ~ bus;
S21, S23, S25, S27 ~ operation steps;
S311, S313, ..., S353, S355 ~ operation steps;
40 ~ storage medium;
420 ~ spam filtering computer program based on abnormal features.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW94125105A TWI287720B (en) | 2005-07-25 | 2005-07-25 | Junk mail filtering systems and methods based on abnormal features in e-mails |
Publications (2)
Publication Number | Publication Date |
---|---|
TW200705215A (en) | 2007-02-01 |
TWI287720B (en) | 2007-10-01 |
Family
ID=39201749
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW94125105A TWI287720B (en) | 2005-07-25 | 2005-07-25 | Junk mail filtering systems and methods based on abnormal features in e-mails |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI287720B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI505112B (en) * | 2014-01-06 | 2015-10-21 | Openfind Information Technology Inc | E-mail server-side profile filtering method |
- 2005-07-25: TW application TW94125105A, patent TWI287720B (en); status: not active, IP right cessation
Also Published As
Publication number | Publication date |
---|---|
TW200705215A (en) | 2007-02-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |