TW201224789A - A method for sorting the spam mail - Google Patents

A method for sorting the spam mail

Info

Publication number
TW201224789A
Authority
TW
Taiwan
Prior art keywords
spam
mail
weight coefficient
distance
probability
Prior art date
Application number
TW099141834A
Other languages
Chinese (zh)
Other versions
TWI457767B (en)
Inventor
Shi-Jinn Horng
Jia-Chiun Wang
Original Assignee
Univ Nat Taiwan Science Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Taiwan Science Tech
Priority to TW099141834A
Publication of TW201224789A
Application granted
Publication of TWI457767B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method for sorting spam mail. Whether a mail is spam is determined according to the number of times keywords appear in a plurality of mails of a database and the distance between the keywords. Unlike the prior art, the method first captures a plurality of keywords in accordance with the database, then calculates a correlation coefficient and obtains a plurality of eigen-terms in accordance with the database, then calculates a standard probability with the Naive Bayes algorithm in accordance with the correlation coefficient, then obtains an amended probability by multiplying the correlation coefficient by the standard probability, and finally determines whether the mail is spam in accordance with the amended probability and a first threshold. With the present invention, a user with limited resources can quickly and precisely determine whether incoming mails are spam and should be filtered.

Description

VI. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a method for distinguishing spam mail, and in particular to a method that determines whether a subject mail is spam according to the number of times keywords appear in a plurality of mails of a data set and the spacing between keywords.

[Prior Art]

The Internet has entered people's lives, and living habits have changed as a result. The greatest advantages of the network are that it is inexpensive, convenient, fast, and wide in delivery range. The e-mail system (electronic mail), which transmits over the network, has therefore long since replaced handwritten letters delivered by a postman.

As e-mail became widely accepted and used, the flood of spam and the issues it raises drew broad attention and discussion. Most spam is commercial advertising, but a good deal of spam is forwarded and spread in bulk by computers infected with viruses. The former merely increases network traffic and inconveniences users, and is relatively harmless; the latter may paralyze or crash mail servers, and may even give hackers more opportunities to invade hosts.

Conventional automatic e-mail filtering systems mainly rely on algorithms such as string matching, N-gram word segmentation, feature selection, term frequency, term frequency-inverse document frequency (TF-IDF), chi-square (Chi-Square) and Markov feature extraction, and use the resulting statistics to decide whether an e-mail is spam.

To evade these filtering methods, spammers have turned to hiding advertising information in images, using shortened URLs in place of the real website address, altering the mail content, or randomly inserting a certain amount of meaningless text before and after the actual body.

In view of the above, how to identify spam quickly and accurately with limited resources, and to filter it, is the problem the present invention seeks to solve.

[Summary of the Invention]

In view of this, one aspect of the present invention is to provide a method for distinguishing spam mail.

According to an embodiment, the present invention provides a method for distinguishing spam mail, used to determine whether a subject mail is spam. The method comprises the following steps:

(S1) extracting a plurality of keywords from the subject mail according to a data set;
(S2) calculating a correlation weight coefficient and obtaining a plurality of feature words (eigen-terms) according to the number of times the keywords appear in the mails of the data set;
(S3) calculating a standard probability with the Naive Bayes algorithm according to the correlation weight coefficient;
(S4) multiplying the correlation weight coefficient by the standard probability to obtain an amended probability;
(S8) determining whether the subject mail is spam according to the amended probability and a first threshold.

In practice, step (S2) may comprise the following sub-steps:

(S21) counting the number of times the plurality of keywords appear in the subject mail;
(S22) forming a first matrix according to the number of times the keywords appear in the subject mail and in the other mails of the data set;
(S23) calculating, according to the first matrix, a plurality of correlation coefficients of the keywords with respect to the other mails of the data set;
(S24) forming a second matrix according to the plurality of correlation coefficients;
(S25) determining whether each correlation coefficient in the second matrix is smaller than a second threshold; if it is, setting that correlation coefficient to zero; if it is not, designating the keyword corresponding to that correlation coefficient as a feature word, so as to form a third matrix;
(S26) calculating the correlation weight coefficient according to the correlation coefficients of the third matrix.

In practice, the method may further comprise the following steps between step (S4) and step (S8):

(S5) defining a maximum spacing according to the positions at which the feature words appear in the subject mail, the maximum spacing being the largest spacing between the feature words in the subject mail;
(S6) determining whether the mutual distances between the feature words are smaller than the maximum spacing and, if so, obtaining a distance weight coefficient according to those mutual distances that are smaller than the maximum spacing;
(S7) multiplying the amended probability by the distance weight coefficient to update the amended probability.

In addition, in practice step (S8) further comprises the following sub-step:

(S81) determining whether the amended probability is greater than the first threshold; if so, the subject mail is not spam; if not, the subject mail is spam.

Accordingly, compared with the prior art, the present invention discloses a method for distinguishing spam mail that determines whether a subject mail is spam according to the number of times keywords appear in the mails of a data set and the spacing between keywords, thereby improving the system's ability to discriminate spam.
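The way steps (S4), (S7) and (S81) fit together can be illustrated with a minimal Python sketch. The function name `classify_mail` and its parameters are hypothetical and do not come from the patent. The decision rule follows the wording of sub-step (S81) literally, so an amended probability above the first threshold is labelled "not spam".

```python
def classify_mail(standard_p, corr_weight, first_threshold, dist_weight=None):
    """Return (amended probability, label) for one subject mail.

    standard_p      -- standard probability from step (S3)
    corr_weight     -- correlation weight coefficient from step (S2)
    first_threshold -- user-chosen first threshold
    dist_weight     -- optional distance weight coefficient from step (S6)
    """
    # Step (S4): amended probability = standard probability x correlation weight.
    amended = standard_p * corr_weight

    # Step (S7), optional: update the amended probability with the distance weight.
    if dist_weight is not None:
        amended *= dist_weight

    # Sub-step (S81), as worded in the text: greater than the threshold -> not spam.
    label = "not spam" if amended > first_threshold else "spam"
    return amended, label
```

Because the weights are products of factors of the form (1 + coefficient), the amended value is a score rather than a probability bounded by 1, so the first threshold has to be chosen on that scale.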

[Embodiments]

So that the present invention may be described more clearly, reference is made to the following detailed description and the examples it includes, from which the invention can be more readily understood.

This specification states only the essential elements of the invention and serves only to illustrate one of its possible embodiments; the description should not limit the scope of the technical essence claimed. Unless the specification expressly excludes the possibility, the invention is not limited to a particular method, process, function or means. It should also be understood that what is described here is merely a possible embodiment of the invention; in practicing or testing the invention, any method, process, function or means similar or equivalent to those described in this specification may be used.

Unless otherwise defined, all technical and scientific terms used in this specification have the meanings commonly understood by those skilled in the art. Methods and means similar or equivalent to those described here may be used, and those described here are only examples. Where this specification refers to a value being "above" or "below" a number, the number itself is included. It should also be understood that this specification discloses methods and processes together with the structures related to them.

The present invention discloses a method for distinguishing spam mail, which determines whether a subject mail is spam according to the number of times keywords appear in the mails of a data set and the spacing between keywords.

Please refer to Fig. 1, which is a flow chart of a method for distinguishing spam mail according to an embodiment of the invention. As shown in Fig. 1, the method is used to determine whether a subject mail is spam and comprises the following steps: (S1) extracting a plurality of keywords from the subject mail according to a data set; (S2) calculating a correlation weight coefficient and obtaining a plurality of feature words according to the number of times the keywords appear in the mails of the data set; (S3) calculating a standard probability with the Naive Bayes algorithm according to the correlation weight coefficient; (S4) multiplying the correlation weight coefficient by the standard probability to obtain an amended probability; (S8) determining whether the subject mail is spam according to the amended probability and a first threshold.

The subject mail referred to in step (S1) may be an article of any kind, and the article may in turn consist of several paragraphs or of sentences of differing length. Because text content differs in its source and input method, the text may contain noise. The noise mainly comes from information unrelated to the meaning of the text, such as headings, page numbers, layout, encoding, font size and font color. To prevent unnecessary misreadings that would affect the accuracy of the statistics, the text may first be converted to a standard format by a program or other means before it is processed. In this embodiment, the subject article is standardized in this way before the keywords are extracted.

Step (S1) extracts the plurality of keywords in the subject mail using a word-segmentation analysis method (N-gram analysis). The method presets a predetermined length for the extracted string and cuts the text content into extracted strings of that length. Chinese words are mostly two-character combinations, and phrases of more than one character are in practice also basic units of meaning, so the predetermined length is a natural number greater than or equal to 2. By way of example, the predetermined length is set to two characters (bi-gram); the invention is not limited to bi-grams, and the length may also be 3, 4 and so on.

Each character of the subject article is combined with the characters that follow it, up to the predetermined length, to form an extracted string. For example, take the sentence 『小明有很多朋友。』 ("Xiao Ming has many friends."). Since the string length is two, the sentence is broken into the following extracted strings: 小明, 明有, 有很, 很多, 多朋 and 朋友. After the extracted strings have been cut from the article, conventional statistical analysis selects the plurality of keywords, which are gathered into the data set used by the invention.

An analysis of mail content shows that spammers modify the content of a mail in order to confuse the mail classifier and pass its filter. In English mail, for example, the lowercase letter "o", the uppercase letter "O" and the digit "0" are substituted for one another. Inserting text in other languages, or meaningless text, before and after the main content is another common technique, and it does allow tampered spam to escape the mail classifier.

Such spam nevertheless still contains its representative words. Therefore, if the most representative feature words in a mail are selected first, and their frequency of occurrence, their correlation and their positions relative to one another are then taken into the calculation, the confusion that tampered spam causes in the classifier can be effectively avoided.

After the plurality of keywords in the subject mail have been extracted, the invention proceeds to step (S2): obtaining a correlation weight coefficient and a plurality of feature words according to the number of times the keywords appear in the mails of the data set. Step (S2) comprises several sub-steps. Sub-step (S21) counts the number of times the plurality of keywords appear in the subject mail. Sub-step (S22) then forms a first matrix according to the number of times the keywords appear in the subject mail and in the other mails of the data set.

Please refer to Fig. 2, which depicts the first matrix according to an embodiment of the invention. As depicted in Fig. 2, the columns correspond to the first, second and i-th keywords extracted from the data set, each row corresponds to one mail, and X_{m,i} is the entry (occurrence count) for the i-th keyword in the m-th mail.

Sub-step (S23) then calculates, from the first matrix, the correlation coefficients of the keywords.
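Step (S1) and sub-steps (S21) and (S22) can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `extract_ngrams` and `first_matrix` are hypothetical, punctuation is treated as a segment boundary (the text's example drops the final 「。」 but does not say how punctuation is handled), and the keyword list is taken as given rather than selected statistically.

```python
import re
from collections import Counter

# Assumption: whitespace and common Chinese/English punctuation split the text
# into segments, and no extracted string crosses a segment boundary.
_SEPARATORS = re.compile(r"[\s。，、！？；：「」『』.,!?;:]+")


def extract_ngrams(text, n=2):
    """Cut text into overlapping strings of n characters (bi-grams by default)."""
    grams = []
    for segment in _SEPARATORS.split(text):
        grams.extend(segment[i:i + n] for i in range(len(segment) - n + 1))
    return grams


def first_matrix(mails, keywords, n=2):
    """Build the first matrix: row m, column i holds X[m][i], the number of
    times keyword i occurs in mail m (all keywords assumed to have length n)."""
    rows = []
    for mail in mails:
        counts = Counter(extract_ngrams(mail, n))
        rows.append([counts[keyword] for keyword in keywords])
    return rows
```

Applied to the sentence used in the text, `extract_ngrams("小明有很多朋友。")` returns the six strings 小明, 明有, 有很, 很多, 多朋 and 朋友.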

In the formula for the correlation coefficient, N_d denotes the number of mails in the data set. The calculation of the correlation coefficient can be found in the prior art and is therefore not repeated here.

After sub-step (S23) has calculated the correlation coefficients, sub-step (S24) forms a second matrix according to the plurality of correlation coefficients. Please refer to Fig. 3, which depicts the second matrix according to an embodiment of the invention; the values in the second matrix correspond to the entries of the first matrix.

Sub-step (S25) then determines whether each correlation coefficient in the second matrix is smaller than a second threshold. If a correlation coefficient is smaller than the second threshold, it is set to zero. If it is not smaller than the second threshold, it is left unchanged and the keyword corresponding to that correlation coefficient is designated a feature word, so as to form a third matrix. Please refer to Fig. 4, which depicts the third matrix according to an embodiment of the invention. The second threshold can be set freely according to the mail-filtering level the user requires.

When the third matrix is complete, sub-step (S26) calculates the correlation weight coefficient according to the correlation coefficients of the third matrix. Before the correlation coefficients are substituted into the formula, they are sorted by magnitude and the top N feature words are selected. The selected N feature words are then used to obtain the correlation weight coefficient.
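Sub-step (S25) and the top-N selection can be sketched as follows. The function names `third_matrix` and `select_feature_words` are hypothetical, the matrices are plain lists of rows (one row per mail, one column per keyword), and ranking a single mail's row in descending order is an assumption, since the text does not spell out the ordering in detail.

```python
def third_matrix(second, second_threshold):
    """Sub-step (S25): zero every correlation coefficient below the threshold,
    leaving the others unchanged."""
    return [
        [r if r >= second_threshold else 0.0 for r in row]
        for row in second
    ]


def select_feature_words(keywords, coefficients, second_threshold, top_n):
    """Pick up to top_n feature words for one mail.

    keywords     -- keyword for each column
    coefficients -- that mail's row of the second matrix
    Returns (keyword, coefficient) pairs, largest coefficient first.
    """
    kept = [
        (keyword, r)
        for keyword, r in zip(keywords, coefficients)
        if r >= second_threshold          # these keywords become feature words
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]
```

With keywords a to d, coefficients 0.1, 0.5, 0.3 and 0.05, and a second threshold of 0.2, only b and c survive as feature words.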

The correlation weight coefficient is calculated as:

$$P_c = \prod_{n=1}^{N} \left(1 + R_{m,n}\right)$$

In the above formula, P_c is the correlation weight coefficient and R_{m,n} is the correlation coefficient of the n-th keyword in the m-th mail. With this, the correlation weight coefficient and the plurality of feature words have been obtained.

The method then proceeds to step (S3), which calculates a standard probability with the Naive Bayes algorithm. The calculation uses the frequency with which the above feature words occur in the whole data set. The standard probability is the probability that the subject mail is classified as spam, given how the feature words are distributed between spam and non-spam mails. The Naive Bayes algorithm is a known technique and is not described further here.

Please refer to Fig. 5, which depicts the frequency of occurrence of the feature words in the data set according to an embodiment of the invention. As shown there, the standard probability (P_spam) is calculated as:

$$P_{spam} = \frac{0.4 \times 0.2 \times 0.6}{0.4 \times 0.2 \times 0.6 + 0.6 \times \cdots}$$

[the remaining factors of the denominator and the resulting value are illegible in the source text]

After the standard probability has been obtained, step (S4) multiplies the correlation weight coefficient by the standard probability to obtain an amended probability (P'_spam):

$$P'_{spam} = P_{spam} \times P_c$$

After the amended probability has been obtained, step (S5) defines a maximum spacing according to the positions at which the feature words appear in the subject mail; the maximum spacing is the largest spacing between the feature words in the subject mail. Step (S6) then determines whether the mutual distances between the feature words are smaller than the maximum spacing; if so, a distance weight coefficient is calculated from those mutual distances that are smaller than the maximum spacing.
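Steps (S3) and (S4) can be sketched as follows. This is a generic two-class Naive Bayes calculation, not the patent's exact procedure; the function names and the numbers in the usage note are illustrative assumptions and are not taken from Fig. 5. The correlation weight is assumed to be the product of (1 + R) over the selected feature words.

```python
import math


def standard_probability(prior_spam, spam_likelihoods, ham_likelihoods):
    """Naive Bayes posterior that a mail is spam.

    prior_spam       -- fraction of spam mails in the data set
    spam_likelihoods -- P(feature word | spam) for each feature word
    ham_likelihoods  -- P(feature word | non-spam) for each feature word
    Assumes at least one of the two class scores is non-zero.
    """
    spam_score = prior_spam * math.prod(spam_likelihoods)
    ham_score = (1.0 - prior_spam) * math.prod(ham_likelihoods)
    return spam_score / (spam_score + ham_score)


def correlation_weight(coefficients):
    """Assumed form of P_c: the product of (1 + R) over the feature words."""
    return math.prod(1.0 + r for r in coefficients)


def amended_probability(standard_p, coefficients):
    """Step (S4): standard probability multiplied by the correlation weight."""
    return standard_p * correlation_weight(coefficients)
```

For instance, with a spam prior of 0.5, spam likelihoods 0.8 and 0.6 and non-spam likelihoods 0.2 and 0.3, the standard probability is 0.24 / (0.24 + 0.03), about 0.889; correlation coefficients of 0.5 and 0.3 give a weight of 1.95.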

The distance weight coefficient is calculated as:

$$P_d = \prod_{m} \prod_{n} \left(1 + \frac{D_{max} - \left|D_m - D_n\right|}{D_{max}}\right)$$

In the above formula, D_max is the maximum spacing and P_d is the distance weight coefficient. Please refer to Fig. 6, which depicts the positions at which the feature words appear in the data set according to an embodiment of the invention. Substituting the data of Fig. 6 into the above formula gives the distance weight coefficient P_d (the value is only partly legible in the source text and begins 1.6).

After the distance weight coefficient has been obtained, step (S7) multiplies the amended probability by the distance weight coefficient to update the amended probability.
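Steps (S5) and (S6) can be sketched as follows, assuming the distance weight has the form P_d = ∏ (1 + (D_max − d) / D_max) taken over pairs of feature words whose distance d is smaller than D_max. The function name, the use of one position per feature word and the choice to count each unordered pair once are assumptions made for illustration.

```python
from itertools import combinations


def distance_weight(positions):
    """Distance weight coefficient for one mail.

    positions -- one position (for example a character offset) per feature word
    """
    # Mutual distance of every unordered pair of feature words.
    distances = [abs(a - b) for a, b in combinations(positions, 2)]
    if not distances:
        return 1.0                      # fewer than two feature words

    # Step (S5): the maximum spacing is the largest mutual distance.
    d_max = max(distances)
    if d_max == 0:
        return 1.0                      # all feature words share one position

    # Step (S6): only distances smaller than the maximum spacing contribute.
    weight = 1.0
    for d in distances:
        if d < d_max:
            weight *= 1.0 + (d_max - d) / d_max
    return weight
```

For positions 0, 2 and 10 the mutual distances are 2, 8 and 10, so D_max is 10 and the weight is (1 + 8/10) × (1 + 2/10) = 2.16; feature words that sit closer together raise the weight more.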

The updated amended probability is calculated as:

$$P'_{spam} = P_{spam} \times P_c \times P_d$$

In the above formula, P'_spam is the amended probability, P_spam is the standard probability, P_c is the correlation weight coefficient and P_d is the distance weight coefficient.

After the amended probability has been obtained, step (S8) determines whether the amended probability is greater than a first threshold; if so, the subject mail is not spam; if not, the subject mail is spam. The first threshold can be set freely according to the mail-filtering level the user requires.

In practice, steps (S5), (S6) and (S7) are not essential. Depending on the filtering accuracy required, the user may proceed directly from step (S4) to step (S8) to determine whether the subject mail is spam.

Please refer to Figs. 7A and 7B together. Fig. 7A is a table comparing the effect of the method of the invention with other known Chinese mail classifiers, and Fig. 7B is a table comparing the effect of the method with other known Chinese and English mail classifiers. The comparison refers to a paper whose title is only partly legible in the source text ("... A Study of ... Filtering Schemes ...").

The above detailed description of preferred embodiments is intended to describe the features and spirit of the invention more clearly, and not to limit the scope of the invention to the disclosed embodiments. On the contrary, the intent is to cover various changes and equivalent arrangements within the scope of the claims sought. The scope of the claims should therefore be given the broadest interpretation based on the above description, so that it covers all possible changes and equivalent arrangements.

[Brief Description of the Drawings]

Fig. 1 is a flow chart of a method for distinguishing spam mail according to an embodiment of the invention.
Fig. 2 is a schematic diagram of the first matrix according to an embodiment of the invention.
Fig. 3 is a schematic diagram of the second matrix according to an embodiment of the invention.
Fig. 4 is a schematic diagram of the third matrix according to an embodiment of the invention.
Fig. 5 is a schematic diagram of the frequency of occurrence of the feature words in the data set according to an embodiment of the invention.
Fig. 6 is a schematic diagram of the positions at which the feature words appear in the data set according to an embodiment of the invention.
Fig. 7A is a table comparing the effect of the method of the invention with other known Chinese mail classifiers.
Fig. 7B is a table comparing the effect of the method of the invention with other known Chinese and English mail classifiers.

[Description of Main Reference Symbols]

S1 to S4, S8: process steps

Claims (1)

VII. Scope of Patent Application:

1. A method for distinguishing spam mail, used to determine whether a subject mail is spam, the method comprising the following steps:
(S1) extracting a plurality of keywords from the subject mail according to a data set;
(S2) calculating and obtaining a correlation weight coefficient and a plurality of feature words according to the number of times the keywords appear in the mails of the data set;
(S3) calculating a standard probability with the Naive Bayes algorithm according to the correlation weight coefficient;
(S4) multiplying the correlation weight coefficient by the standard probability to obtain an amended probability; and
(S8) determining whether the subject mail is spam according to the amended probability and a first threshold.

2. The method of claim 1, wherein step (S2) comprises the following sub-steps:
(S21) counting the number of times the plurality of keywords appear in the subject mail;
(S22) forming a first matrix according to the number of times the keywords appear in the subject mail and in the other mails of the data set;
(S23) calculating, according to the first matrix, a plurality of correlation coefficients of the keywords with respect to the other mails of the data set;
(S24) forming a second matrix according to the plurality of correlation coefficients;
(S25) determining whether each correlation coefficient in the second matrix is smaller than a second threshold; if it is, setting that correlation coefficient to zero; if it is not, designating the keyword corresponding to that correlation coefficient as a feature word, so as to form a third matrix; and
(S26) calculating the correlation weight coefficient according to the correlation coefficients of the third matrix.

3. The method of claim 1, further comprising the following steps between step (S4) and step (S8):
(S5) defining a maximum spacing according to the positions at which the feature words appear in the subject mail, the maximum spacing being the largest spacing between the feature words in the subject mail;
(S6) determining whether the mutual distances between the feature words are smaller than the maximum spacing and, if so, obtaining a distance weight coefficient according to those mutual distances that are smaller than the maximum spacing; and
(S7) multiplying the amended probability by the distance weight coefficient to update the amended probability.

4. The method of claim 1, wherein step (S8) further comprises the following sub-step:
(S81) determining whether the amended probability is greater than the first threshold; if so, the subject mail is not spam; if not, the subject mail is spam.

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW099141834A TWI457767B (en) 2010-12-02 2010-12-02 A method for sorting the spam mail

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW099141834A TWI457767B (en) 2010-12-02 2010-12-02 A method for sorting the spam mail

Publications (2)

Publication Number Publication Date
TW201224789A true TW201224789A (en) 2012-06-16
TWI457767B TWI457767B (en) 2014-10-21

Family

ID=46725952

Family Applications (1)

Application Number Title Priority Date Filing Date
TW099141834A TWI457767B (en) 2010-12-02 2010-12-02 A method for sorting the spam mail

Country Status (1)

Country Link
TW (1) TWI457767B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9690797B2 (en) 2013-06-21 2017-06-27 Ubic, Inc Digital information analysis system, digital information analysis method, and digital information analysis program
CN112087444A (en) * 2020-09-04 2020-12-15 腾讯科技(深圳)有限公司 Account identification method and device, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200516484A (en) * 2003-10-27 2005-05-16 Softnext Technologies Co Ltd Filtering method for SPAM
CN1991879B (en) * 2005-12-29 2011-08-03 腾讯科技(深圳)有限公司 Filtration method of junk mail
US20080082658A1 (en) * 2006-09-29 2008-04-03 Wan-Yen Hsu Spam control systems and methods
CN101166159B (en) * 2006-10-18 2010-07-28 阿里巴巴集团控股有限公司 A method and system for identifying rubbish information
CN101594313A (en) * 2008-05-30 2009-12-02 电子科技大学 A kind of spam judgement, classification, filter method and system based on potential semantic indexing

Also Published As

Publication number Publication date
TWI457767B (en) 2014-10-21

Similar Documents

Publication Publication Date Title
Kramer An unobtrusive behavioral model of" gross national happiness"
US10509531B2 (en) Grouping and summarization of messages based on topics
Van Dalen et al. Signals in science-On the importance of signaling in gaining attention in science
US8103650B1 (en) Generating targeted paid search campaigns
JP5379138B2 (en) Creating an area dictionary
US20110153595A1 (en) System And Method For Identifying Topics For Short Text Communications
US20130246440A1 (en) Processing a content item with regard to an event and a location
US20120185544A1 (en) Method and Apparatus for Analyzing and Applying Data Related to Customer Interactions with Social Media
TW201033823A (en) Systems and methods for analyzing electronic text
US8401899B1 (en) Grouping user features based on performance measures
US8359238B1 (en) Grouping user features based on performance measures
Vijayakumar et al. A new method to identify short-text authors using combinations of machine learning and natural language processing techniques
Van Dalen et al. Demographers and their journals: Who remains uncited after ten years?
Er et al. User-level twitter sentiment analysis with a hybrid approach
Widiyaningtyas et al. Sentiment Analysis Of Hotel Review Using N-Gram And Naive Bayes Methods
Xie et al. Identifying features of source and message that influence the retweeting of health information on social media during the COVID-19 pandemic
JP6356268B2 (en) E-mail analysis system, e-mail analysis system control method, and e-mail analysis system control program
TW201224789A (en) A method for sorting the spam mail
Saleiro et al. Popstar at replab 2013: Name ambiguity resolution on twitter
JP2020521246A (en) Automated classification of network accessible content
Patel et al. Influence of Gujarati STEmmeR in supervised learning of web page categorization
CN110941759B (en) Microblog emotion analysis method
Rosenthal et al. Social proof: The impact of author traits on influence detection
Ganie et al. Sentiment analysis on the effect of trending source less News: special reference to the recent death of an Indian actor
CN108154382B (en) Evaluation device, evaluation method, and storage medium

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees