WO2007093661A1 - Method for classifying email messages as wanted mail and unwanted mail - Google Patents

Method for classifying email messages as wanted mail and unwanted mail

Info

Publication number
WO2007093661A1
Authority
WO
WIPO (PCT)
Prior art keywords
heuristic
words
filter
vocabulary
emails
Prior art date
Application number
PCT/ES2007/070026
Other languages
English (en)
Spanish (es)
Inventor
Mª DOLORES del CASTILLO SOBRINO
José Ignacio SERRANO MORENO
Salvador Ros Torrecillas
Original Assignee
Consejo Superior De Investigaciones Científicas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Consejo Superior De Investigaciones Científicas

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/107: Computer-aided management of electronic mailing [e-mailing]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking

Definitions

  • The present invention relates in general to a method for classifying email messages as spam or non-spam using a probabilistic filter, and in particular to a method that carries out said classification using a Bayesian filter without the need for prior training.
  • The fundamental objective of a classification system is twofold: on the one hand, to identify unwanted emails and filter them, that is, to obtain good coverage results, and, on the other hand, to avoid the incorrect classification of valid emails.
  • This last objective takes priority, since assigning the spam category to a valid email is a more serious error than failing to identify an invalid email as spam, because of the possible loss of useful information.
  • Characteristics refer to attributes of, for example, the message format, such as whether a word in the message is capitalized or whether the message contains a series of punctuation marks.
  • The present invention concerns a method for classifying email messages as spam or non-spam, of the type that uses a probabilistic filter to perform said classification, in particular a filter based on the Naïve Bayes method, or Bayesian filter.
  • The method comprises generating, without the intervention of the Bayesian filter, that is, outside the filter, and prior to the beginning of said classification, a group of heuristic patterns and heuristic words from the analysis of a series, or corpus, of historical mails, said heuristic words forming an initial heuristic vocabulary and at least morphological criteria being used to perform said analysis. Said heuristic patterns, which relate said morphological criteria to the heuristic words, are included in a database, and said database is supplied together with the initial heuristic vocabulary to said probabilistic filter, which is then able to start working.
  • The proposed method includes using the Bayesian filter to consult the heuristic patterns of said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search the mails to be classified for words that meet one or more of said morphological criteria, to assign to each word found the heuristic word related to the criteria that it meets, and to consult the heuristic vocabulary for the probabilities assigned to said heuristic words.
  • Said assignment, to each word found, of the heuristic word related to the criteria met is carried out by directly replacing the word in question with the heuristic word in the mail to be classified, for example replacing the word "ffffffutbol" with "Nosense" every time it appears in the mail. The probabilistic analysis of the mail then proceeds taking the word "Nosense" as its reference.
  • The method comprises using the Bayesian filter to search the mails to be classified for words that meet a morphological criterion related to a first heuristic pattern of the database consulted, and to assign the heuristic word related by said first pattern, together with the probabilities assigned to it in the heuristic vocabulary.
  • In a preferred embodiment, said heuristic patterns are ordered in the database according to the degree of spam probability assigned to the heuristic words that they relate, and the Bayesian filter performs said query following that order. Thus, if a word complies with said first pattern, which is associated with a higher spam probability than any second or third pattern (according to the order defined above), it is not necessary to check whether the word also complies with those further patterns.
  • The proposed method also includes progressively updating said probabilities of said heuristic words according to their influence on the result of the analysis of the mails classified by said probabilistic filter, increasing said probabilities if the influence of the associated heuristic word has led to a correct classification, or vice versa.
  • the proposed method includes using the filter so that it gradually learns while classifying.
  • The proposed method includes improving the classification capacity of said Bayesian filter through an interaction with the user, in which the user indicates to the Bayesian filter which emails have been correctly classified and/or which emails have been incorrectly classified, either each time the filter classifies emails or periodically.
  • The method includes progressively updating and enlarging said initial heuristic vocabulary supplied to said filter, which is initially formed only by heuristic words, while said Bayesian filter is in use: the mails incorrectly classified by the Bayesian filter are analyzed and, in a preferred embodiment, one or more words, heuristic and non-heuristic, obtained from said analysis are added to the initial heuristic vocabulary.
  • Said heuristic and non-heuristic words are obtained by analyzing the words contained in said incorrectly classified mails, again using at least said morphological criteria; this analysis is preferably performed by the Bayesian filter itself, and also preferably carried out.
  • The method comprises carrying out said updating of said heuristic vocabulary at least in part automatically, analyzing each email, either immediately or after a programmed period of time, once it has been classified by the probabilistic filter and validated by the user.
  • the method comprises carrying out said updating of said heuristic vocabulary at least in part at the request of a user, analyzing only a series of emails indicated by the user.
  • These refer to at least one of the following criteria: the length of each word analyzed, the order or sequence and type of the characters it contains, the adjacency of characters, or a combination thereof, said characters being consonants, vowels, numbers and/or symbols.
  • The proposed method comprises using morphological criteria that refer to the morphology of words in a plurality of different languages, so that the heuristic vocabulary described represents the words considered invalid because their morphology, that is, the way they are structured, is incorrect in any language.
  • Fig. 1 is a flow chart illustrating a series of actions to be performed according to an example of embodiment of the method proposed by the present invention.
  • In order to obtain said initial heuristic vocabulary, the method comprises using a finite state automaton to generate said group of heuristic patterns and heuristic words by carrying out said analysis according to one or more morphological criteria, so as to screen the words supplied to said automaton, separating them into valid words and invalid words according to the morphological criterion used by the automaton.
  • The method comprises building and training said finite state automaton by supplying it with words extracted from emails tagged as valid, in general by a user. Once trained, said automaton is capable of recognizing as valid the words that are well formed in terms of the sequence and type of the characters that compose them, the sequence and type of characters contained in the words being the morphological criteria used by the automaton in the present preferred embodiment.
  • The method includes supplying words extracted from emails tagged as invalid, in general by a user, and classifying as invalid those words not recognized as valid by the automaton. This produces the aforementioned screening, so that only the words not recognized by the automaton are later submitted to the analysis described above, which yields the corresponding heuristic words included in said initial heuristic vocabulary. Therefore, the finite state automaton used in the proposed method is adapted, once trained, to automatically identify correctly formed words (or tokens) and to differentiate them from those that are poorly formed, morphologically speaking. In other words, the automaton is able to recognize the grammar that describes well-formed words, understood as character sequences, for a plurality of different languages.
  • These characters can be a consonant ('C'), a vowel ('V'), a number ('n') or a symbol ('S').
  • The automaton is built from the words of said emails labeled as valid by means of an adapted ECGI grammatical inference algorithm, which makes it possible to obtain a set of examples of different well-formed words taken from a corpus of valid emails. These examples are supplied to the automaton during its training, so that it can use them as a reference for recognizing morphologically well-formed words.
  • The valid word "scientific-technical", presented as the string of terms "c v v c c v c v c v c v s c v c c v c v", will be recognized by the automaton once it has been trained. Each time the automaton recognizes a word, the word is considered to be well formed.
  • In one embodiment, the method comprises composing all the morphological criteria before they are used to carry out said analysis, without modifying them at any time; in another embodiment, the method comprises composing beforehand only one or more initial morphological criteria, and modifying them, or creating new morphological criteria, while said analysis is being performed and depending on the results obtained with it.
  • Said five relations could be represented by another number of heuristic patterns, not necessarily equal to five.
  • the method comprises registering the heuristic patterns obtained in said database.
  • the method comprises registering said morphological criteria related to said heuristic patterns in said database with the heuristic words obtained from them.
  • The proposed method comprises using the Bayesian filter to analyze the emails received on one or more of the following fields separately, according to a corresponding selection by the user: sender, subject and body. In an embodiment in which more than one field has been selected, the results of all the analyses performed are combined in a weighted way to obtain a final result.
  • In one embodiment, the method comprises generating a single initial heuristic vocabulary common to the subject and body parts of the mail and valid for both; in another embodiment, it comprises generating a separate heuristic vocabulary for each of said parts.
  • An example of non-heuristic words is the word "transplant", which has only two consonants, so it does not meet any of the morphological criteria related to the heuristic words, but which the filter can extract from an erroneously classified mail and add to said heuristic vocabulary.
  • When the filter receives the first mail to be classified, and although it preferably analyzes each part of the mail separately, consider the part or field corresponding to the subject: for each word it finds, the filter first checks whether the word is in the heuristic vocabulary, either the common one or the one associated with that part, that is, the subject field, depending on the embodiment.
  • the filter does not have the words: you need, transplant, heart
  • The next step the filter takes is to analyze whether each of these three words meets any criterion referenced by a corresponding heuristic pattern of the database.
  • The word heart would meet the morphological criterion of "Letter and/or Numbers and/or Consecutive Symbols in any order", and the heuristic word "LNL" would be assigned.
  • The filter calculates the SPAM and NO-SPAM probabilities of the subject, like any Bayesian filter, taking into account only the probabilities that the LNS metaword has been assigned a priori. If the mail is classified in one of the two categories and the user approves the result, the vocabulary is not updated with new words; only the probabilities of the words it contains are updated, since their use has led to a correct classification.
  • The filter incorporates into the subject vocabulary (either its own or the one common to the rest of the fields) the words "you need" and "transplant" with the corresponding SPAM and NO-SPAM probabilities.
  • The vocabulary of the subject part, or the one common to the subject and body parts if that is the case, would then be formed by {Nosense, Minsize, Cns, Raresymbs, Numbers, Metaword LNS, you need, transplant, LNL}.
  • Said extended vocabulary, formed by the initial heuristic words plus the heuristic and non-heuristic words added, will be the one that the Bayesian filter uses in later analyses. The more the vocabulary grows, the greater the precision of the filter, provided that the increase is not excessive, since an excessive increase would lengthen the classification time and therefore lower the efficiency of the filter.
  • Said vocabulary does not grow in vain, that is, the words incorporated into it are relevant for later use in the classifications carried out by the Bayesian filter, thus discarding the words that would contribute uselessly to vocabulary growth, increasing said classification time, while contributing practically nothing to the discriminatory power of the filter.
  • the method proposed by the present invention comprises carrying out an intelligent management thereof.
  • When the filter has to learn from classification errors, the vocabularies of the subject and body parts are updated with the words contained in the misclassified emails, whether they are false positives or false negatives, and the new words, both heuristic and non-heuristic, are assigned the corresponding spam and non-spam probabilities that the filter sets for new heuristic and non-heuristic words. If, in this case in which the filter learns from a poorly classified email, the email contains invalid words that conform to some criterion referenced by a respective heuristic pattern, these words are not added to the vocabulary; only the probabilities associated with the heuristic word of the pattern that relates to the criterion met by the invalid word are updated.
  • The initial heuristic vocabulary is empty, so the Bayesian filter begins to operate using said empty vocabulary to analyze the sender field of incoming mails, which produces a classification of said field as non-spam.
  • The weighted classification made by the filter for the three parts of the emails yields their final classification, which may be spam or non-spam. If the user indicates to the filter that the classification has been correct, the proposed method includes storing in the vocabulary of the sender field only the senders of the non-spam emails, with a probability associated with the non-spam category higher than that associated with the spam category.
  • When the filter has to learn because errors have occurred in the classification of emails, it stores only the email addresses, or senders, of the false positives, that is, of the emails that really are non-spam.
  • The senders of the false negatives, that is, of the emails erroneously assigned to the non-spam category, are ignored, since most unsolicited emails almost never come from the same sender and their inclusion would contribute unnecessarily to the growth of the vocabulary of this part.
  • The proposed method comprises selecting, by a user, the operating mode of said Bayesian filter from among the following three modes, the action performed on the emails received depending on the selection:
  • - mode 1, or blind trust mode, in which the emails that the filter identifies as spam are not delivered to the user,
  • - mode 2, or supervised mode, in which the emails that the filter detects as spam are delivered to the user with a character string, or label, included in the subject field of the mail, and
  • - mode 3, in which all emails are delivered to the user, regardless of whether or not they are spam.
  • The method comprises the user choosing which character string, or label, to include in said subject field of the mail.
  • The proposed method is applicable regardless of the mail client used, and includes adjusting the error bias of the Bayesian filter, that is, its tendency to classify emails incorrectly as spam or as non-spam, through a certain parameter modified by the user.
  • The proposed method comprises showing a user at least one of the following elements, or a combination thereof:
  • - a history or trace of the user's connections to the mail server, showing for each of them the date, time and number of emails received with the results of the filter, as well as possible specific connection errors,
  • - a list of the emails received along with the identification that the filter has made of each of them, on which the user can indicate to the filter its successes and mistakes so that it learns from them, it being possible to delete emails from the list so that the filter does not consider them again, and
  • - a configuration panel where the user can adjust certain parameters of the filtering and the connections, and
  • Fig. 1 shows a flow chart that illustrates the use of the Bayesian filter, both to classify incoming mails and to learn from its successes and errors, through a series of actions to be carried out according to the method proposed by the present invention, representative of an exemplary embodiment. Some of these actions have already been described above; they are carried out by the Bayesian filter once the initial heuristic vocabulary and the heuristic patterns have been created.
  • - Mailing list refers to the index of the emails shown to the user, that is, the subject and sender fields of the emails.
  • - List and temporary emails refers to the complete emails, that is, with the subject, sender and body fields.
  • the filter acts as an intermediary or proxy server with respect to the incoming mail server, for example POP (or POP3), at which one connects
  • POP or POP3
  • The filter analyzes the parts (subject, sender and body) of the extracted vocabulary and, combining them, determines whether the email is spam or not. It carries out said analysis by consulting the heuristic vocabulary and the database generated and supplied to it according to the proposed method, as indicated in Fig. 1 by the dashed arrow from the cylinder with the legend "Analysis data learned" described above.
  • The filter saves the vocabulary extracted from the mail temporarily and adds an associated entry to said temporary list, as shown by the corresponding dashed arrow in Fig. 1 directed to the element with the legend "list and received temporary emails". If the filter has detected that the mail received is spam, it acts according to the mode selected by the user: in mode 1, or "blind trust" mode, the filter deletes the mail from the server; in mode 3, the filter delivers the mail to the mail client; and in mode 2, or supervised mode, the filter includes a string in the mail subject, such as "this mail is spam". Therefore, it refers to the option to send the filter to learn
  • The data for subsequent analysis is updated, as indicated by the dashed arrow directed towards the cylinder with the legend "Analysis data learned".
  • The update based on the analysis of the incorrectly classified emails refers to extracting the non-heuristic words found in said emails and incorporating them into the heuristic vocabulary with a corresponding probability of spam or non-spam, as the case may be, as well as to updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic, if they exist, used to classify said emails, depending on their influence on said incorrect classifications.
  • This refers mainly to updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic, if they exist, used to classify these emails, depending on their influence on said correct classifications.
  • The filter deletes the content of the element with the legend "list and temporary emails received", that is, it deletes the temporary mail and its entry from the temporary list, since said temporary emails have already fulfilled their function, whether to supply the words to be incorporated into the heuristic vocabulary in order to enlarge, and therefore improve, it, or to adjust the probabilities of the words of said vocabulary. The filter then updates its statistics and goes back to standby until the user selects one of the two possible options described.
  • a person skilled in the art could introduce changes and modifications in the described embodiments without departing from the scope of the invention as defined in the appended claims.
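The character-class representation described above, in which each character of a word is treated as a consonant ('C'), a vowel ('V'), a number ('n') or a symbol ('S'), can be illustrated with a minimal Python sketch. The names `char_class` and `encode_word` and the vowel set are assumptions made for illustration only; the automaton inferred with the ECGI algorithm is not reproduced here.

```python
# Assumed vowel set (includes accented Spanish vowels); not specified in the text.
VOWELS = set("aeiouáéíóúü")


def char_class(ch: str) -> str:
    """Map one character to its class symbol: 'C', 'V', 'n' or 'S'."""
    if ch.lower() in VOWELS:
        return "V"
    if ch.isalpha():
        return "C"  # any other letter is treated as a consonant
    if ch.isdigit():
        return "n"
    return "S"      # everything else counts as a symbol


def encode_word(word: str) -> str:
    """Encode a word as the sequence of its character classes."""
    return "".join(char_class(ch) for ch in word)


if __name__ == "__main__":
    for word in ("heart", "v1agra!", "técnico"):
        print(word, "->", encode_word(word))
```

A recognizer for well-formed words would then operate on these class sequences rather than on the raw characters, which is what lets one automaton cover several languages.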
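The ordered consultation of heuristic patterns and the substitution of matching words by their heuristic word, as described above, might be sketched as follows. Only the heuristic word names "Nosense" and "LNS" and the "ffffffutbol" example come from the text; the two predicates, the probability values and the function names are hypothetical stand-ins for the morphological criteria stored in the database.

```python
import re
from typing import Callable, List, Optional, Tuple

# (heuristic word, assumed spam probability, morphological criterion)
Pattern = Tuple[str, float, Callable[[str], bool]]


def has_long_repeat(word: str) -> bool:
    """Assumed criterion: the same character four or more times in a row."""
    return re.search(r"(.)\1{3,}", word) is not None


def mixes_letters_and_others(word: str) -> bool:
    """Assumed criterion: letters mixed with digits or symbols."""
    has_letter = any(c.isalpha() for c in word)
    has_other = any(not c.isalpha() for c in word)
    return has_letter and has_other


# Patterns sorted by decreasing spam probability, so the first match wins
# and lower-ranked patterns never need to be checked.
PATTERNS: List[Pattern] = sorted(
    [
        ("LNS", 0.8, mixes_letters_and_others),
        ("Nosense", 0.9, has_long_repeat),
    ],
    key=lambda p: p[1],
    reverse=True,
)


def heuristic_word_for(word: str) -> Optional[str]:
    """Return the heuristic word of the first pattern the word satisfies."""
    for name, _probability, matches in PATTERNS:
        if matches(word):
            return name
    return None


def substitute(tokens: List[str]) -> List[str]:
    """Replace each matching token by its heuristic word; keep the rest."""
    return [heuristic_word_for(t) or t for t in tokens]


if __name__ == "__main__":
    print(substitute(["ffffffutbol", "v1agra", "hello"]))
```

Sorting once by probability keeps the lookup a simple linear scan that stops at the first hit, which matches the stated reason for ordering the patterns.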
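The probabilistic step, in which the filter computes spam and non-spam probabilities from the probabilities stored in the vocabulary, can be approximated with a generic naive Bayes calculation. This is a textbook sketch under stated assumptions, not the filter described in the patent: the vocabulary layout, the equal prior and the choice to skip unknown tokens are all assumptions.

```python
import math
from typing import Dict, Iterable, Tuple

# token -> (P(token | spam), P(token | non-spam)); values must lie in (0, 1)
Vocabulary = Dict[str, Tuple[float, float]]


def spam_probability(tokens: Iterable[str],
                     vocabulary: Vocabulary,
                     prior_spam: float = 0.5) -> float:
    """Posterior spam probability under the naive independence assumption."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for token in tokens:
        if token not in vocabulary:
            continue  # assumption: words outside the vocabulary are ignored
        p_spam, p_ham = vocabulary[token]
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    # Normalize in log space to avoid underflow on long mails.
    top = max(log_spam, log_ham)
    spam = math.exp(log_spam - top)
    ham = math.exp(log_ham - top)
    return spam / (spam + ham)


if __name__ == "__main__":
    vocab = {"LNS": (0.9, 0.1)}
    print(spam_probability(["LNS", "hello"], vocab))
    print(spam_probability([], vocab))
```

With an empty vocabulary the result equals the prior, so a rule such as "spam only if the probability exceeds 0.5" would yield non-spam, in line with the behavior described for the initially empty sender vocabulary.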
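The weighted combination of the per-field results (sender, subject and body) is not specified in detail in the text. A minimal sketch, assuming a normalized weighted average of per-field spam probabilities and hypothetical weight values, could look like this:

```python
from typing import Dict


def combine_fields(field_scores: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted average of the spam probabilities of the selected fields.

    Only the fields present in field_scores (those the user selected)
    contribute; the weighting scheme itself is an assumption.
    """
    total_weight = sum(weights[name] for name in field_scores)
    if total_weight <= 0:
        raise ValueError("weights of the selected fields must sum to a positive value")
    weighted_sum = sum(score * weights[name] for name, score in field_scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    scores = {"sender": 0.2, "subject": 0.9, "body": 0.7}
    weights = {"sender": 1.0, "subject": 2.0, "body": 2.0}  # hypothetical weights
    print(combine_fields(scores, weights))
```

Normalizing by the weights of the selected fields keeps the result in the range 0 to 1 even when the user analyzes only one or two fields.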
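The three operating modes described above determine what happens to a mail that the filter has flagged as spam. The following sketch illustrates that dispatch; the `Mail` and `Mode` types, the `handle` function and the placement of the label at the start of the subject are assumptions, while the example label text is the one quoted in the description.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Mode(Enum):
    BLIND_TRUST = 1   # mode 1: spam is not delivered
    SUPERVISED = 2    # mode 2: spam is delivered with a label in the subject
    DELIVER_ALL = 3   # mode 3: everything is delivered unchanged


@dataclass(frozen=True)
class Mail:
    sender: str
    subject: str
    body: str


def handle(mail: Mail, is_spam: bool, mode: Mode,
           label: str = "this mail is spam") -> Optional[Mail]:
    """Return the mail to deliver to the client, or None if it is withheld."""
    if not is_spam:
        return mail
    if mode is Mode.BLIND_TRUST:
        return None
    if mode is Mode.SUPERVISED:
        # Assumption: the user-chosen label is prefixed to the subject.
        return replace(mail, subject=f"{label} {mail.subject}")
    return mail


if __name__ == "__main__":
    mail = Mail("a@example.com", "offer", "text")
    print(handle(mail, True, Mode.SUPERVISED))
```

Returning `None` for withheld mail keeps the decision separate from the server-side deletion, which a proxy in front of the POP3 server would perform itself.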

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method comprising generating, outside the filter and at least before the start of the classification performed by a Bayesian filter, a group of heuristic patterns and heuristic words from the analysis of historical mails, said heuristic words forming an initial heuristic vocabulary and morphological criteria being used to perform this analysis, and including these heuristic patterns, which relate the morphological criteria to the heuristic words, in a database, said database being supplied together with the initial heuristic vocabulary to the Bayesian filter. The method also comprises progressively enlarging the initial heuristic vocabulary during use of the filter, through intelligent management that ensures highly effective and efficient classification, thus preventing the vocabulary from growing disproportionately with insignificant words and, consequently, the classification time from increasing unnecessarily.
PCT/ES2007/070026 2006-02-15 2007-02-05 Method for classifying email messages as wanted mail and unwanted mail WO2007093661A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ES200600349 2006-02-15
ESP200600349 2006-02-15

Publications (1)

Publication Number Publication Date
WO2007093661A1 true WO2007093661A1 (fr) 2007-08-23

Family

ID=38371213

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2007/070026 WO2007093661A1 (fr) 2006-02-15 2007-02-05 Method for classifying email messages as wanted mail and unwanted mail

Country Status (1)

Country Link
WO (1) WO2007093661A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6161130A (en) * 1998-06-23 2000-12-12 Microsoft Corporation Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set
US20040260776A1 (en) * 2003-06-23 2004-12-23 Starbuck Bryan T. Advanced spam detection techniques
US20050102366A1 (en) * 2003-11-07 2005-05-12 Kirsch Steven T. E-mail filter employing adaptive ruleset
US20050192992A1 (en) * 2004-03-01 2005-09-01 Microsoft Corporation Systems and methods that determine intent of data and respond to the data based on the intent
US20060015561A1 (en) * 2004-06-29 2006-01-19 Microsoft Corporation Incremental anti-spam lookup and update service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DEL CASTILLO M.D. ET AL.: "An interactive hybrid system for identifying and filtering unsolicited email", PROCEEDINGS. THE 2005 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE. IEEE COMPUTER. SOC. LOS ALAMITOS, NJ, pages 814 - 815, XP010841833 *

Similar Documents

Publication Publication Date Title
US10027611B2 (en) Method and apparatus for classifying electronic messages
US8065379B1 (en) Line-structure-based electronic communication filtering systems and methods
JP7372812B2 (ja) 会話に基づくチケットロギングのためのシステム及び方法
Bratko et al. Spam filtering using statistical data compression models
US8131655B1 (en) Spam filtering using feature relevance assignment in neural networks
Trivedi A study of machine learning classifiers for spam detection
Androutsopoulos et al. Learning to filter unsolicited commercial e-mail
US8527436B2 (en) Automated parsing of e-mail messages
US9699129B1 (en) System and method for increasing email productivity
US20070130368A1 (en) Method and apparatus for identifying potential recipients
CN103136266A (zh) Method and device for classifying mail
Almeida et al. Facing the spammers: A very effective approach to avoid junk e-mails
JP5056337B2 (ja) Information retrieval system
WO2007093661A1 (fr) Method for classifying email messages as wanted mail and unwanted mail
CN112988962B (zh) Text error correction method and apparatus, electronic device and storage medium
Tahsin et al. A novel approach for e-mail classification using fasttext
Gordillo et al. An HMM for detecting spam mail
Cesarini et al. A two level knowledge approach for understanding documents of a multi-class domain
Chang et al. Applying name entity recognition to informal text
JP4746083B2 (ja) Destination correctness determination system
Ha et al. Personalized email recommender system based on user actions
Sculley Advances in online learning-based spam filtering
Meng et al. Learning belief networks for language understanding
Méndez et al. Analyzing the impact of corpus preprocessing on anti-spam filtering software
JPWO2011048672A1 (ja) Data processing device, data processing method and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07712571

Country of ref document: EP

Kind code of ref document: A1