WO2007093661A1 - Method for sorting e-mail messages into wanted mail and unwanted mail - Google Patents

Method for sorting e-mail messages into wanted mail and unwanted mail

Info

Publication number
WO2007093661A1
Authority
WO
WIPO (PCT)
Prior art keywords
heuristic
words
filter
vocabulary
emails
Prior art date
Application number
PCT/ES2007/070026
Other languages
Spanish (es)
French (fr)
Inventor
Mª DOLORES del CASTILLO SOBRINO
José Ignacio SERRANO MORENO
Salvador Ros Torrecillas
Original Assignee
Consejo Superior De Investigaciones Científicas
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Consejo Superior De Investigaciones Científicas
Publication of WO2007093661A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking

Definitions

  • the present invention relates in general to a method for classifying email messages into wanted mail and unwanted mail, or spam, by using a probabilistic filter, and in particular to a method of carrying out said classification using a Bayesian filter without the need for prior training.
  • the fundamental objective of a classification system is twofold: to identify unwanted emails and filter them, that is, to obtain good coverage results, and, on the other hand, to avoid the incorrect classification of valid emails.
  • This last objective is a priority since the assignment of the spam category to a valid email is an error of a higher order than not identifying an invalid email as spam due to the consequences of possible loss of useful information.
  • these characteristics refer to attributes, for example, of the message format, such as whether a word in the message is capitalized, or whether the message contains a series of punctuation marks.
  • the present invention concerns a method for classifying email messages into wanted mail and unwanted mail, or spam, of the type based on the use of a probabilistic filter to perform said classification, in particular a filter based on the Naïve Bayes method, or Bayesian filter.
  • the method comprises generating, without the intervention of the Bayesian filter, that is, outside the filter, and prior to the beginning of said classification, a group of heuristic patterns and heuristic words, from the analysis of a series or corpus of historical mails, said heuristic words forming an initial heuristic vocabulary, using at least morphological criteria to perform said analysis, and including said heuristic patterns, which relate said morphological criteria to the heuristic words, in a database, said database being supplied together with the initial heuristic vocabulary to said probabilistic filter, from which it is able to start working.
  • the proposed method includes using the Bayesian filter to consult the heuristic patterns of said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search, in the mails to be classified, for words that meet one or more of said morphological criteria, assign to each word found the heuristic word related to the criteria met by said word, and consult the heuristic vocabulary to obtain the probabilities assigned to said heuristic words.
  • said assignment to each word found of the heuristic word related to the criteria met is carried out by directly replacing, in the mail to be classified, the word in question with the heuristic word, for example the word "ffffffutbol" with "Nosense", every time it appears in the mail, after which the probabilistic analysis of the mail proceeds taking the word "Nosense" as reference.
  • the method comprises using the Bayesian filter to search, in the mails to be classified, words that meet a criterion of said morphological criteria related to a first heuristic pattern of the database consulted, and assign the related heuristic word for said first pattern and the probabilities assigned in the heuristic vocabulary.
  • Said heuristic patterns are ordered in the database, for a preferred embodiment example, according to the degree of spam probability assigned to the respective heuristic words related by them, said query being performed by the Bayesian filter following said order, so that if a word complies with said first pattern, which is associated with a spam probability greater than that of a possible second or third pattern (according to the order defined above), it is not necessary to verify whether said word complies with said further patterns.
  • the proposed method also includes updating said probabilities of said heuristic words progressively based on their influence on the result of the analysis of the mails classified by said probabilistic filter, increasing said probabilities if the influence of the associated heuristic word has caused a correct classification, or vice versa.
  • the proposed method includes using the filter so that it gradually learns while classifying.
  • the proposed method includes improving the classification capacity of said Bayesian filter through an interaction with the user, in which the user indicates to the Bayesian filter which emails have been correctly classified and/or which emails have been incorrectly classified, either each time the filter classifies emails or periodically.
  • the method includes progressively updating and enlarging said initial heuristic vocabulary supplied to said filter, initially formed only by heuristic words, during the use of said Bayesian filter on the emails classified by the filter; the mails incorrectly classified by said Bayesian filter are analyzed and, for a preferred embodiment, one or more words, heuristic and non-heuristic, obtained from said analysis are added to the initial heuristic vocabulary.
  • said obtaining of said heuristic and non-heuristic words is carried out by means of an analysis of the words contained in said mails incorrectly classified also using at least said morphological criteria, analysis which is preferably performed by the Bayesian filter itself, and also preferably carried out.
  • the method comprises carrying out said updating of said heuristic vocabulary at least in part automatically, analyzing each email, immediately or every certain period of time programmed after being classified by the probabilistic filter and validated by the user.
  • the method comprises carrying out said updating of said heuristic vocabulary at least in part at the request of a user, analyzing only a series of emails indicated by the user.
  • these refer to at least one of the following criteria: length of each word analyzed, order or sequence and type of the characters contained, adjacency of characters, or a combination thereof, said characters being consonants, vowels, numbers and/or symbols.
  • the proposed method comprises using morphological criteria that refer to the morphology of words for a plurality of different languages, the described heuristic vocabulary thus representing the words considered invalid because their morphology, that is, the way they are structured, is incorrect in any language.
  • Fig. 1 is a flow chart illustrating a series of actions to be performed according to an example of embodiment of the method proposed by the present invention.
  • the method comprises, in order to obtain said initial heuristic vocabulary, the use of a finite state automaton to generate said group of heuristic patterns and heuristic words by carrying out said analysis according to one or more morphological criteria, in order to screen the words supplied to said automaton, separating them into valid words and invalid words according to said morphological criterion used by the automaton.
  • the method comprises building and training said finite state automaton by supplying it with words extracted from emails tagged, in general by a user, as valid, said automaton being, once trained, capable of recognizing as valid words those that are well formed in terms of the sequence and type of the characters that compose them, the sequence and type of characters contained in the words being the morphological criteria used by the automaton for the present preferred embodiment.
  • the method includes providing words extracted from emails tagged, in general by a user, as invalid, and classifying as invalid words those not recognized as valid by the automaton, thus producing the aforementioned screening, so that later only the words not recognized by the automaton are submitted to the above-described analysis that leads to obtaining the corresponding heuristic words included in said initial heuristic vocabulary. Therefore, the finite state automaton used for the application of the proposed method is adapted, once trained, to automatically identify correctly formed words (or tokens) and differentiate them from poorly formed ones, morphologically speaking. In other words, the automaton is able to recognize the grammar that describes the well-formed words, understood as character sequences, for a plurality of different languages.
  • these characters can be a consonant ('c'), a vowel ('v'), a number ('n') or a symbol ('s').
  • the construction of the automaton from the words of said emails labeled as valid is carried out by means of the use of an ECGI algorithm of adapted grammatical inference, by means of which it is possible to obtain a set of examples of different well-formed words, taken from a corpus of valid emails, which are those that are supplied to the automaton during its training, so that it can be used as a reference to be able to recognize morphologically well-formed words.
  • the valid word "scientific-technical", presented as the string of terms "c v v c c v c v c v c v s c v c c v c v", will be recognized by the automaton once trained for it. Each time the automaton recognizes a word, the word is considered to be well formed.
  • the method comprises composing all the morphological criteria prior to their use to carry out said analysis, not being modified at any time, and for another example of embodiment, the method comprises composing, previously, only one or more initial morphological criteria, and modifying them, or creating new morphological criteria, simultaneously with the performance of said analysis, and depending on the results obtained with it.
  • said five relations could be represented by another number of heuristic patterns not necessarily equal to five.
  • the method comprises registering the heuristic patterns obtained in said database.
  • the method comprises registering said morphological criteria related to said heuristic patterns in said database with the heuristic words obtained from them.
  • the proposed method comprises using the Bayesian filter to analyze the emails received on one or more of the following fields separately, based on a corresponding selection by the user: sender, subject and body, and, for an embodiment example in which more than one field has been selected, combining, in a weighted way, the results of all the analyses performed to obtain a final result.
  • the method comprises, for an embodiment example, generating a single initial heuristic vocabulary common to the subject and body parts and valid for said parts of the mail, or, for another embodiment example, generating a separate, different heuristic vocabulary for each of said parts.
  • an example of non-heuristic words is the word "transplant", which has only two consonants, so it does not meet any of the morphological criteria related to the heuristic words, but which the filter can extract from an erroneously classified mail and add to the cited heuristic vocabulary.
  • when the filter receives the first mail to be classified, and although it preferably analyzes each part of the mail separately, focusing on the part or field corresponding to the subject, for each word it finds it first checks whether the word is in the heuristic vocabulary, either the common one or the one associated with that part, that is, the subject field, depending on the embodiment example.
  • the filter does not have the words: you need, transplant, heart
  • the next step the filter takes is to analyze whether each of these three words meets any criterion referenced by a corresponding heuristic pattern of the database.
  • the word heart would meet the morphological criteria of "Letter and / or Numbers and / or Consecutive Symbols in any order" and the heuristic word "LNL" would be assigned.
  • the filter calculates the probability of SPAM and NO-SPAM of the subject, like any Bayesian filter, only taking into account the probabilities that the LNS Metaword has been assigned a priori. If the mail is classified in one of the two categories and the user gives the approval, the vocabulary is not updated in words but in the probabilities of the words it contains, since its use has caused that the classification has been correct.
  • the filter incorporates into the subject vocabulary (either its own or the one common to the rest of the fields) the words "you need" and "transplant" with the corresponding SPAM and NO-SPAM probabilities.
  • the vocabulary of the subject part, or the one common to the subject and body parts if that is the case, would then be formed by {Nosense, Minsize, Cns, Raresymbs, Numbers, Metaword LNS, you need, transplant, LNL}.
  • Said extended vocabulary, formed by the initial heuristic words plus the heuristic and non-heuristic words added, will be the one that the Bayesian filter uses in later analyses, so that the more the vocabulary grows the greater the precision of the filter, provided that the growth is not excessive, since this would cause a longer classification time and, therefore, a lower efficiency of the filter.
  • said vocabulary does not grow in vain, that is to say that the words that are incorporated therein are relevant for later use in the classifications carried out by the Bayesian filter, discarding thus the words that would contribute uselessly to vocabulary growth, increasing the mentioned classification time, and that would have a practically null contribution in the discriminatory potential of the filter.
  • the method proposed by the present invention comprises carrying out an intelligent management thereof.
  • When the filter has to learn from classification errors, the vocabularies of the subject and body parts are updated with the words contained in the misclassified emails, whether they are false positives or false negatives, and the new words, both heuristic and non-heuristic, are assigned the corresponding spam and non-spam probabilities that the filter sets for new heuristic and non-heuristic words. If, in this case in which the filter learns from a misclassified email, the email contains invalid words that conform to some criterion referenced by a respective heuristic pattern, these words are not added to the vocabulary; only the probabilities associated with the heuristic word of the pattern related to the criterion met by the invalid word are updated.
  • the initial heuristic vocabulary is empty, whereby the Bayesian filter begins to operate using said empty vocabulary to analyze the sender field of incoming mails, which produces a classification of said field as non-spam.
  • the weighted classification made by the filter for the three parts of the emails yields their final classification, which may be spam or non-spam. If the user indicates to the filter that the classification has been correct, the proposed method includes storing in the vocabulary of the sender field only the senders of the non-spam emails, with a probability associated with the non-spam category higher than that of the spam category.
  • When the filter has to learn because errors have occurred in the classification of emails, it stores only the email addresses, or senders, of the false positives, that is, of the emails that really are non-spam.
  • the senders of the false negatives, that is to say, of the emails erroneously assigned to the non-spam category, are ignored since most of the unsolicited emails almost never come from the same sender and their inclusion would contribute unnecessarily to the growth of the vocabulary of this part.
  • the proposed method comprises selecting the operating mode of said Bayesian filter, by a user, from among the following three modes, the action performed on received emails depending on the selection: - mode 1, or blind trust mode, whose selection causes the emails that the filter identifies as spam not to be delivered to the user,
  • - mode 2, or supervised mode, whose selection causes the emails that the filter detects as spam to be delivered to the user including a character string, or label, in the field related to the subject of the mail, and - mode 3, whose selection causes all emails to be delivered to the user, regardless of whether or not they are spam.
  • the method comprises choosing by the user, which character string, or label, to include in said field concerning the subject of the mail.
  • the proposed method is applicable regardless of the mail client used, and includes adjusting the error bias of the Bayesian filter toward emails incorrectly classified as spam or as non-spam, by the user modifying a certain parameter.
  • the proposed method comprises showing a user at least one of the following elements, or a combination thereof:
  • - a history or trace of the user's connections to the mail server, showing for each of them the date, time and number of emails received with the results of the filter, as well as possible specific connection errors, - a list of the emails received along with the identification that the filter has made of each of them, a list on which the user can indicate to the filter its successes and mistakes so that it learns from them, it being possible to delete emails from the list so that the filter does not consider them again, and - a configuration panel where the user can adjust certain filtering and connection parameters, and
  • in Fig. 1 a flow chart is shown that illustrates the use of the Bayesian filter, both to classify incoming mails and to learn from its successes and errors, through a series of actions to be carried out according to the method proposed by the present invention, representative of an exemplary embodiment, some of which have already been described above, said actions being carried out by the Bayesian filter once the initial heuristic vocabulary and the heuristic patterns have been created.
  • - Mailing list refers to the index of the emails shown to the user, that is, the subject and sender fields of the emails.
  • - List and temporary emails refers to the complete emails, that is, with the subject, sender and body fields.
  • the filter acts as an intermediary or proxy server with respect to the incoming mail server, for example POP (or POP3), at which one connects
  • the filter analyzes the parts (subject, sender and body) of the extracted vocabulary and determines whether the email is spam or not, combining them, carrying out said analysis by consulting the heuristic vocabulary and the database generated and supplied to it according to the proposed method, as indicated in Fig. 1 with the dashed arrow from the cylinder with the legend "Analysis data learned" described above.
  • the filter temporarily saves the vocabulary extracted from the mail and adds an associated entry to the aforementioned temporary list, as shown by the corresponding dashed arrow in Fig. 1 directed to the element with the legend "list and received temporary emails". If the filter has detected that the mail received is spam, the filter acts according to the mode selected by the user: in mode 1, or "blind trust" mode, the filter deletes the mail from the server; in mode 3, the filter delivers the mail to the mail client; and in mode 2, or supervised mode, the filter includes a string in the mail subject, such as "this mail is spam". Therefore, it refers to the option to send the filter to learn
  • the data for subsequent analysis is updated, as indicated by the dashed arrow directed towards the cylinder with the legend "Analysis data learned".
  • the update based on the analysis of the incorrectly classified emails refers to the extraction and incorporation into the heuristic vocabulary of the non-heuristic words found in said emails with a corresponding probability of spam or non-spam, as the case may be, as well as updating the probabilities of spam and non-spam of the words, both heuristic and non-heuristic, if they exist, used to classify said emails, depending on their influence on said incorrect classifications.
  • this refers mainly to the update of the probabilities of spam and non-spam of the words, both heuristic and non-heuristic, if they exist, used to classify these emails, depending on their influence on said correct classifications.
  • the filter deletes the content of the element with the legend "list and temporary emails received", that is, it deletes the temporary mail and its entry from the temporary list, since said temporary emails have already fulfilled their function, either to supply the words to be incorporated into the heuristic vocabulary in order to enlarge and therefore improve it, or to adjust the probabilities of the words of said vocabulary, after which the filter updates its statistics and goes back to standby until the user selects one of the two possible options described.
  • a person skilled in the art could introduce changes and modifications in the described embodiments without departing from the scope of the invention as defined in the appended claims.
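
The pattern lookup described in the definitions above (patterns ordered by spam probability, first match wins, matched word replaced by its heuristic word) can be illustrated with a minimal Python sketch. The heuristic word names "Nosense", "LNS" and "Numbers" come from the text; the criteria attached to them, the probability values, and the names PATTERN_DB, ORDERED_PATTERNS and substitute_heuristic_words are hypothetical, chosen only for illustration. Tokens are assumed to be ASCII words already stripped of surrounding punctuation.

```python
import re

# Hypothetical pattern database: (heuristic word, assumed spam probability, criterion).
# The criteria below are illustrative stand-ins, not the ones defined by the patent.
PATTERN_DB = [
    ("Numbers", 0.60, lambda w: w.isdigit()),
    ("Nosense", 0.95, lambda w: re.search(r"(.)\1{3,}", w) is not None),
    ("LNS", 0.90, lambda w: bool(re.search(r"[a-z]", w)) and bool(re.search(r"[^a-z]", w))),
]

# Patterns are consulted in decreasing order of spam probability, so the first
# match is also the most incriminating one and later patterns need not be checked.
ORDERED_PATTERNS = sorted(PATTERN_DB, key=lambda p: p[1], reverse=True)


def substitute_heuristic_words(words):
    """Replace each word that meets a criterion with the related heuristic word."""
    result = []
    for word in words:
        lowered = word.lower()
        for heuristic_word, _prob, criterion in ORDERED_PATTERNS:
            if criterion(lowered):
                result.append(heuristic_word)
                break  # first pattern in the ordering wins
        else:
            result.append(word)  # no criterion met: keep the word unchanged
    return result
```

Under these assumed criteria, substitute_heuristic_words(["ffffffutbol", "v1agra", "2007", "hello"]) returns ["Nosense", "LNS", "Numbers", "hello"]; the resulting tokens are what a Bayesian scoring step would then look up in the vocabulary.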

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method for sorting e-mail messages into wanted mail and unwanted mail. According to the invention, at least prior to the start of the sorting process which is performed by a Bayesian filter, a group of heuristic patterns and heuristic words is generated, independently of the filter, from an analysis of old mails, said heuristic words forming an initial heuristic vocabulary, with morphological criteria being used to perform the analysis, including the heuristic patterns which associate the morphological criteria with the heuristic words in a database. Said database is supplied together with the initial heuristic vocabulary to the Bayesian filter. The method consists in progressively increasing the initial heuristic vocabulary during the use of the filter, using an intelligent management system which guarantees very effective and efficient sorting, thereby preventing the vocabulary from growing excessively with irrelevant words and the sorting time from increasing unnecessarily.
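
As a rough illustration of the vocabulary management summarized above, the following Python sketch shows one way learning from a misclassified email could avoid needless vocabulary growth: a word that maps to a heuristic word is never stored itself and only adjusts that heuristic word's probabilities, while other new words are added with starting probabilities. The function name learn_from_error, the callback to_heuristic, the starting probabilities and the step size are all assumptions; the source does not specify numeric values or an update formula.

```python
# Hypothetical starting (P(word|spam), P(word|non-spam)) for words added to the vocabulary.
NEW_WORD_PROBS = {"spam": (0.6, 0.4), "non-spam": (0.4, 0.6)}


def learn_from_error(vocabulary, words, true_label, to_heuristic, step=0.05):
    """Update `vocabulary` in place from the words of one misclassified email.

    vocabulary   -- dict: word -> (p_spam, p_non_spam)
    true_label   -- "spam" or "non-spam", as indicated by the user
    to_heuristic -- callable returning the heuristic word for an invalid word, else None
    """
    for word in words:
        heuristic_word = to_heuristic(word)
        # Invalid words are represented by their heuristic word, never stored raw.
        key = heuristic_word if heuristic_word is not None else word
        if key not in vocabulary:
            vocabulary[key] = NEW_WORD_PROBS[true_label]
            continue
        p_spam, p_ham = vocabulary[key]
        # Shift the existing probabilities toward the true category (assumed rule).
        if true_label == "spam":
            p_spam, p_ham = min(p_spam + step, 0.99), max(p_ham - step, 0.01)
        else:
            p_spam, p_ham = max(p_spam - step, 0.01), min(p_ham + step, 0.99)
        vocabulary[key] = (p_spam, p_ham)
    return vocabulary
```

For example, with a vocabulary containing only "Nosense" and a false positive whose words are "ffffffutbol" and "transplant", the raw word "ffffffutbol" is not added, the entry for "Nosense" moves toward non-spam, and "transplant" is added as a new non-heuristic word.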

Description

Title
METHOD FOR CLASSIFYING EMAIL MESSAGES INTO WANTED MAIL AND UNWANTED MAIL
Technical field
The present invention relates in general to a method for classifying email messages into wanted mail and unwanted mail, or spam, by using a probabilistic filter, and in particular to a method for carrying out said classification using a Bayesian filter without the need for prior training.
Prior art
The massive, automated sending of unwanted mail, or spam, is a major problem today. It causes great inconvenience to the users of email clients, as well as to the servers that host their mail accounts.
The fundamental objective of a classification system, and in particular of an email classifier, is twofold: to identify unwanted emails and filter them, that is, to obtain good coverage results, and, on the other hand, to avoid the incorrect classification of valid emails. The latter objective takes priority, since assigning the spam category to a valid email is an error of a higher order than failing to identify an invalid email as spam, because of the possible loss of useful information.
The use of probabilistic filters, especially those based on the Naïve Bayes method, or Bayesian filters, as anti-spam filters is known in the state of the art, since the results obtained with them are quite acceptable in comparison with other filters used previously.
These filters, or mail identification and filtering systems based on the Naïve Bayes classifier, start from the hypothesis that the words in a text belonging to a given thematic category are statistically independent. Normally, most Bayesian filters developed for email learn from a corpus of historical emails classified into two categories, spam and non-spam, from which they obtain an extensive vocabulary that is common to all users of the filter and that, if not updated, may become obsolete within a short period of time.
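
The independence hypothesis means that the score of each category is the category prior multiplied by the per-word probabilities, P(c | w1..wn) ∝ P(c) · Π P(wi | c). The Python sketch below illustrates this generic Naïve Bayes decision; it is not taken from the patent. The function name classify, the vocabulary layout (word mapped to a pair of conditional probabilities), the choice to ignore unknown words and the tie-breaking toward non-spam are assumptions made for the example.

```python
import math


def classify(words, vocabulary, prior_spam=0.5):
    """Generic Naive Bayes decision over a list of words.

    vocabulary -- dict: word -> (P(word|spam), P(word|non-spam)), both in (0, 1)
    Returns (label, log_score_spam, log_score_non_spam).
    """
    # Work in log space so that long messages do not underflow.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in words:
        if word not in vocabulary:
            continue  # assumption: words outside the vocabulary carry no evidence
        p_spam, p_ham = vocabulary[word]
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    # Assumption: ties are resolved as non-spam, the less costly error.
    label = "spam" if log_spam > log_ham else "non-spam"
    return label, log_spam, log_ham
```

With a toy vocabulary such as {"offer": (0.8, 0.1), "meeting": (0.05, 0.6)}, the message ["offer", "offer", "meeting"] is labelled spam and ["meeting"] is labelled non-spam; a message with no known words falls back to the priors.
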
One example of the use of such filters is proposed in patent US-A-6161130, which concerns a message classification method that uses a probabilistic classifier, such as a Bayesian one, able to recognize a series of features that may be present in incoming messages. To that end the filter undergoes prior training with a training set formed by a plurality of messages that include said features; the set is updated according to the result of the analysis of each message, and the filter is re-trained with the updated set.
These features refer to attributes, for example, of the message format, such as whether a word in the message is capitalized, or whether the message contains a series of punctuation marks.
Patent application US2005/0192992 proposes a system and a method for determining the "purpose" of incoming data, for example to check whether they are spam, which use as a classifier, for example, a Bayesian model. Various criteria are proposed for analyzing the messages: heuristic criteria, inference criteria, and the extraction of syntactic, linguistic and other features from the analyzed messages. For the embodiment that contemplates the use of a Bayesian filter, the filter is trained on a data set generated manually or automatically and used during a training phase of the Bayesian filter.
Both cited antecedents share the above-described drawback of requiring prior training of the Bayesian filter, from which they obtain an extensive vocabulary. Although application US2005/0192992 contemplates updating what it calls the data set, this still implies using the Bayesian filter during said training phase to obtain at least an initial vocabulary with which to begin classifying incoming messages in real time with minimum guarantees of reliability.
Explanation of the invention
It therefore appears necessary to offer an alternative to the state of the art that makes it possible to create an initial vocabulary with which a probabilistic filter, such as a Bayesian filter, can start working from the outset to identify and filter unwanted incoming mail, or spam, without any training on previously classified historical mails.
The present invention concerns a method for sorting e-mail messages into wanted mail and unwanted mail, or spam, of the type based on the use of a probabilistic filter to perform said sorting, in particular a filter based on the Naïve Bayes method, or Bayesian filter.
The method comprises generating, without the intervention of the Bayesian filter, that is, independently of the filter, and before said classification begins, a group of heuristic patterns and heuristic words from the analysis of a series or corpus of historical mails, said heuristic words forming an initial heuristic vocabulary and said analysis using at least some morphological criteria. The method further comprises including said heuristic patterns, which relate said morphological criteria to the heuristic words, in a database, and supplying said database together with the initial heuristic vocabulary to said probabilistic filter, which is able to start operating from them.
Several examples of heuristic words and morphological criteria are given in a later section. To clarify the description in the following paragraphs, the heuristic word "Nosense" and its related morphological criterion "consecutive consonants >= 4" serve as a running example. In this example, the heuristic word "Nosense" would be generated from the aforementioned analysis of historical mails using that criterion, that is, by finding words that have four or more consecutive consonants, such as the word "ffffffutbol".
The method proposed by the present invention comprises assigning to, and associating with, each heuristic word in said initial heuristic vocabulary a probability that the presence in a mail of words meeting the morphological criteria referenced by that heuristic word indicates that the mail is unwanted, or has a high probability of being spam. For example, if words with "consecutive consonants >= 4" ("ffffffutbol" and others) are found in many historical mails that turn out to be spam, the related heuristic word "Nosense" is assigned a very high spam probability in the heuristic vocabulary.
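The running example above can be sketched in Python as follows. This is a minimal illustration, not the procedure of the invention: the definition of a consonant (ASCII letters other than a, e, i, o, u), the add-one smoothing, the function names and the three-mail corpus are all assumptions introduced here.

```python
import re

# Assumption: ASCII letters other than a, e, i, o, u count as consonants.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def meets_criterion(word: str) -> bool:
    """True if the word has four or more consecutive consonants."""
    return CONSONANT_RUN.search(word) is not None


def spam_probability(mails: list[tuple[str, bool]]) -> float:
    """Estimate the spam probability for the heuristic word "Nosense".

    `mails` holds (text, is_spam) pairs from a labelled historical corpus.
    The estimate is the smoothed fraction of spam among the mails that
    contain at least one word meeting the criterion (add-one smoothing is
    an assumption of this sketch).
    """
    spam_hits = ham_hits = 0
    for text, is_spam in mails:
        if any(meets_criterion(word) for word in text.split()):
            if is_spam:
                spam_hits += 1
            else:
                ham_hits += 1
    return (spam_hits + 1) / (spam_hits + ham_hits + 2)


# Hypothetical corpus, for illustration only.
corpus = [
    ("ffffffutbol gratis", True),
    ("xkcdq offer", True),
    ("meeting at noon", False),
]
vocabulary = {"Nosense": spam_probability(corpus)}
```

With this toy corpus both mails containing a matching word are spam, so "Nosense" receives a high probability; a real corpus would of course be far larger.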
Once said group of heuristic patterns and the initial heuristic vocabulary have been generated, the proposed method comprises using the Bayesian filter to consult the heuristic patterns in said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search the mails to be classified for words that meet one or more of said morphological criteria, to assign to each word found the heuristic word related to the criterion it meets, and to consult the heuristic vocabulary to assign the probabilities that said heuristic words have in it.
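How the probabilities looked up in the heuristic vocabulary might be turned into a single score can be sketched as follows. The text does not give the combination formula; the sketch assumes the usual Naïve Bayes combination with independent words and equal priors, and the vocabulary values are made-up placeholders.

```python
import math


def combine(probabilities: list[float]) -> float:
    """Combine per-word spam probabilities under a Naive Bayes assumption.

    Assumes independent words and equal priors for spam and wanted mail.
    Every probability must lie strictly between 0 and 1.
    """
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


def score(heuristic_words: list[str], vocabulary: dict[str, float]) -> float:
    """Look up each heuristic word in the vocabulary and combine the results.

    Words absent from the vocabulary are ignored; with no known word the
    score is the neutral value 0.5.
    """
    known = [vocabulary[w] for w in heuristic_words if w in vocabulary]
    return combine(known)


# Placeholder probabilities, not values from the source.
example_vocabulary = {"Nosense": 0.9, "Minsize": 0.6}
example_score = score(["Nosense", "Minsize"], example_vocabulary)
```

A mail would then be sorted as spam when the score exceeds a chosen threshold, which is likewise left open here.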
In one embodiment, said assignment of the related heuristic word to each word found is carried out by directly replacing the word in question with the heuristic word in the mail to be classified, for example the word "ffffffutbol" with "Nosense", every time it appears in the mail, after which the probabilistic analysis of the mail proceeds taking the word "Nosense" as reference.
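The substitution step of this embodiment can be sketched in a few lines. The whitespace tokenisation and the ASCII definition of a consonant are assumptions of the sketch.

```python
import re

# Assumption: ASCII letters other than a, e, i, o, u count as consonants.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def substitute(text: str) -> str:
    """Replace every word with >= 4 consecutive consonants by "Nosense"."""
    return " ".join(
        "Nosense" if CONSONANT_RUN.search(word) else word
        for word in text.split()
    )


rewritten = substitute("ver ffffffutbol hoy ffffffutbol")
```

After the substitution, the probabilistic analysis only has to know the word "Nosense", however many distinct malformed spellings occur in the mail.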
In a preferred embodiment, even if a word meets two or more different morphological criteria (for example, not only the cited "consecutive consonants >= 4"), the method comprises using the Bayesian filter to search the mails to be classified for words that meet the criterion related by a first heuristic pattern of the consulted database, and to assign to such a word the heuristic word related by said first pattern and the probabilities assigned to it in the heuristic vocabulary. In a preferred embodiment, said heuristic patterns are ordered in the database according to the spam probability assigned to the heuristic words they relate, and the Bayesian filter consults them in that order. Consequently, if a word meets said first pattern, which is associated with a higher spam probability than a possible second or third pattern (in the order defined above), there is no need to check whether the word also meets those later patterns.
The proposed method also comprises progressively updating said probabilities of said heuristic words according to their influence on the outcome of the analyses of the mails classified by said probabilistic filter, increasing a probability if the influence of the associated heuristic word has led to a correct classification, and decreasing it otherwise. For example, if a series of mails containing the word "ffffffutbol" one or more times had been wrongly catalogued as unwanted mail, it would follow that finding that word in a mail does not imply that the mail is probably unwanted, so the probability associated with the heuristic word "Nosense" would be reduced.
A filter must evolve in order to adapt to the evolution of mail and thus maintain its level of effectiveness. To that end, the proposed method comprises using the filter in such a way that it learns gradually while it classifies.
To this end, the proposed method comprises improving the classification capability of said Bayesian filter through an interaction with the user, in which the user tells the Bayesian filter which mails it has classified correctly and/or which it has classified incorrectly, either each time the filter classifies mails or periodically.
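The ordered consultation of patterns and the feedback-driven update of probabilities described above can be sketched together. The two patterns, their probabilities, the fixed step size and the clamping bounds are placeholders chosen for illustration; the text does not specify how large each update is.

```python
import re
from typing import Optional

# Each pattern relates a heuristic word to one morphological criterion.
# Patterns and probabilities are illustrative placeholders.
patterns = [
    ("Minsize", re.compile(r"^.?$")),  # words of length <= 1
    ("Nosense", re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)),
]
vocabulary = {"Minsize": 0.60, "Nosense": 0.95}


def ordered_patterns():
    """Patterns sorted by the spam probability of their heuristic word."""
    return sorted(patterns, key=lambda p: vocabulary[p[0]], reverse=True)


def heuristic_word(word: str) -> Optional[str]:
    """Return the heuristic word of the first pattern the word satisfies."""
    for name, criterion in ordered_patterns():
        if criterion.search(word):
            return name  # later patterns are not checked
    return None


def feedback(name: str, correct: bool, step: float = 0.05) -> None:
    """Raise the probability after a correct classification, lower it otherwise.

    The fixed step and the [0.01, 0.99] clamp are assumptions of this sketch.
    """
    delta = step if correct else -step
    vocabulary[name] = min(0.99, max(0.01, vocabulary[name] + delta))
```

If the user reports that mails containing "ffffffutbol" were wrongly sorted as spam, calling `feedback("Nosense", False)` lowers the probability of "Nosense", and the order in which the patterns are consulted follows the updated values.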
As regards the cited adaptation to the evolution of mail, the method comprises progressively updating and enlarging said initial heuristic vocabulary supplied to said filter, which initially consists only of heuristic words, while said Bayesian filter is in use on the e-mails it classifies. The mails classified incorrectly by said Bayesian filter are analysed and, in a preferred embodiment, one or more words, heuristic and non-heuristic, obtained from said analysis are added to the initial heuristic vocabulary. Preferably, said heuristic and non-heuristic words are obtained by analysing the words contained in said incorrectly classified mails, again using at least said morphological criteria; this analysis is preferably performed by the Bayesian filter itself, and preferably on every word of said mails.
In one embodiment, the method comprises carrying out said updating of said heuristic vocabulary at least in part automatically, analysing each e-mail either immediately or after a programmed period of time once it has been classified by the probabilistic filter and validated by the user.
In another embodiment, the method comprises carrying out said updating of said heuristic vocabulary at least in part at the request of a user, analysing only a series of e-mails indicated by the user.
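The enlargement of the vocabulary from an incorrectly classified mail might look like the following sketch. The single criterion used, the lower-casing of words and the initial probabilities given to new entries are assumptions; the text only states that heuristic and non-heuristic words are added.

```python
import re

# Assumption: ASCII letters other than a, e, i, o, u count as consonants.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def extend_vocabulary(
    vocabulary: dict[str, float],
    text: str,
    is_spam: bool,
    initial_spam: float = 0.6,
    initial_ham: float = 0.4,
) -> dict[str, float]:
    """Add the words of a misclassified mail to the vocabulary.

    `is_spam` is the correct label supplied by the user. Words meeting the
    criterion are entered under the heuristic word "Nosense"; all other
    words are entered as non-heuristic words. Existing entries keep their
    current probability.
    """
    for word in text.lower().split():
        key = "Nosense" if CONSONANT_RUN.search(word) else word
        vocabulary.setdefault(key, initial_spam if is_spam else initial_ham)
    return vocabulary


vocab = extend_vocabulary({"Nosense": 0.9}, "Oferta xkcdq", is_spam=True)
```

Here the non-heuristic word "oferta" joins the vocabulary, while "xkcdq" maps to the already known heuristic word and adds no new entry.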
As regards the aforementioned morphological criteria, they refer to at least one of the following: the length of each word analysed, the order or sequence and type of the characters it contains, the adjacency of characters, or a combination thereof, said characters being consonants, vowels, numbers and/or symbols.
The section explaining some embodiments describes how said morphological criteria are used, by applying the proposed method, to carry out the aforementioned analysis.
In order to filter unwanted mail in a multitude of languages, the proposed method comprises using morphological criteria that refer to the morphology of words for a plurality of different languages. The heuristic vocabulary described therefore represents the words considered invalid because their morphology, that is, the way they are structured, is incorrect in any language.
- Brief description of the drawings
The foregoing and other advantages and features will be more fully understood from the following detailed description of some embodiments, some of them with reference to the attached drawing, which are to be taken as illustrative and non-limiting, in which:
Fig. 1 is a flow chart illustrating a series of actions to be performed according to an embodiment of the method proposed by the present invention.
- Detailed description of some embodiments
In a preferred embodiment, in order to obtain the aforementioned initial heuristic vocabulary, the method comprises using a finite-state automaton to generate said group of heuristic patterns and heuristic words, carrying out said analysis according to one or more morphological criteria so as to screen the words supplied to said automaton, separating them into valid words and invalid words according to the morphological criterion used by the automaton.
First, the method comprises building and training said finite-state automaton by supplying it with words extracted from mails labelled as valid, generally by a user. Once trained, the automaton is able to recognise as valid those words that are well formed as regards the sequence and type of the characters that compose them, the sequence and type of the characters contained in the words being the morphological criterion used by the automaton in this preferred embodiment.
Once the automaton has been built and trained, the method comprises supplying it with words extracted from mails labelled as not valid, generally by a user, and classifying as invalid words those that the automaton does not recognise as valid. This produces the aforementioned screening, so that only the words not recognised by the automaton are subsequently subjected to the analysis described above, which leads to the corresponding heuristic words that are included in the initial heuristic vocabulary. The finite-state automaton used to apply the proposed method is therefore adapted, once trained, to identify automatically words (or tokens) that are correctly formed and to distinguish them from those that are morphologically malformed. In other words, the automaton is able to recognise the grammar that describes well-formed words, understood as sequences of characters, for a plurality of different languages.
As noted above, said characters can be a consonant ('c'), a vowel ('v'), a number ('n') or a symbol ('s'). The automaton is built from the words of said mails labelled as valid using an adapted ECGI grammatical-inference algorithm, by means of which a set of examples of different well-formed words, taken from a corpus of valid mails, can be obtained. These are the words supplied to the automaton during its training, so that it uses them as a reference for recognising morphologically well-formed words.
Thus, for example, the valid word "científico-técnico" ("scientific-technical"), presented as the string of terms "c v v c c v c v c v s c v c c v c v", will be recognised by the automaton once it has been trained to do so. Each time the automaton recognises a word, the word is considered to be well formed.
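The encoding of a word as a string of character classes, and the idea of an automaton trained only on valid words, can be sketched as follows. The adapted ECGI algorithm is not reproduced here: as a much coarser stand-in, the sketch learns which transitions between consecutive character classes occur in valid words and accepts a word only if all of its transitions have been seen. The vowel set, the class names and the training words other than "científico-técnico" are assumptions.

```python
# Assumption: vowels are the Spanish vowels, with and without accents.
VOWELS = set("aeiouáéíóúü")


def char_class(ch: str) -> str:
    """Map a character to 'c' (consonant), 'v' (vowel), 'n' (number) or 's' (symbol)."""
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "v" if ch.lower() in VOWELS else "c"
    return "s"


def encode(word: str) -> str:
    """Present a word as a space-separated string of character classes."""
    return " ".join(char_class(ch) for ch in word)


class ClassAutomaton:
    """Simplified stand-in for the trained automaton (not ECGI).

    States are character classes plus start '^' and end '$'; the allowed
    transitions are those observed in the valid training words.
    """

    def __init__(self) -> None:
        self.transitions: set[tuple[str, str]] = set()

    @staticmethod
    def _path(word: str) -> list[str]:
        return ["^"] + [char_class(ch) for ch in word] + ["$"]

    def train(self, valid_words: list[str]) -> None:
        for word in valid_words:
            path = self._path(word)
            self.transitions.update(zip(path, path[1:]))

    def accepts(self, word: str) -> bool:
        path = self._path(word)
        return all(step in self.transitions for step in zip(path, path[1:]))


automaton = ClassAutomaton()
automaton.train(["científico-técnico", "mesa", "correo"])
encoded = encode("científico-técnico")
```

Because this stand-in only checks neighbouring classes, it generalises to untrained words such as "casa" but cannot express limits on the length of a run of consonants; an inferred grammar of the kind the text describes would be more discriminating.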
Words taken from spam mails, such as "v1@gra", represented by the string "c n s c c v", are not recognised by the automaton as valid words and are identified as invalid words. An automaton that recognises well-formed words, rather than deceptive words, is preferred because well-formed words are more uniform and homogeneous, so fewer examples are needed to build a grammar that represents them fairly completely.
By way of example, a series of heuristic words generated by applying the proposed method is set out below, together with the criterion or criteria to which they are related by one or more heuristic patterns.
[Figure imgf000010_0001: table of heuristic words and their related morphological criteria; image not reproduced]
Once the automaton has carried out the aforementioned screening, the method comprises analysing, according to one or more of said morphological criteria, the words classified by the automaton as invalid and adding at least one heuristic word to the heuristic vocabulary. A heuristic word may be obtained using a single morphological criterion, as is the case, for example, of the heuristic word "Minsize", related to the single criterion "Words of length <= 1". It may be obtained using more than one criterion, as is the case of the heuristic word "Nosense", which can be obtained from the five criteria set out, such as "Consecutive consonants >= 4" and "Consecutive vowels >= 4"; that is, the same heuristic word can be reached from more than one criterion, in other words by different paths. Or it may be obtained using a compound morphological criterion, as is the case of the "Metapalabra LNS" (LNS metaword), which can be obtained from the morphological criterion "Letters and/or Numbers and/or Symbols, consecutive, in any order", represented by the language described in Backus-Naur form as {L, N, S}+, and which is instantiated by different heuristic words such as "LNS", "SN" or "NSLN".
In one embodiment, the method comprises composing all the morphological criteria before they are used to carry out said analysis, and they are not modified at any time. In another embodiment, the method comprises composing beforehand only one or more initial morphological criteria, and modifying them, or creating new morphological criteria, while said analysis is being carried out and according to the results obtained with it.
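One plausible reading of how the compound criterion {L, N, S}+ is instantiated is sketched below: each character is mapped to L (letter), N (number) or S (symbol), and runs of the same class are collapsed into a single letter. The collapsing rule is an assumption inferred from the instances "LNS", "SN" and "NSLN" given in the text, not something the text states.

```python
from itertools import groupby


def lns_class(ch: str) -> str:
    """Map a character to L (letter), N (number) or S (symbol)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def lns_instance(word: str) -> str:
    """Collapse each run of same-class characters into one class letter."""
    return "".join(key for key, _run in groupby(word, key=lns_class))


instance = lns_instance("$100")
```

Under this reading a token such as "$100" yields the heuristic word "SN", and a token made of a number, a symbol, a letter and a number yields "NSLN".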
The method also comprises, for some or all of said heuristic words, obtaining at least one heuristic pattern per heuristic word, each pattern relating a heuristic word to at least one different morphological criterion, as is the case, for example, of the heuristic pattern that relates the heuristic word "Minsize" to the single morphological criterion "Words of length <= 1".
In another embodiment, the method comprises, for some or all of said invalid words analysed, obtaining at least one heuristic pattern per invalid word analysed, each pattern for a different morphological criterion, as is the case, for example, of the five possible relations, represented by five corresponding heuristic patterns, between the heuristic word "Nosense" and the five morphological criteria set out, such as "Consecutive consonants >= 4" and "Consecutive vowels >= 4". Continuing with the example of the heuristic word "Nosense", in another embodiment said five relations could be represented by a number of heuristic patterns that is not necessarily five.
In one embodiment, the method comprises recording the heuristic patterns obtained in said database. In another embodiment, the method comprises also recording in said database the morphological criteria that said heuristic patterns relate to the heuristic words obtained from them.
The proposed method comprises using the Bayesian filter to analyze received emails on one or more of the following fields separately, according to a selection made by the user: sender, subject and body. In an embodiment in which more than one field has been selected, the results of all the analyses are combined, in a weighted manner, to obtain a final result.
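The weighted combination of per-field results can be pictured with a short Python sketch. The text does not state how the weights are chosen or how the results are merged, so the weighted average below, the field names used as dictionary keys and the example weights are all assumptions.

```python
def combine_field_scores(scores, weights):
    """Weighted average of per-field spam probabilities (assumed combination rule).

    scores:  {field: spam probability} for the fields the user selected.
    weights: {field: non-negative weight}; the values are hypothetical.
    """
    total = sum(weights[field] for field in scores)
    if total <= 0:
        raise ValueError("at least one selected field needs a positive weight")
    return sum(scores[field] * weights[field] for field in scores) / total


# Example: the user selected only subject and body; the body counts double.
weights = {"sender": 1.0, "subject": 1.0, "body": 2.0}
final_score = combine_field_scores({"subject": 0.9, "body": 0.6}, weights)
```

Normalizing by the weights of the selected fields only keeps the result in the range 0 to 1 whichever fields the user enables.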
Because an email message contains these parts, or fields (subject, sender and body), the method comprises, in one embodiment, generating a single initial heuristic vocabulary common to the subject and body parts and valid for both, or, in another embodiment, generating a separate and different heuristic vocabulary for each of said parts.
Continuing with the example heuristic words given above, namely "Nosense", "Minsize", "Cns", "Raresymbs", "Numbers" and "Metapalabra LNS", these are the words that, in the present embodiment, make up the initial heuristic vocabulary (with their corresponding spam probabilities) supplied to the Bayesian filter for the subject and body parts, together with the database of heuristic patterns that relate them to the morphological criteria. This is the initial state, that is, before the filter has classified any email. As explained in an earlier section, over its lifetime this vocabulary will contain those same initial heuristic words plus other new words incorporated as a result of analyzing the emails that the Bayesian filter classified incorrectly; those new words may be heuristic or non-heuristic.
An example of such a non-heuristic word is "trasplantte", which has only two consonants in a row and therefore meets none of the morphological criteria related to the heuristic words, but which the filter can extract from a misclassified email and add to the heuristic vocabulary.
For example, consider the first email the filter receives for classification. Although the filter preferably analyzes each part of the email separately, we focus here on the subject field. For each word it finds, the filter first checks whether the word is in the heuristic vocabulary, either the common one or the one associated with the subject field, depending on the embodiment. Suppose the subject field contains "¿Necesitas un trasplantte de corazOn?" (a misspelled "Do you need a heart transplant?"). The filter's vocabulary (the heuristic vocabulary supplied to it) holds only the heuristic words Nosense, Minsize, Cns, Raresymbs, Numbers and Metapalabra LNS, so it does not contain the words "necesitas", "trasplantte" or "corazOn". The filter's next step is therefore to check whether each of these three words meets any criterion referenced by a heuristic pattern in the database. In this example, only the word "corazOn" would meet the morphological criterion "Letters and/or Numbers and/or Symbols consecutive in any order", and it would be assigned the heuristic word "LNL". The filter knows nothing about the other two words, so it computes the SPAM and NO-SPAM probability of the subject, as any Bayesian filter would, taking into account only the a priori probabilities assigned to Metapalabra LNS.
If the email is classified into one of the two categories and the user approves, the vocabulary is not updated with new words, but the probabilities of the words it contains are updated, since their use led to a correct classification.
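The text says only that the subject's probability is computed "as any Bayesian filter would" and gives no formula. As a hedged illustration, the sketch below uses the combination rule common in naive Bayesian spam filters, which assumes independent tokens and equal priors; the function name, the neutral value 0.5 for an empty token list and the example probability are assumptions, not part of the described method.

```python
import math


def spam_probability(token_probs):
    """Combine per-token spam probabilities under a naive independence assumption.

    token_probs holds P(spam | token) for the vocabulary entries found in a field.
    With no known tokens the result is the neutral value 0.5 (an assumption).
    """
    if not token_probs:
        return 0.5
    spam = math.prod(token_probs)
    ham = math.prod(1.0 - p for p in token_probs)
    if spam + ham == 0:
        return 0.5  # contradictory certainties; fall back to neutral
    return spam / (spam + ham)


# In the walk-through only one vocabulary entry contributes to the subject;
# the value 0.8 is a made-up prior for that entry.
subject_score = spam_probability([0.8])
```

With a single contributing entry the combined score simply equals that entry's probability, which matches the walk-through where only one heuristic word is known to the filter.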
If, on the other hand, the user considers the classification incorrect (the filter classified the email as SPAM when it is NO-SPAM, or as NO-SPAM when it is SPAM), the filter adds the words "necesitas" and "trasplantte" to the subject vocabulary (whether that vocabulary is its own or shared with other fields), with the corresponding SPAM and NO-SPAM probabilities. The vocabulary of the subject part, or the one common to the subject and body parts if applicable, would then consist of {Nosense, Minsize, Cns, Raresymbs, Numbers, Metapalabra LNS, necesitas, trasplantte, LNL}.
Obviously the word "trasplantte" is invalid and meant to deceive, but since it meets none of the morphological criteria referenced by the heuristic patterns, it enters the heuristic vocabulary as is. Adopting a morphological pattern or criterion based on two consecutive consonants would generate many false positives, since countless valid words have two or three consonants in a row ("constipado", "absoluto", "inspirar", etc.). This extended vocabulary, made up of the initial heuristic words plus the heuristic and non-heuristic words added later, is the one the Bayesian filter uses in subsequent analyses. The more the vocabulary grows, the greater the filter's precision, provided the growth is not excessive, since that would lengthen classification time and therefore reduce the filter's efficiency.
The present invention ensures, as explained below, that the vocabulary does not grow needlessly. The words added to it are relevant to the later classifications carried out by the Bayesian filter, and words that would only enlarge the vocabulary, increasing classification time while contributing practically nothing to the filter's discriminating power, are discarded.
To avoid this excessive growth of the vocabulary, or vocabularies depending on the embodiment, the method proposed by the present invention comprises managing them intelligently.
In the embodiment in which each part or field is associated with its own vocabulary, this intelligent management is carried out for each of those vocabularies. It is achieved through specific actions aimed at suitably selecting the words to add to each vocabulary, grouped into the following three groups:
1) Correctly classified emails are removed from the filter's domain, and the vocabularies of the subject and body parts are not extended with the new words those emails contained, since the reason for their correct classification lies in the words they contained that were already present in the vocabulary. Adding the new words of emails correctly classified in a category (spam or non-spam) to the vocabulary, as words with a higher probability of belonging to that category, would increase the size of the vocabulary without any certainty that those words contribute relevant information, given that they could belong to either of the two categories (spam and non-spam).
2) When the filter has to learn from classification errors, the vocabularies of the subject and body parts are updated with the words contained in the misclassified emails, whether false positives or false negatives, and the new words, both heuristic and non-heuristic, are assigned the spam and non-spam probabilities that the filter sets for new heuristic and new non-heuristic words. If, in this case where the filter learns from a misclassified email, the email contains invalid words that match a criterion referenced by a heuristic pattern, those words are not added to the vocabulary; only the probabilities of the heuristic word that the pattern relates to the matched criterion are updated.
3) For the sender part, the initial heuristic vocabulary is empty, so the Bayesian filter starts out using that empty vocabulary to analyze the sender field of incoming emails, which results in that field being classified as non-spam. The weighted classification made by the filter for the three parts of an email yields its final classification, which may be spam or non-spam. If the user tells the filter that the classification was correct, the proposed method comprises storing in the sender vocabulary only the senders of non-spam emails, with a probability for the non-spam category higher than for spam. When the filter has to learn because emails were misclassified, it stores only the email addresses, or senders, of the false positives, that is, of emails that are actually non-spam. The senders of false negatives, that is, of emails wrongly assigned to the non-spam category, are ignored, since unsolicited emails almost never come from the same sender and including them would needlessly enlarge the vocabulary of this part.
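The three groups of actions can be summarized as a small decision sketch in Python. It reflects one reading of the rules above: the function names, the use of sets, the string labels and the `to_heuristic` callback (which returns the heuristic word for an invalid word, or None) are hypothetical, and the probability updates themselves are left out because the text does not specify them.

```python
def plan_word_update(vocabulary, tokens, to_heuristic, correctly_classified):
    """Subject/body rule (groups 1 and 2).

    Returns (entries_to_add, entries_to_reweight):
    - entries already in the vocabulary only have their probabilities updated;
    - unknown entries are added only when the email was misclassified.
    """
    to_add, to_reweight = set(), set()
    for token in tokens:
        # An invalid word is represented by its heuristic word, never stored itself.
        entry = to_heuristic(token) or token
        if entry in vocabulary:
            to_reweight.add(entry)
        elif not correctly_classified:
            to_add.add(entry)
    return to_add, to_reweight


def should_store_sender(actual_label):
    """Sender rule (group 3): keep only senders of mail that is really non-spam."""
    return actual_label == "non-spam"


# Subject example from the text, after the user reports a misclassification.
vocabulary = {"Nosense", "Minsize", "Cns", "Raresymbs", "Numbers", "Metapalabra LNS"}
to_heuristic = lambda token: "LNL" if token == "corazOn" else None
added, reweighted = plan_word_update(
    vocabulary, ["necesitas", "trasplantte", "corazOn"], to_heuristic, False
)
```

Because the sender rule depends only on the email's true category, it gives the same answer whether the filter's original classification was right or wrong.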
In other words, thanks to the intelligent management described, the initial heuristic vocabularies of the body and subject fields grow only with heuristic and non-heuristic words extracted from misclassified emails, whereas the vocabulary of the sender field, which is initially empty, grows only with the email addresses of emails that are actually non-spam, regardless of whether they were classified correctly or incorrectly. Action groups 1) and 2) also apply to the embodiment in which the subject and body fields share a single common vocabulary.
As regards the Bayesian filter itself, the proposed method comprises selection by a user of the operating mode of the filter from among the following three modes, which determines the action taken on received emails:
- mode 1, or blind trust mode, in which emails that the filter identifies as spam are not delivered to the user,
- mode 2, or supervised mode, in which emails that the filter detects as spam are delivered to the user with a character string, or tag, included in the subject field of the email, and
- mode 3, in which all emails are delivered to the user, regardless of whether or not they are spam.
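The effect of the three modes on a single email can be sketched in Python as follows. The enum member names (the text gives no name for mode 3), the function name, the default tag and the choice to place the tag before the subject are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    BLIND_TRUST = 1   # mode 1: spam is not delivered
    SUPERVISED = 2    # mode 2: spam is delivered with a tag in the subject
    DELIVER_ALL = 3   # mode 3: everything is delivered (name is hypothetical)


def subject_to_deliver(subject, is_spam, mode, tag="[SPAM]"):
    """Return the subject as it would be delivered, or None if the email is withheld."""
    if not is_spam:
        return subject
    if mode is Mode.BLIND_TRUST:
        return None
    if mode is Mode.SUPERVISED:
        # Where the tag goes within the subject is an assumption; here it is a prefix.
        return f"{tag} {subject}"
    return subject
```

Emails that the filter does not flag pass through unchanged in every mode; only the handling of flagged emails differs.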
When the Bayesian filter operates in mode 2, or supervised mode, the method comprises the user choosing which character string, or tag, to include in the subject field of the email.
The proposed method is applicable regardless of the email client used, and comprises adjusting the error bias of the Bayesian filter, that is, whether misclassified emails tend to be labelled as spam or as non-spam, through the user's modification of a given parameter.
The proposed method comprises showing the user at least one of the following elements, or a combination of them:
- filter statistics, indicating the percentage of hits and errors for detected and undetected spam emails,
- a history or trace of the user's connections to the mail server, showing for each one the date, the time and the number of emails received together with the filter's results, as well as any occasional connection errors,
- a list of the received emails together with the identification the filter made of each one, on which the user can tell the filter which classifications were right and which were wrong so that it learns from them, and from which emails can be deleted so that the filter does not consider them again,
- a configuration panel where the user can adjust certain filtering and connection parameters, and
- an indicator that the user can select to stop or start the filtering service at any time.
Fig. 1 shows a flowchart illustrating the use of the Bayesian filter, both to classify incoming emails and to learn from its hits and errors, through a series of actions carried out according to the method proposed by the present invention. The actions are representative of one embodiment, and some of them have already been described above; they are carried out by the Bayesian filter once the initial heuristic vocabulary and the heuristic patterns have been created.
The words included in the figure should be read as indications of those actions and, where applicable, of the elements used by the proposed method, such as the following:
- Learned analysis data: refers both to the heuristic vocabulary and to the database containing the heuristic patterns.
- Mail list: refers to the index of emails shown to the user, that is, the subject and sender fields of the emails.
- List and temporary emails: refers to the complete emails, that is, with the subject, sender and body fields.
In Fig. 1, the user is first given the option of changing the configuration, for example through the configuration panel mentioned above, to modify the parameters, which in Fig. 1 are stored in a data structure represented by a cylinder on the right of the figure.
Next, after the filter has been waiting for some time, the user has two options, depending on whether they want to receive emails (left side of Fig. 1) or to instruct the filter to learn (right side of Fig. 1).
If the user wants to receive emails, the user's email client first connects to the filter, which acts as an intermediary, or proxy server, in front of the incoming mail server, for example a POP (or POP3) server, to which the filter in turn connects. Once the connection is established, if the user does not want to receive mail at that moment, the filter returns to the waiting state. If the user does want to receive mail, the filter requests it from the POP server, which sends it, after which the filter processes it by extracting its vocabulary (for some or all of the fields, depending on the user's selection). The filter analyzes the parts (subject, sender and body) of the extracted vocabulary and, combining them, determines whether or not the email is spam. This analysis consults the heuristic vocabulary and the database generated and supplied to the filter according to the proposed method, as indicated in Fig. 1 by the dashed arrow coming from the cylinder labelled "Learned analysis data" described above.
The filter temporarily stores the vocabulary extracted from the email and adds an associated entry to the temporary list mentioned above, as shown by the dashed arrow in Fig. 1 pointing to the element labelled "List and received temporary emails". If the filter has detected that the received email is spam, it acts according to the mode selected by the user: in mode 1, or blind trust mode, the filter deletes the email from the server; in mode 3, the filter delivers the email to the email client; and in mode 2, or supervised mode, the filter includes a string in the subject of the email, such as "this email is spam".
As regards the option of instructing the filter to learn (right side of Fig. 1), if the user selects it, the filter first checks whether the list of received emails (subject and sender), or index list, is empty. If it is, the filter remains waiting; if it is not, the user can select one or more emails from the index list. Selecting an email means that the user considers it to have been classified incorrectly by the filter, so that the filter can learn from its errors. The emails that the user does not select are considered correctly classified and are used by the filter to learn from its hits. These emails are held in the "List and received temporary emails" described above, as indicated by the two dashed arrows coming from the element with that label (see the lower right of Fig. 1). Whether learning from incorrectly or from correctly classified emails, the filter analyzes them and then updates the data for subsequent analyses, as indicated by the dashed arrow pointing to the cylinder labelled "Learned analysis data".
As regards the update based on the analysis of incorrectly classified emails, it consists of extracting the non-heuristic words found in those emails and adding them to the heuristic vocabulary with the corresponding spam or non-spam probabilities, as the case may be, and of updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic, if any, that were used to classify those emails, according to their influence on the incorrect classifications.
The update based on the analysis of correctly classified emails consists mainly of updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic (if any), that were used to classify those emails, according to their influence on the correct classifications.
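The two update paths described above can be illustrated with a minimal Python sketch. The text does not give the update formula, so this sketch swaps in a simple count-based scheme with Laplace smoothing as a stand-in for "influence"; the names `TempEmail`, `Vocabulary`, `learn` and `flagged` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class TempEmail:
    """One entry of the temporary list: tokens plus the filter's verdict."""
    tokens: list[str]
    predicted_spam: bool


@dataclass
class Vocabulary:
    """Per-word spam / non-spam counts (assumed storage layout)."""
    spam: dict[str, int] = field(default_factory=dict)
    ham: dict[str, int] = field(default_factory=dict)

    def p_spam(self, word: str) -> float:
        # Laplace-smoothed share of spam occurrences for this word.
        s, h = self.spam.get(word, 0), self.ham.get(word, 0)
        return (s + 1) / (s + h + 2)


def learn(vocab: Vocabulary, temp_list: list[TempEmail], flagged: set[int]) -> None:
    """Update the vocabulary from user feedback, then clear the temporary list.

    `flagged` holds the indices the user selected as misclassified;
    every other entry is treated as correctly classified.
    """
    for i, mail in enumerate(temp_list):
        wrong = i in flagged
        truly_spam = (not mail.predicted_spam) if wrong else mail.predicted_spam
        counts = vocab.spam if truly_spam else vocab.ham
        for word in set(mail.tokens):
            # Assumption: new words enter the vocabulary only via misclassified emails;
            # words already known are updated in both cases.
            if wrong or word in vocab.spam or word in vocab.ham:
                counts[word] = counts.get(word, 0) + 1
    temp_list.clear()
```

In this sketch a flagged email both contributes new words and shifts the probabilities of known words, while an unflagged email only reinforces words that are already in the vocabulary.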
Once the update described above has been completed, the filter deletes the content of the element labeled "list and temporary emails received"; that is, it deletes the temporary emails and their entries in the temporary list, since those temporary emails have already served their purpose, whether by supplying words to be incorporated into the heuristic vocabulary in order to enlarge and thereby improve it, or by allowing the probabilities of the words in that vocabulary to be adjusted. The filter then updates its statistics and returns to standby until the user selects one of the two options described.
A person skilled in the art could introduce changes and modifications to the embodiments described without departing from the scope of the invention as defined in the appended claims.
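By way of illustration only, the three operating modes described earlier in this section (blind trust, supervised, and deliver-everything) can be sketched as a small dispatch function. The names `Mode` and `handle`, the bracketed tag format, and the use of `None` to stand for deletion are assumptions made for this sketch, not details taken from the description.

```python
from enum import Enum


class Mode(Enum):
    BLIND_TRUST = 1   # mode 1: spam is deleted from the server
    SUPERVISED = 2    # mode 2: spam is delivered with a tagged subject
    DELIVER_ALL = 3   # mode 3: everything is delivered unchanged


def handle(subject: str, is_spam: bool, mode: Mode,
           tag: str = "this email is spam"):
    """Return the subject to deliver, or None if the email is deleted."""
    if not is_spam or mode is Mode.DELIVER_ALL:
        return subject
    if mode is Mode.BLIND_TRUST:
        return None
    # Supervised mode: prepend the (user-configurable) tag to the subject.
    return f"[{tag}] {subject}"
```

Emails the filter does not consider spam pass through unchanged in every mode; only the treatment of detected spam differs.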

Claims

1. Method for classifying email messages into wanted mail and unwanted mail, or spam, of the type based on the use of a probabilistic filter to perform said classification, characterized in that it comprises, at least prior to the start of said classification, generating a group of heuristic patterns and heuristic words from the analysis of a series or corpus of historical emails, said heuristic words forming at least one initial heuristic vocabulary, using at least some morphological criteria to perform said analysis, and including said heuristic patterns, which relate said morphological criteria to the heuristic words, in a database, said database being supplied together with said at least one initial heuristic vocabulary to said probabilistic filter so that the filter can begin to operate.
2. Method according to claim 1, characterized in that said probabilistic filter is a filter based on the Naïve Bayes method, or Bayesian filter.
3. Method according to claim 1 or 2, characterized in that it comprises improving the classification capability of said probabilistic filter through an interaction with the user whereby the user indicates to the probabilistic filter which emails it has classified correctly and/or which emails it has classified incorrectly, each time the filter classifies emails or periodically.
4. Method according to claim 3, characterized in that it comprises progressively updating and enlarging the at least one initial heuristic vocabulary supplied to said filter, initially formed only of heuristic words, during the use of said probabilistic filter on the emails classified by the filter, by analyzing the emails incorrectly classified by said probabilistic filter and adding to the heuristic vocabulary one or more words obtained from said analysis.
5. Method according to claim 4, characterized in that said words to be included in the heuristic vocabulary are obtained by analyzing the words contained in said incorrectly classified emails, also using at least said morphological criteria.
6. Method according to claim 4 or 5, characterized in that it comprises carrying out said update of said heuristic vocabulary at least in part automatically, analyzing each email, immediately or at programmed intervals, after it has been classified by the probabilistic filter and validated by the user.
7. Method according to claim 4 or 5, characterized in that it comprises carrying out said update of said heuristic vocabulary at least in part at the request of a user, analyzing only a series of emails indicated by the user.
8. Method according to claim 1, characterized in that said morphological criteria refer to at least one of the following: the length of each word analyzed, the order or sequence and type of the characters it contains, the adjacency of characters, or a combination thereof.
9. Method according to claim 8, characterized in that each of said characters is at least one of the group consisting of: consonants, vowels, numbers and symbols.
10. Method according to claim 9, characterized in that said morphological criteria refer to the morphology of words for a plurality of different languages.
11. Method according to claim 1, characterized in that it comprises using a finite-state automaton to generate said group of heuristic patterns and heuristic words, carrying out said analysis according to one or more morphological criteria in order to screen the words supplied to said automaton, separating them into valid words and invalid words according to the morphological criterion used by the automaton.
12. Method according to claim 11, characterized in that it comprises building and training said finite-state automaton by supplying it with words extracted from emails labeled as valid, the automaton, once trained, being able to recognize as valid those words that are well formed with respect to the sequence and type of their constituent characters, the sequence and type of the characters contained in the words being the morphological criterion used by the automaton.
13. Method according to claim 12, characterized in that said building of the automaton from the words of said emails labeled as valid is carried out using an adapted ECGI algorithm.
14. Method according to claim 12, characterized in that it comprises supplying said automaton with words extracted from emails labeled as invalid, and classifying as invalid words those not recognized as valid by the automaton.
15. Method according to claim 14, characterized in that it comprises analyzing, according to one or more of said morphological criteria, said words classified by the automaton as invalid, and adding at least one heuristic word to said heuristic vocabulary, using at least one of said morphological criteria.
16. Method according to claim 15, characterized in that it comprises, for some or all of said heuristic words, obtaining at least one heuristic pattern per heuristic word, each pattern relating a heuristic word to at least one different morphological criterion, and recording the heuristic patterns obtained in said database.
17. Method according to claim 15, characterized in that it comprises obtaining one and the same heuristic word from said analyses of one or more invalid words, using two or more different morphological criteria.
18. Method according to claim 17, characterized in that it comprises, for said heuristic word obtained on the basis of said two or more morphological criteria, obtaining two or more heuristic patterns, each pattern relating the heuristic word to a respective one of said at least two morphological criteria, and recording the heuristic patterns obtained in said database.
19. Method according to claim 16, characterized in that it comprises also recording in said database said morphological criteria related, by means of said heuristic patterns, to the heuristic words obtained from them.
20. Method according to claim 12 or 14, characterized in that at least one of said labelings is carried out by a user.
21. Method according to claim 1 or 16, characterized in that it comprises assigning to and associating with each of said heuristic words included in said heuristic vocabulary a probability that the presence in an email of words meeting the morphological criteria referenced by said heuristic words is indicative that the email is unwanted mail, or has a high probability of being spam.
22. Method according to claim 21, characterized in that it comprises using said probabilistic filter to consult the words of said heuristic vocabulary and the heuristic patterns of said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search the emails to be classified for words meeting one or more of said morphological criteria.
23. Method according to claim 21, characterized in that it comprises progressively updating said probabilities of said heuristic words according to their influence on the result of the analyses of the emails classified by said probabilistic filter, increasing said probabilities if the influence of the associated heuristic word has led to a correct classification, or vice versa.
24. Method according to claim 22 or 23, characterized in that it comprises assigning, to each word found as a result of said search by the probabilistic filter that meets the morphological criterion related by the first heuristic pattern consulted in the database, the heuristic word related by said first pattern and the probabilities assigned to it in the heuristic vocabulary.
25. Method according to claim 24, characterized in that said heuristic patterns are ordered in the database according to the degree of spam probability assigned to the respective heuristic words they relate, said consultation by the probabilistic filter being carried out in that order.
26. Method according to claim 4, characterized in that the analysis of the incorrectly classified emails comprises analyzing, by means of the Bayesian filter, each of the words of said emails.
27. Method according to claim 2, characterized in that it comprises the selection by a user of the operating mode of said Bayesian filter from among the following three modes, the action taken on received emails depending on the selection:
- mode 1, or blind trust mode, selection of which causes the emails that the filter identifies as spam not to be delivered to the user;
- mode 2, or supervised mode, selection of which causes the emails that the filter detects as spam to be delivered to the user with a character string, or label, included in the subject field of the email; and
- mode 3, selection of which causes all emails to be delivered to the user, regardless of whether or not they are spam.
28. Method according to claim 27, characterized in that, when the Bayesian filter operates in said mode 2, or supervised mode, it comprises the user choosing which character string, or label, to include in said subject field of the email.
29. Method according to claim 2, characterized in that it comprises using said Bayesian filter to analyze the received emails on one or more of the following fields separately, according to a corresponding selection by the user: sender, subject and body, and, where more than one field has been selected, combining the results of all the analyses performed to obtain a final result.
30. Method according to claim 2, characterized in that it is applicable regardless of the mail client used.
31. Method according to claim 3, characterized in that it comprises adjusting the error bias of said probabilistic filter for emails incorrectly classified as spam or as non-spam, by means of the modification by the user of a given parameter.
32. Method according to claim 29, characterized in that it comprises, where more than one field to be analyzed has been selected, analyzing the subject and body fields using a single common initial heuristic vocabulary.
33. Method according to claim 29, characterized in that it comprises, where more than one field to be analyzed has been selected, analyzing the subject and body fields using two respective initial heuristic vocabularies.
34. Method according to claim 29, 32 or 33, characterized in that it comprises, where more than one field to be analyzed has been selected, analyzing the sender field using an initially empty vocabulary, and updating said initially empty vocabulary by adding words relating to email addresses obtained from the respective sender fields of emails validated by the user as non-spam.
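As an informal illustration of the morphological criteria named in claims 8 and 9 (word length, sequence and type of characters, and adjacency of characters), the following Python sketch maps each word to a string of character classes. The class letters V, C, N and S, the vowel set, and the helper names `char_class`, `signature` and `max_run` are assumptions chosen for this sketch; the claims do not prescribe any particular encoding.

```python
VOWELS = set("aeiouáéíóúü")


def char_class(ch: str) -> str:
    """Map a character to V (vowel), C (consonant), N (number) or S (symbol)."""
    ch = ch.lower()
    if ch in VOWELS:
        return "V"
    if ch.isalpha():
        return "C"
    if ch.isdigit():
        return "N"
    return "S"


def signature(word: str) -> str:
    """Sequence of character classes; its length equals the word length."""
    return "".join(char_class(ch) for ch in word)


def max_run(sig: str, cls: str) -> int:
    """Longest run of adjacent characters of class `cls` in a signature."""
    best = run = 0
    for c in sig:
        run = run + 1 if c == cls else 0
        best = max(best, run)
    return best
```

For example, `signature("v1agra")` yields `"CNVCCV"`: a digit embedded between letters is the kind of irregular sequence that a length, sequence or adjacency criterion could flag.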
PCT/ES2007/070026 2006-02-15 2007-02-05 Method for sorting e-mail messages into wanted mail and unwanted mail WO2007093661A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ESP200600349 2006-02-15
ES200600349 2006-02-15

Publications (1)

Publication Number Publication Date
WO2007093661A1 (en) 2007-08-23

Family

ID=38371213

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/ES2007/070026 WO2007093661A1 (en) 2006-02-15 2007-02-05 Method for sorting e-mail messages into wanted mail and unwanted mail

Country Status (1)

Country Link
WO (1) WO2007093661A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6161130A (en) * 1998-06-23 2000-12-12 Microsoft Corporation Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set
US20040260776A1 (en) * 2003-06-23 2004-12-23 Starbuck Bryan T. Advanced spam detection techniques
US20050102366A1 (en) * 2003-11-07 2005-05-12 Kirsch Steven T. E-mail filter employing adaptive ruleset
US20050192992A1 (en) * 2004-03-01 2005-09-01 Microsoft Corporation Systems and methods that determine intent of data and respond to the data based on the intent
US20060015561A1 (en) * 2004-06-29 2006-01-19 Microsoft Corporation Incremental anti-spam lookup and update service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DEL CASTILLO M.D. ET AL.: "An interactive hybrid system for identifiying and filtering unsolicited email", PROCEEDINGS. THE 2005 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE. IEEE COMPUTER. SOC. LOS ALAMITOS, NJ, pages 814 - 815, XP010841833 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07712571

Country of ref document: EP

Kind code of ref document: A1