WO2007093661A1

WO2007093661A1 - Method for sorting e-mail messages into wanted mail and unwanted mail

Info

Publication number: WO2007093661A1
Application number: PCT/ES2007/070026
Authority: WO
Inventors: Mª DOLORES del CASTILLO SOBRINO; José Ignacio SERRANO MORENO; Salvador Ros Torrecillas
Original assignee: Consejo Superior De Investigaciones Científicas
Priority date: 2006-02-15
Filing date: 2007-02-05
Publication date: 2007-08-23

Abstract

The invention relates to a method for sorting e-mail messages into wanted mail and unwanted mail. According to the invention, at least prior to the start of the sorting process which is performed by a Bayesian filter, a group of heuristic patterns and heuristic words is generated, independently of the filter, from an analysis of old mails, said heuristic words forming an initial heuristic vocabulary, with morphological criteria being used to perform the analysis, including the heuristic patterns which associate the morphological criteria with the heuristic words in a database. Said database is supplied together with the initial heuristic vocabulary to the Bayesian filter. The method consists in progressively increasing the initial heuristic vocabulary during the use of the filter, using an intelligent management system which guarantees very effective and efficient sorting, thereby preventing the vocabulary from growing excessively with irrelevant words and the sorting time from increasing unnecessarily.

Description

Title

METHOD FOR CLASSIFYING EMAIL MESSAGES INTO DESIRED MAIL AND UNWANTED MAIL

Technical sector

The present invention relates in general to a method for classifying email messages into desired mail and unwanted mail, or spam, by using a probabilistic filter, and in particular to a method of carrying out said classification using a Bayesian filter without the need for prior training.

State of the prior art

The massive, automated sending of unwanted mail, or spam, is a major problem today, causing great inconvenience to the users of email clients as well as to the servers that host their email accounts.

The fundamental objective of a classification system, and in particular of an email classifier, is twofold: on the one hand, to identify unwanted emails and filter them, that is, to obtain good coverage results, and, on the other hand, to avoid the incorrect classification of valid emails. This last objective takes priority, since assigning the spam category to a valid email is a more serious error than failing to identify an invalid email as spam, owing to the possible loss of useful information.

The use of probabilistic filters, especially those based on the Naïve Bayes method, or Bayesian filters, as anti-spam filters is known in the state of the art, since the results obtained with them are quite acceptable in comparison with other filters used previously.

These filters, or mail identification and filtering systems based on the Naïve Bayes classifier, start from the hypothesis of the statistical independence of the words in a text belonging to a specific thematic category. Normally, most Bayesian filters developed for email learn from a corpus of historical emails classified in two categories, spam and non-spam, from which they obtain an extensive vocabulary that is common to all users of the filter and that, if not updated, may become obsolete in a short period of time.

An example of the use of one of said filters is that proposed by US-A-6161130, which concerns a message classification method that uses a probabilistic classifier, such as a Bayesian classifier, capable of recognizing a series of characteristics likely to be present in incoming messages. To this end the filter is first trained with a training set formed by a plurality of messages that include said characteristics; the set is updated based on the result of the analysis of each message, and the filter is re-trained with the updated set.

These characteristics refer to attributes of, for example, the message format, such as whether a word in the message is capitalized or whether the message contains a series of punctuation marks.

Patent application US2005/0192992 proposes a system and a method to determine the "purpose" of incoming data, for example to check whether they are spam, which use as a classifier, for example, a Bayesian model. Various criteria are proposed for analyzing the messages: heuristics and inference, as well as extracting syntactic, linguistic and other characteristics from the analyzed messages. For the embodiment that contemplates the use of a Bayesian filter, the filter is trained from a set of data generated manually or automatically and used during a training phase of the Bayesian filter.

Both of the aforementioned antecedents share the drawback described above of requiring prior training of the Bayesian filter, from which they obtain an extensive vocabulary. Although application US2005/0192992 contemplates updating the so-called data set, this still implies using the Bayesian filter during said training phase to obtain at least an initial vocabulary with which to start classifying incoming messages in real time with minimum guarantees of reliability.

Explanation of the invention

It seems necessary to offer an alternative to the state of the art that allows the creation of an initial vocabulary with which a probabilistic filter, such as a Bayesian filter, can start working from the beginning to identify and filter unwanted incoming emails, or spam, without the need to conduct any training on previously classified historical emails.

The present invention concerns a method for classifying email messages into desired mail and unwanted mail, or spam, of the type based on the use of a probabilistic filter to perform said classification, in particular a filter based on the Naïve Bayes method, or Bayesian filter.

The method comprises generating, without the intervention of the Bayesian filter, that is, outside the filter, and prior to the beginning of said classification, a group of heuristic patterns and heuristic words from the analysis of a series or corpus of historical mails, said heuristic words forming an initial heuristic vocabulary. At least morphological criteria are used to perform said analysis, and said heuristic patterns, which relate said morphological criteria to the heuristic words, are included in a database. Said database is supplied together with the initial heuristic vocabulary to said probabilistic filter, which is then able to start working.

Although several examples of heuristic words and morphological criteria will be presented in a later section, the heuristic word "Nosense" and the related morphological criterion "consecutive consonants >= 4" serve as an example to support the description in the following paragraphs. For this example, the heuristic word "Nosense" would be generated from the aforementioned analysis of historical mails using the aforementioned criterion, that is, by finding words that have four or more consecutive consonants, such as the word "ffffffutbol".

The method proposed by the present invention comprises assigning to each of said heuristic words included in said initial heuristic vocabulary a probability that the presence in a mail of words meeting the morphological criteria referenced by said heuristic word indicates that the mail is unwanted, that is, a spam probability. For example, if words with "consecutive consonants >= 4" ("ffffffutbol" and others) are found in many historical emails that turn out to be spam, the related heuristic word "Nosense" will be assigned a very high spam probability in the heuristic vocabulary.
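The assignment of a spam probability to a heuristic word can be illustrated with a minimal sketch. The description does not state which estimator is used, so the one below (the fraction of historical mails containing a matching word that are spam) is only an assumption, as are the function names and the ASCII-only consonant test.

```python
import re

# Assumption: an ASCII-only stand-in for the criterion
# "consecutive consonants >= 4" linked to the heuristic word "Nosense".
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def meets_criterion(word):
    """True if the word contains four or more consecutive consonants."""
    return CONSONANT_RUN.search(word) is not None


def estimate_spam_probability(corpus):
    """Hypothetical estimator for the spam probability of "Nosense".

    corpus: iterable of (text, is_spam) pairs for historical mails.
    Returns the fraction of mails containing at least one matching word
    that are spam, or None if no mail contains a matching word.
    """
    matching = 0
    spam_matching = 0
    for text, is_spam in corpus:
        if any(meets_criterion(word) for word in text.split()):
            matching += 1
            if is_spam:
                spam_matching += 1
    if matching == 0:
        return None
    return spam_matching / matching
```

With a corpus in which most mails containing such words are spam, the estimate approaches 1, which corresponds to the "very high probability of spam" mentioned in the text.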

Once the mentioned group of heuristic patterns and the initial heuristic vocabulary have been generated, the proposed method includes using the Bayesian filter to consult the heuristic patterns of said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search the mails to be classified for words that meet one or more of said morphological criteria, to assign to each word found the heuristic word related to the criteria met by that word, and to consult the heuristic vocabulary to obtain the probabilities assigned there to said heuristic words.

For an exemplary embodiment, said assignment of the heuristic word to each word found is carried out by directly substituting, in the mail to be classified, the word in question with the heuristic word, for example the word "ffffffutbol" with "Nosense", every time it appears in the mail, after which the probabilistic analysis of the mail proceeds taking the word "Nosense" as reference.
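The substitution step can be sketched as follows. This is a minimal illustration restricted to the single criterion "consecutive consonants >= 4"; the whitespace tokenization, the ASCII-only consonant test and the function name are assumptions made for the example.

```python
import re

# Assumption: ASCII-only test for four or more consecutive consonants.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def substitute_heuristic_words(text):
    """Replace every word meeting the criterion with the heuristic word."""
    replaced = []
    for word in text.split():
        if CONSONANT_RUN.search(word):
            replaced.append("Nosense")
        else:
            replaced.append(word)
    return " ".join(replaced)
```

After the substitution, every occurrence of a malformed word contributes to the statistics of one shared token, so the filter does not need a separate vocabulary entry for each variant spelling.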

For a preferred embodiment, even if a word meets two or more different morphological criteria (for example, not only the aforementioned "consecutive consonants >= 4"), the method comprises using the Bayesian filter to search, in the mails to be classified, for words that meet the morphological criterion related by a first heuristic pattern of the database consulted, and to assign the heuristic word related by said first pattern together with the probabilities assigned to it in the heuristic vocabulary. For a preferred embodiment, said heuristic patterns are ordered in the database according to the degree of spam probability assigned to the respective heuristic words they relate, and the query is performed by the Bayesian filter following said order. Thus, if a word complies with said first pattern, which is associated with a spam probability greater than that of a possible second or third pattern (according to the order defined above), it is not necessary to verify whether said word also complies with those further patterns.

The proposed method also includes progressively updating the probabilities of said heuristic words based on their influence on the result of the analysis of the mails classified by said probabilistic filter, increasing said probabilities if the influence of the associated heuristic word has led to a correct classification, and decreasing them otherwise. In other words, if, for example, a series of emails containing the word "ffffffutbol" one or more times had been erroneously cataloged as unwanted, it would be deduced that finding that word in an email does not imply that the email is probably unwanted, so the probability associated with the heuristic word "Nosense" would be reduced.

To adapt to the evolution of emails and thus maintain its level of effectiveness, a filter must itself evolve. For this, the proposed method includes using the filter so that it gradually learns while classifying.
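The ordered lookup and the probability update described above can be sketched as follows. The probability values, the fixed update step and all names are hypothetical; the description only states that patterns are consulted in decreasing order of spam probability and that probabilities rise after correct classifications and fall otherwise.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

CONSONANTS_4 = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)
VOWELS_4 = re.compile(r"[aeiou]{4,}", re.IGNORECASE)


@dataclass
class Pattern:
    heuristic_word: str
    criterion: Callable[[str], bool]


# Hypothetical spam probabilities held in the heuristic vocabulary.
spam_prob: Dict[str, float] = {"Nosense": 0.9, "Minsize": 0.6}

# Patterns sorted by the spam probability of their heuristic word.
patterns: List[Pattern] = sorted(
    [
        Pattern("Minsize", lambda w: len(w) <= 1),
        Pattern("Nosense", lambda w: CONSONANTS_4.search(w) is not None),
        Pattern("Nosense", lambda w: VOWELS_4.search(w) is not None),
    ],
    key=lambda p: spam_prob[p.heuristic_word],
    reverse=True,
)


def first_match(word: str) -> Optional[str]:
    """Return the heuristic word of the first pattern the word meets."""
    for pattern in patterns:
        if pattern.criterion(word):
            return pattern.heuristic_word  # later patterns are not checked
    return None


def update(heuristic_word: str, was_correct: bool, step: float = 0.05) -> None:
    """Raise the probability after a correct classification, lower it otherwise."""
    delta = step if was_correct else -step
    new_value = spam_prob[heuristic_word] + delta
    spam_prob[heuristic_word] = min(0.99, max(0.01, new_value))
```

Stopping at the first match keeps the per-word cost low, and because the list is sorted by spam probability the first match is also the most incriminating one.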

To this end, the proposed method includes improving the classification capacity of said Bayesian filter through an interaction with the user, by which the user indicates to the Bayesian filter which emails have been correctly classified and/or which emails have been incorrectly classified, either each time the filter classifies emails or periodically.

As regards the aforementioned adaptation to the evolution of emails, the method includes progressively updating and increasing said initial heuristic vocabulary supplied to said filter, initially formed only by heuristic words, during the use of said Bayesian filter. For a preferred embodiment, the mails incorrectly classified by said Bayesian filter are analyzed and one or more words, heuristic and non-heuristic, obtained from said analysis are added to the initial heuristic vocabulary. Preferably, said heuristic and non-heuristic words are obtained by analyzing the words contained in said incorrectly classified mails, also using at least said morphological criteria; this analysis is preferably performed by the Bayesian filter itself, and preferably on each of the words of said emails.

For an exemplary embodiment, the method comprises carrying out said updating of said heuristic vocabulary at least in part automatically, analyzing each email, either immediately or after a programmed period of time, once it has been classified by the probabilistic filter and validated by the user. For another embodiment, the method comprises carrying out said updating of said heuristic vocabulary at least in part at the request of a user, analyzing only a series of emails indicated by the user.

The aforementioned morphological criteria refer to at least one of the following: length of each word analyzed, order or sequence and type of the characters contained, adjacency of characters, or a combination thereof, said characters being consonants, vowels, numbers and/or symbols.

In the section referring to the explanation of some examples of embodiment, it will be described how to use said morphological criteria by means of the application of the proposed method to carry out said analysis, for some examples of embodiment.

In order to be able to filter unwanted emails in a multitude of languages, the proposed method comprises using morphological criteria that refer to the morphology of words for a plurality of different languages, so that the heuristic vocabulary described represents the words considered invalid because their morphology, that is, the way they are structured, is incorrect in any language.

Brief description of the drawings

The foregoing and other advantages and features will be more fully understood from the following detailed description of some examples of embodiment, some of which with reference to the attached drawing, which should be taken by way of illustration and not limitation, in which:

Fig. 1 is a flow chart illustrating a series of actions to be performed according to an example of embodiment of the method proposed by the present invention.

Detailed description of some embodiments

For a preferred embodiment, in order to obtain said initial heuristic vocabulary, the method comprises using a finite state automaton in generating said group of heuristic patterns and heuristic words, the automaton screening the words supplied to it and separating them into valid words and invalid words according to the morphological criterion or criteria it uses.

In the first place, the method comprises building and training said finite state automaton by supplying it with words extracted from emails labeled as valid, in general by a user. Once trained, said automaton is capable of recognizing as valid those words that are well formed in terms of the sequence and type of the characters that compose them, the sequence and type of characters contained in the words being the morphological criteria used by the automaton for the present preferred embodiment.

Once the automaton has been built and trained, the method includes supplying it with words extracted from emails labeled as invalid, in general by a user, and classifying as invalid words those not recognized as valid by the automaton. This produces the aforementioned sieve, so that later only the words not recognized by the automaton are submitted to the analysis described above, which leads to the corresponding heuristic words included in the mentioned initial heuristic vocabulary.

Therefore, the finite state automaton used in the proposed method is adapted, once trained, to automatically identify correctly formed words (or tokens) and differentiate them from those that are poorly formed, morphologically speaking. In other words, the automaton is able to recognize the grammar that describes well-formed words, understood as character sequences, for a plurality of different languages.
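The train-then-sieve procedure can be illustrated with a deliberately simplified acceptor. It is not the grammatical inference algorithm used by the invention: as a stand-in, it accepts a word only if every transition between character classes (consonant, vowel, number, symbol), given the two preceding classes, was observed in the valid training words. The class names, the context length of two and the training words are all assumptions.

```python
def char_class(ch):
    """Map a character to c (consonant), v (vowel), n (number) or s (symbol)."""
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "v" if ch.lower() in "aeiou" else "c"
    return "s"


class WordAcceptor:
    """Simplified stand-in for the trained finite state automaton."""

    def __init__(self, order=2):
        self.order = order          # number of preceding classes used as state
        self.transitions = set()    # observed (state, next symbol) pairs

    def _steps(self, word):
        symbols = ["^"] * self.order + [char_class(c) for c in word] + ["$"]
        for i in range(self.order, len(symbols)):
            yield tuple(symbols[i - self.order:i]), symbols[i]

    def train(self, valid_words):
        for word in valid_words:
            self.transitions.update(self._steps(word))

    def accepts(self, word):
        return all(step in self.transitions for step in self._steps(word))


def sieve(acceptor, words):
    """Return the words the acceptor does not recognize as well formed."""
    return [word for word in words if not acceptor.accepts(word)]
```

A possible use: train the acceptor on words from mails labeled as valid, then pass the words of mails labeled as invalid through `sieve`; only the rejected words go on to the morphological analysis that produces heuristic words.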

As noted above, these characters can be a consonant ('c'), a vowel ('v'), a number ('n') or a symbol ('s'). The automaton is constructed from the words of said emails labeled as valid by means of an adapted ECGI grammatical inference algorithm. A set of examples of different well-formed words, taken from a corpus of valid emails, is supplied to the automaton during its training, so that it can serve as a reference for recognizing morphologically well-formed words.

Thus, for example, the valid word "científico-técnico" ("scientific-technical"), presented as the string of terms "c v v c c v c v c v s c v c c v c v", will be recognized by the automaton once it has been trained. Each time the automaton recognizes a word, the word is considered to be well formed.
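The encoding of a word as a string of character classes can be reproduced with a short sketch. The 18-term string quoted above corresponds letter for letter to the Spanish compound "científico-técnico", so that word is used here; treating accented vowels as vowels and everything that is neither a letter nor a digit as a symbol are assumptions of this example.

```python
import unicodedata

VOWELS = set("aeiou")


def char_class(ch):
    """Return c, v, n or s for an unaccented character."""
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "v" if ch.lower() in VOWELS else "c"
    return "s"


def encode(word):
    """Encode a word as its sequence of character classes."""
    classes = []
    # Decompose accented letters so that, for example, "í" counts as "i".
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            continue  # skip the detached accent mark
        classes.append(char_class(ch))
    return "".join(classes)
```

Working on the class string rather than on the letters themselves is what lets a single automaton cover words from several languages.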

Words taken from spam emails, such as "v1@gra", represented by the string "cnsccv", are not recognized by the automaton as valid words and are therefore identified as invalid words. An automaton that recognizes well-formed words is preferred over one that recognizes malformed words because well-formed words are more uniform and homogeneous, so fewer examples are needed to construct a grammar that represents them fairly completely.

The following shows, by way of example, a series of heuristic words generated by the application of the proposed method, together with the criterion or criteria to which they are related by one or more heuristic patterns.

Once the automaton has performed the aforementioned sieve, the method comprises analyzing, according to one or more of said morphological criteria, the words classified by the automaton as invalid, and adding at least one heuristic word to the heuristic vocabulary. This may be done using a single morphological criterion, as is the case, for example, of the heuristic word "Minsize", related to the single criterion "Words of length <= 1"; or using more than one morphological criterion, as is the case of the heuristic word "Nosense", which can be obtained from any of the five criteria set forth, such as "Consecutive consonants >= 4" and "Consecutive vowels >= 4", that is, the same heuristic word can be obtained from more than one criterion or, put another way, by different paths; or using a compound morphological criterion, as is the case of the "LNS Metaword", which is obtained from the morphological criterion "Letter and/or Numbers and/or Consecutive symbols in any order", represented by the language described in Backus-Naur form as {L, N, S}+, and which is instantiated by different heuristic words such as, for example, "LNS", "SN" or "NSLN".

For an exemplary embodiment, the method comprises composing all the morphological criteria before they are used to carry out said analysis, without modifying them at any time. For another embodiment, the method comprises composing beforehand only one or more initial morphological criteria, and modifying them, or creating new morphological criteria, while said analysis is being performed and depending on the results obtained with it.

The method also comprises, for some or all of said heuristic words, obtaining at least one heuristic pattern per heuristic word, each pattern relating a heuristic word to at least one different morphological criterion, as is the case, for example, of the heuristic pattern that relates the heuristic word "Minsize" to the single morphological criterion "Words of length <= 1".

For another embodiment, the method comprises, for some or all of said invalid words analyzed, obtaining at least one heuristic pattern per invalid word analyzed, each pattern for a different morphological criterion, as is the case, for example, of the five possible relationships, represented by five corresponding heuristic patterns, between the heuristic word "Nosense" and the five morphological criteria set forth, such as "Consecutive consonants >= 4" and "Consecutive vowels >= 4". Continuing with the example of the heuristic word "Nosense", for another embodiment said five relationships could be represented by a number of heuristic patterns not necessarily equal to five.
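How the LNS Metaword is instantiated by heuristic words such as "LNS", "SN" or "NSLN" can be sketched by collapsing each run of letters, numbers or symbols into a single symbol. The function names, the sample tokens and the decision to return no heuristic word for a token made of a single character type are assumptions of this example, not details given in the description.

```python
from itertools import groupby


def char_type(ch):
    """Return L for a letter, N for a digit and S for any other character."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def lns_signature(token):
    """Collapse runs of the same character type into one symbol each.

    Returns a string over {L, N, S}, or None when the token contains
    fewer than two runs (assumed not to instantiate the metaword).
    """
    signature = "".join(kind for kind, _ in groupby(token, key=char_type))
    return signature if len(signature) > 1 else None
```

For instance, a token made of letters, one digit and more letters yields the signature "LNL", while a currency symbol followed by digits yields "SN".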

For an exemplary embodiment, the method comprises registering the heuristic patterns obtained in said database. For another embodiment, the method comprises registering said morphological criteria related to said heuristic patterns in said database with the heuristic words obtained from them.

The proposed method comprises using the Bayesian filter to analyze the emails received on one or more of the following fields separately, based on a corresponding selection by the user: sender, subject and body. For an embodiment in which more than one field is selected, the results of all the analyses performed are combined, in a weighted way, to obtain a final result.

Because an email message contains said parts, or fields, referring to subject, sender and body, the method comprises, for one embodiment, generating a single initial heuristic vocabulary common to the subject and body parts and valid for both, or, for another embodiment, generating a separate and different heuristic vocabulary for each of said parts.

Following the example of heuristic words set forth above, the words "Nosense", "Minsize", "Cns", "Raresymbs", "Numbers" and "LNS Metaword" are those which, for the present embodiment, make up the initial heuristic vocabulary (with their corresponding spam probabilities) supplied to the Bayesian filter for the subject and body parts, together with the database of heuristic patterns that relate them to the morphological criteria. This is the vocabulary initially, that is, before the filter has classified any mail. As explained in a previous section, during its lifetime this vocabulary will contain those same initial heuristic words in addition to other new words incorporated as a result of the analysis of mails incorrectly classified by the Bayesian filter; such words may be new heuristic words or non-heuristic words.
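The weighted combination of the per-field results can be sketched as a weighted average of per-field spam probabilities. The description does not give the weights or the combination formula, so both the averaging and the weight values below are assumptions.

```python
def combine_fields(field_scores, weights):
    """Weighted average of per-field spam probabilities.

    field_scores: spam probability for each field the user selected.
    weights: hypothetical relative weight of each field.
    """
    total_weight = sum(weights[field] for field in field_scores)
    weighted_sum = sum(
        field_scores[field] * weights[field] for field in field_scores
    )
    return weighted_sum / total_weight


# Hypothetical weights: the body counts three times as much as the others.
WEIGHTS = {"sender": 1.0, "subject": 1.0, "body": 3.0}
```

Only the fields present in `field_scores` take part in the average, which mirrors the user's ability to select which fields are analyzed.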

An example of such non-heuristic words is the word "transplant" which has only two consonants, so it does not meet any of the morphological criteria related to the heuristic words, but which the filter can extract from an erroneously classified mail and add to the cited heuristic vocabulary.

For example, when the filter receives the first mail to be classified, and although it preferably analyzes each part of the mail separately, consider the part or field corresponding to the subject. For each word it finds, the filter first checks whether the word is in the heuristic vocabulary, either the common one or the one associated with that part, that is, the subject field, depending on the embodiment. Suppose the subject field has the following content: "Do you need a heart transplant?". Because the filter only has in its vocabulary (the heuristic vocabulary that has been supplied) the heuristic words Nosense, Minsize, Cns, Raresymbs, Numbers and LNS Metaword, it does not have the words "you need", "transplant" and "heart", so the next step the filter takes is to analyze whether each of these three words meets any criterion referenced by a corresponding heuristic pattern of the database. In this example, only the word "heart" would meet the morphological criterion "Letter and/or Numbers and/or Consecutive symbols in any order", and the heuristic word "LNL" would be assigned to it. The filter knows nothing about the other two words and therefore calculates the SPAM and NO-SPAM probabilities of the subject, like any Bayesian filter, taking into account only the probabilities assigned a priori to the LNS Metaword. If the mail is classified in one of the two categories and the user approves the classification, the vocabulary is not extended with new words; only the probabilities of the words it already contains are updated, since their use has led to a correct classification.

If, on the contrary, the user considers that the classification has been incorrect (the filter classified the mail as SPAM when it is NO-SPAM, or as NO-SPAM when it is SPAM), the filter incorporates into the subject vocabulary (either its own or the one common to the rest of the fields) the words "you need" and "transplant" with the corresponding SPAM and NO-SPAM probabilities. In this way, the vocabulary of the subject part, or the one common to the subject and body parts if that is the case, would then be formed by {Nosense, Minsize, Cns, Raresymbs, Numbers, LNS Metaword, you need, transplant, LNL}.

Obviously the word "transplant" is invalid and is intended to be misleading, but since it does not meet any of the morphological criteria related by heuristic patterns, it goes into the heuristic vocabulary as it is. Adopting a morphological pattern or criterion based on two consecutive consonants would generate many false positives, since there are countless valid words with two or three consonants in a row (constipated, absolute, inspire, etc.).

Said extended vocabulary, formed by the initial heuristic words plus the heuristic and non-heuristic words added, will be the one that the Bayesian filter uses in later analyses. The more the vocabulary grows, the greater the precision of the filter, provided that the growth is not excessive, since this would lengthen the classification time and therefore lower the efficiency of the filter.

The present invention ensures, as will be explained later, that said vocabulary does not grow in vain, that is, that the words incorporated into it are relevant for later use in the classifications carried out by the Bayesian filter, thus discarding the words that would contribute uselessly to vocabulary growth, increasing the mentioned classification time while contributing practically nothing to the discriminatory potential of the filter.
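The probability calculation in this example can be sketched with a generic naive Bayes posterior in which words absent from the vocabulary are simply ignored. This is a textbook formulation, not necessarily the exact computation of the filter described; the vocabulary layout, the prior and the probability values are assumptions.

```python
import math


def spam_posterior(tokens, vocabulary, prior_spam=0.5):
    """Naive Bayes posterior probability of spam for a list of tokens.

    vocabulary maps a word to (P(word | spam), P(word | non-spam)).
    Tokens not present in the vocabulary do not influence the result.
    """
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for token in tokens:
        if token in vocabulary:
            p_spam, p_ham = vocabulary[token]
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
    # Normalize in log space to avoid underflow on long mails.
    top = max(log_spam, log_ham)
    spam = math.exp(log_spam - top)
    ham = math.exp(log_ham - top)
    return spam / (spam + ham)


# Hypothetical a priori probabilities for the LNS Metaword.
VOCABULARY = {"LNS Metaword": (0.8, 0.2)}
```

With the subject of the example, the two unknown words drop out and the result is driven only by the entry for the LNS Metaword, as the text describes.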

To avoid the mentioned excessive increase in vocabulary, or vocabularies according to the example of embodiment, the method proposed by the present invention comprises carrying out an intelligent management thereof.

For the aforementioned embodiment in which each part or field is associated with its own vocabulary, said intelligent management is carried out for each of the aforementioned vocabularies, and is achieved through certain actions aimed at conveniently selecting the words to incorporate into each of these vocabularies, grouped into the following three groups:

1) Regarding correctly classified emails, which are removed from the filter domain, the vocabularies of the subject and body parts are not extended with the new words contained in these emails, since the reason for their correct classification lies in the words they contained that were already present in the vocabulary. Including new words from correctly classified mail in the vocabulary, as words with a higher probability of belonging to the assigned category (spam or non-spam), would increase the size of the vocabulary without any certainty that those words contribute relevant information, since they could belong to either of the two categories (spam and non-spam).

2) When the filter has to learn from classification errors, the vocabularies of the subject and body parts are updated with the words contained in the misclassified emails, whether false positives or false negatives, and the new words, both heuristic and non-heuristic, are assigned a corresponding probability of spam and non-spam set by the filter for new heuristic and non-heuristic words. If, in this case in which the filter learns from a misclassified email, the email contains invalid words that conform to some criterion referenced by a respective heuristic pattern, these words are not added to the vocabulary; only the probabilities associated with the heuristic word of the pattern that relates to the criterion met by the invalid word are updated.

3) In the sender part the initial heuristic vocabulary is empty, so the Bayesian filter begins to operate using said empty vocabulary to analyze the sender field of incoming mails, which produces a classification of said field as non-spam. The weighted classification made by the filter for the three parts of the emails yields their final classification, which may be spam or non-spam. If the user indicates to the filter that the classification has been correct, the proposed method includes storing in the vocabulary of the sender field only the senders of the non-spam emails, with a probability associated with the non-spam category higher than that associated with the spam category. When the filter has to learn because errors have occurred in the classification of emails, it stores only the email addresses, or senders, of the false positives, that is, of the emails that really are non-spam. The senders of the false negatives, that is, of the emails erroneously assigned to the non-spam category, are ignored, since unsolicited emails almost never come from the same sender and their inclusion would contribute unnecessarily to the growth of the vocabulary of this part.
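The three groups of actions can be condensed into a short sketch. The mail representation, the default probabilities for new words and senders, and the single heuristic criterion used are all hypothetical; the update of the probabilities of existing heuristic words is left out for brevity.

```python
import re

CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]{4,}", re.IGNORECASE)


def heuristic_word_for(word):
    """Assumed single pattern: four consecutive consonants -> "Nosense"."""
    return "Nosense" if CONSONANT_RUN.search(word) else None


def learn(mail, predicted_spam, actually_spam, text_vocab, sender_vocab,
          new_word_probs=(0.5, 0.5), trusted_sender_probs=(0.1, 0.9)):
    """Apply the vocabulary growth rules after the user validates a mail.

    mail: dict with a "sender" string and a "words" list.
    Probabilities are (spam, non-spam) pairs with hypothetical values.
    """
    misclassified = predicted_spam != actually_spam
    if misclassified:
        # Group 2: learn new words only from classification errors.
        for word in mail["words"]:
            if heuristic_word_for(word) is not None:
                continue  # covered by a pattern: the word itself is not added
            text_vocab.setdefault(word, new_word_probs)
    # Group 1: a correct classification adds no words to text_vocab.
    # Group 3: store the sender only when the mail really is non-spam.
    if not actually_spam:
        sender_vocab.setdefault(mail["sender"], trusted_sender_probs)
```

Under these rules the subject and body vocabularies grow only after errors, and the sender vocabulary only ever collects addresses of legitimate mail.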

That is to say, thanks to the intelligent management described, the initial heuristic vocabularies of the body and subject fields only increase with heuristic and non-heuristic words extracted from misclassified emails, whereas the vocabulary of the sender field, which is initially empty, only increases with the email addresses of emails that really are non-spam, regardless of whether they have been well or poorly classified. Action groups 1) and 2) are also applicable to the embodiment in which the subject and body fields share a single common vocabulary.

As regards the Bayesian filter itself, the proposed method comprises selecting, by a user, the operating mode of said Bayesian filter from among the following three modes, the selection determining which action is performed with the emails received:

- mode 1, or blind trust mode, whose selection causes emails that the filter identifies as spam not to be delivered to the user,

- mode 2, or supervised mode, whose selection causes the emails that the filter detects as spam to be delivered to the user including a character string, or label, in the field related to the subject of the mail, and

- mode 3, whose selection causes all emails to be delivered to the user, regardless of whether or not they are spam.

When the Bayesian filter operates in said mode 2, or supervised mode, the method comprises the user choosing which character string, or label, to include in said field concerning the subject of the mail.

The proposed method is applicable regardless of the mail client used, and includes adjusting the bias of the Bayesian filter's classification errors, towards spam or towards non-spam, by means of a certain parameter modified by the user.

The proposed method comprises showing a user at least one of the following elements, or a combination thereof:
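The effect of the three operating modes on delivery can be sketched as follows. The enumeration names, the default label and the function signature are hypothetical; the description only fixes the behaviour of each mode.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    BLIND_TRUST = 1   # mode 1: spam is not delivered
    SUPERVISED = 2    # mode 2: spam is delivered with a label in the subject
    DELIVER_ALL = 3   # mode 3: everything is delivered


def subject_to_deliver(subject: str, is_spam: bool, mode: Mode,
                       label: str = "[SPAM]") -> Optional[str]:
    """Return the subject as delivered, or None if the mail is withheld."""
    if not is_spam or mode is Mode.DELIVER_ALL:
        return subject
    if mode is Mode.BLIND_TRUST:
        return None
    return f"{label} {subject}"
```

In supervised mode the label is a parameter because the method lets the user choose the character string that marks suspected spam.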

- filter statistics, indicating the percentage of hits and errors in spam emails detected and not detected,

- a history or trace of the user's connections to the mail server, showing for each of them the date, time and number of emails received with the results of the filter, as well as possible specific connection errors,

- a list of the emails received along with the identification that the filter has made of each of them, on which the user can indicate to the filter its successes and mistakes so that it learns from them, it being possible to delete emails from the list so that the filter does not consider them again,

- a configuration panel where the user can adjust certain parameters of the filtering and connections, and

- an indicator whose selection by the user allows stopping or starting the filtering service at any time.

Fig. 1 shows a flow chart illustrating the use of the Bayesian filter, both to classify incoming mails and to learn from its successes and errors, through a series of actions carried out according to the method proposed by the present invention, representative of an exemplary embodiment. Some of these actions have already been described above; they are carried out by the Bayesian filter once the initial heuristic vocabulary and the heuristic patterns have been created.

The words included in said figure are to be understood as indications of said actions and, where appropriate, of the elements used by the proposed method, such as the following:

- Analysis data learned: refers both to the heuristic vocabulary and to the database that includes the heuristic patterns.

- Mailing list: refers to the index of the emails shown to the user, that is, the subject and sender fields of the emails.

- List and temporary emails: refers to the complete emails, that is, with the subject, sender and body fields.

In said Fig. 1, the user can first make changes to the configuration, for example by means of the aforementioned configuration panel shown to the user, in order to modify said parameters, which in Fig. 1 are stored in a data structure represented by a cylinder at the right of Fig. 1.

Then, after the filter has been waiting, two options are presented to the user, depending on whether the user wants to receive emails (left part of Fig. 1) or wants to have the filter learn (right part of Fig. 1). If the user wishes to receive emails, the mail client used by the user first connects to the filter, which acts as an intermediary or proxy server with respect to the incoming mail server, for example a POP (or POP3) server, to which the filter connects. Once the connection has been established, if the user does not wish to receive mail at that time, the filter returns to standby; if, on the contrary, the user wishes to receive mail, the filter requests the mail from the POP server, which sends it, after which the filter processes it by extracting its vocabulary (for some or all fields, depending on a selection by the user). The filter analyzes the parts (subject, sender and body) of the extracted vocabulary and, combining the results, determines whether or not the email is spam. Said analysis is carried out by consulting the heuristic vocabulary and the database generated and supplied to the filter according to the proposed method, as indicated in Fig. 1 by the dashed arrow from the cylinder with the legend "Analysis data learned" described above.
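The per-field analysis and the combination of its results can be sketched as follows. This is a minimal sketch under stated assumptions: the description does not give the combination formula, so the naive-Bayes product of word probabilities, the averaging across fields, the `threshold` parameter (standing in for the user-adjustable bias parameter) and all function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def field_spam_probability(words: Iterable[str],
                           vocab: Dict[str, float]) -> float:
    """Combine per-word spam probabilities naive-Bayes style, in log space.

    Words absent from the vocabulary are ignored; 0.5 (undecided) is
    returned when no word of the field is found in the vocabulary.
    """
    log_spam = 0.0
    log_ham = 0.0
    matched = False
    for word in words:
        p = vocab.get(word)
        if p is None:
            continue
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep the logarithms finite
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
        matched = True
    if not matched:
        return 0.5
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def classify(fields: Dict[str, List[str]],
             vocabs: Dict[str, Dict[str, float]],
             threshold: float = 0.5) -> bool:
    """Return True when the averaged field scores exceed the threshold.

    Averaging the selected fields is an assumption made for illustration.
    """
    scores = [field_spam_probability(words, vocabs.get(name, {}))
              for name, words in fields.items()]
    combined = sum(scores) / len(scores) if scores else 0.5
    return combined > threshold
```

Raising `threshold` in this sketch makes the filter more reluctant to mark an email as spam, which is one possible reading of adjusting the error bias.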

The filter temporarily saves the vocabulary extracted from the mail and adds an associated entry to the aforementioned temporary list, as shown by the corresponding dashed arrow in Fig. 1 directed to the element with the legend "list and received temporary emails". If the filter has detected that the mail received is spam, it acts according to the mode selected by the user: in mode 1, or "blind trust" mode, the filter deletes the mail from the server; in mode 3, the filter delivers the mail to the mail client; and in mode 2, or supervised mode, the filter includes a string in the mail subject, such as "this mail is spam".

As regards the option to have the filter learn (right part of Fig. 1), if it is selected by the user, then first of all, if the list of received mails (subject and sender), or index list, is empty, the filter remains on standby. If not, the user can select one or more emails from said index list, which selection implies that the user considers said selected emails to have been incorrectly classified by the filter, so that the filter learns from its mistakes. The emails that the user does not select are considered correctly classified, and are used by the filter to learn from its successes. These mails are in the above-described "list and received temporary mails", as indicated by the two dashed arrows coming from the element with said legend (see bottom right of Fig. 1). Both from the learning of the incorrectly classified mails and from that of the correctly classified ones, by means of the analysis of said emails by the filter, the data for subsequent analysis are updated, as indicated by the dashed arrow directed towards the cylinder with the legend "Analysis data learned".

As regards the update based on the analysis of the incorrectly classified emails, it consists of extracting the non-heuristic words found in said emails and incorporating them into the heuristic vocabulary with a corresponding probability of spam or non-spam, as the case may be, as well as updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic, if any, used to classify said emails, depending on their influence on said incorrect classifications.

As regards the update based on the analysis of the correctly classified mails, it consists mainly of updating the spam and non-spam probabilities of the words, both heuristic and non-heuristic, if any, used to classify these emails, depending on their influence on said correct classifications.
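The two kinds of update described above can be sketched in a single function. This is a hedged illustration only: the description does not specify the update rule, so the step size `rate`, the initial probabilities given to newly added words and the function name `learn` are assumptions. Moving each probability towards the true class strengthens the words that pushed towards the right answer and weakens those that pushed the wrong way.

```python
from typing import Dict, Iterable


def learn(vocab: Dict[str, float], words: Iterable[str], truly_spam: bool,
          misclassified: bool, rate: float = 0.1,
          initial_spam: float = 0.9, initial_ham: float = 0.1) -> None:
    """Update the spam probabilities in `vocab` in place.

    Known words move towards the true class of the email. New words are
    added only when the email was misclassified, matching the description
    that the vocabulary grows only from misclassified emails.
    """
    target = 1.0 if truly_spam else 0.0
    for word in set(words):
        if word in vocab:
            vocab[word] += rate * (target - vocab[word])
        elif misclassified:
            vocab[word] = initial_spam if truly_spam else initial_ham
```

The sender vocabulary, which according to the description grows with addresses validated as non-spam regardless of the classification result, would need its own rule and is not covered by this sketch.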

Once the update described has been completed, the filter deletes the content of the element with the legend "list and temporary emails received", that is, it deletes the temporary mail and its entry from the temporary list, since said temporary emails have already fulfilled their function, either by supplying the words to be incorporated into the heuristic vocabulary in order to increase it, and therefore to improve it, or by allowing the probabilities of the words of said vocabulary to be adjusted. After this, the filter updates its statistics and goes back to standby until the user selects one of the two possible options described.

A person skilled in the art could introduce changes and modifications in the described embodiments without departing from the scope of the invention as defined in the appended claims.

Claims

1.- Method for classifying email messages into wanted mail and unwanted mail, or spam, of the type based on the use of a probabilistic filter to perform said classification, characterized in that it comprises, at least prior to the beginning of said classification, generating a group of heuristic patterns and heuristic words, based on the analysis of a series or corpus of historical mails, said heuristic words forming at least one initial heuristic vocabulary, using at least morphological criteria to perform said analysis, and including said heuristic patterns, which relate said morphological criteria to the heuristic words, in a database, said database being supplied together with said at least one initial heuristic vocabulary to said probabilistic filter so that it can begin to function.

2. Method according to claim 1, characterized in that said probabilistic filter is a filter based on the Naïve Bayes method, or Bayesian filter.

3. Method according to claim 1 or 2, characterized in that it comprises improving the classification capacity of said probabilistic filter through an interaction with the user, by means of which the user indicates to the probabilistic filter which emails it has classified correctly and/or which emails it has classified incorrectly, every time the filter classifies emails or periodically.

4. Method according to claim 3, characterized in that it comprises updating and increasing said at least one initial heuristic vocabulary supplied to said filter, initially formed only by heuristic words, progressively, during the use of said probabilistic filter, by analyzing, among the emails classified by the filter, the emails classified incorrectly by said probabilistic filter and adding to the heuristic vocabulary one or more words obtained from said analysis.

5. Method according to claim 4, characterized in that said obtaining of said words to be included in the heuristic vocabulary is carried out by means of an analysis of the words contained in said incorrectly classified emails, also using at least said morphological criteria.

6. Method according to claim 4 or 5, characterized in that it comprises carrying out said update of said heuristic vocabulary at least in part automatically, analyzing each email, immediately or after a programmed period of time, once it has been classified by the probabilistic filter and validated by the user.

7. Method according to claim 4 or 5, characterized in that it comprises carrying out said updating of said heuristic vocabulary at least in part at the request of a user, analyzing only a series of emails indicated by the user.

8. Method according to claim 1, characterized in that said morphological criteria refer to at least one of the following criteria: length of each word analyzed, order or sequence and type of characters contained, adjacency of characters, or a combination thereof.

9. Method according to claim 8, characterized in that each of said characters is at least one of the group consisting of consonants, vowels, numbers and symbols.

10. Method according to claim 9, characterized in that said morphological criteria refer to the morphology of words for a plurality of different languages.

11. Method according to claim 1, characterized in that it comprises the use of a finite state automaton to generate said group of heuristic patterns and heuristic words by carrying out said analysis according to one or more morphological criteria, in order to screen the words supplied to said automaton, separating them into valid words and invalid words, according to said morphological criterion used by the automaton.

12. Method according to claim 11, characterized in that it comprises constructing and training said finite state automaton by supplying it with words extracted from emails labeled as valid, said automaton being capable, once trained, of recognizing as well-formed words those that are well formed with regard to the sequence and type of the characters that compose them, the sequence and type of characters contained in the words being the morphological criterion used by the automaton.

13. Method according to claim 12, characterized in that said automaton construction from the words of said emails labeled as valid is carried out by means of the use of an adapted ECGI algorithm.

14. Method according to claim 12, characterized in that it comprises providing said automaton with words extracted from emails labeled as invalid, and classifying as invalid words those not recognized as valid by the automaton.

15. Method according to claim 14, characterized in that it comprises analyzing, according to one or more of said morphological criteria, said words classified by the automaton as invalid and adding to said heuristic vocabulary at least one heuristic word, using at least one of said morphological criteria.

16. Method according to claim 15, characterized in that it comprises, for some or all of said heuristic words, obtaining at least one heuristic pattern per heuristic word, each pattern relating a heuristic word with at least one different morphological criterion, and registering in said database the heuristic patterns obtained.

17. Method according to claim 15, characterized in that it comprises obtaining the same heuristic word from said analysis of one or more invalid words, using two or more different morphological criteria.

18. Method according to claim 17, characterized in that it comprises, for said heuristic word obtained on the basis of said two or more morphological criteria, obtaining two or more heuristic patterns, each pattern relating the heuristic word with a respective morphological criterion of said at least two morphological criteria, and recording in said database the heuristic patterns obtained.

19. Method according to claim 16, characterized in that it also includes registering said morphological criteria in said database, by means of said heuristic patterns, with the heuristic words obtained from them.

20. Method according to claim 12 or 14, characterized in that at least one of said labelings is carried out by a user.

21. Method according to claim 1 or 16, characterized in that it comprises assigning to and associating with each of said heuristic words included in said heuristic vocabulary a probability that the presence in a mail of words that meet the morphological criteria referenced by said heuristic words is indicative that the mail is spam, or of a high probability of spam.

22. Method according to claim 21, characterized in that it comprises using said probabilistic filter to consult the words of said heuristic vocabulary and the heuristic patterns of said database, which relate the heuristic words to the morphological criteria on which they are based, in order to search, in the mails to be classified, words that meet one or more of said morphological criteria.

23. Method according to claim 21, characterized in that it comprises updating said probabilities of said heuristic words progressively based on their influence on the result of the analysis of the mails classified by said probabilistic filter, increasing said probabilities if the influence of the associated heuristic word has caused a correct classification, or vice versa.

24.- Method according to claim 22 or 23, characterized in that it comprises assigning to each word found, as a result of said search by the probabilistic filter, that meets the morphological criteria related by the first heuristic pattern of the database consulted, the heuristic word related by said first pattern and the probabilities assigned in the heuristic vocabulary.

25.- Method according to claim 24, characterized in that said heuristic patterns are arranged in the database according to the degree of spam probability assigned to the respective heuristic words related thereto, said query being performed by the probabilistic filter following that order.

26.- Method according to claim 4, characterized in that the analysis of the mails classified incorrectly comprises analyzing, by means of the Bayesian filter, each of the words of said mails.

27.- Method according to claim 2, characterized in that it comprises selecting, by a user, the operating mode of said Bayesian filter from among the following three modes, the selection determining which action will be performed on the received mails:

- mode 1, or blind trust mode, whose selection causes the emails that the filter identifies as spam not to be delivered to the user,

- mode 2, or supervised mode, whose selection causes the emails that the filter detects as spam to be delivered to the user including a string of characters, or label, in the field referring to the subject of the mail, and

- mode 3, whose selection causes all emails to be delivered to the user, regardless of whether or not they are spam.

28.- Method according to claim 27, characterized in that when the Bayesian filter operates in said mode 2, or supervised mode, it comprises the user choosing which character string, or label, to include in said field referring to the subject of the mail.

29.- Method according to claim 2, characterized in that it comprises using said Bayesian filter to analyze the emails received on one or more of the following fields separately, based on a corresponding selection by the user: sender, subject and body, and, in the case of having selected more than one field, combining the results of all the analyses performed to obtain a final result.

30.- Method according to claim 2, characterized in that it is applicable regardless of the mail client used.

31.- Method according to claim 3, characterized in that it comprises adjusting the error bias of the emails incorrectly classified, as spam or as non-spam, by said probabilistic filter by means of the modification, by the user, of a given parameter.

32.- Method according to claim 29, characterized in that it comprises, for said case of having selected more than one field to analyze, performing the analysis of the subject and body fields using a single common initial heuristic vocabulary.

33.- Method according to claim 29, characterized in that it comprises, for said case of having selected more than one field to analyze, performing the analysis of the subject and body fields using two respective initial heuristic vocabularies.

34.- Method according to claim 29, 32 or 33, characterized in that it comprises, for said case of having selected more than one field to be analyzed, performing the analysis of the sender field using an initially empty vocabulary, and updating said initially empty vocabulary, adding words related to email addresses obtained from respective email sender fields validated by the user as non-spam.