CN104361015A - Mail classification and recognition method - Google Patents

Info

Publication number
CN104361015A
Authority
CN
China
Prior art keywords
expression
mail
voice feature
feature data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410547075.1A
Other languages
Chinese (zh)
Inventor
罗阳
陈虹宇
王峻岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SICHUAN SHENHU TECHNOLOGY Co Ltd
Original Assignee
SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SICHUAN SHENHU TECHNOLOGY Co Ltd filed Critical SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority to CN201410547075.1A priority Critical patent/CN104361015A/en
Publication of CN104361015A publication Critical patent/CN104361015A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/107Computer-aided management of electronic mailing [e-mailing]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a mail classification and recognition method that classifies and recognizes mail at multiple levels in several ways. The method comprises: obtaining the classification attribute of the mail sent or received by a user according to facial expression and/or voice feature data captured while the user sends or receives the mail; constructing a plurality of classifiers for mail whose category cannot be determined; sending the classification result of each classifier to a decision center; voting on the classification results with a voting algorithm in the decision center; incrementally updating the classifiers; and updating a preset facial expression and/or voice feature database with the final classification result, which improves the efficiency of recognizing classification attributes. The method addresses the low recognition rate and low efficiency of prior-art mail classification and recognition methods.

Description

A mail classification and recognition method
Technical field
The present invention relates to a method for classifying and recognizing mail, applicable to fields such as web content supervision and spam filtering.
Background
With the development of Internet applications, e-mail has become widely used and is now one of the most basic services on the Internet; it lets users exchange information with remote users economically, conveniently, and efficiently. However, as e-mail has become an indispensable communication medium, it has also become a vehicle for commercial advertising. While receiving useful information, users must also spend considerable time and energy sorting through all kinds of mail to filter out "junk" mail. Existing mail classification methods either rely on a single classification technique, which makes the results inaccurate, or use overly complicated recognition schemes, which raises the time cost. How to improve the accuracy and efficiency of mail classification is therefore an active research topic.
Many classification methods are in common use. Probability-based methods, such as the Bayesian method, use the attribute values of the object to be classified to compute the conditional probability of each class given those attribute values, and output the class label with the largest conditional probability as the target value. Their drawback is that their preconditions are not easily satisfied. Instance-based methods, such as KNN, rely on the distance between instances: if the instances near a given instance all belong to a certain class, that instance probably belongs to the same class. Their drawback is low classification efficiency. Methods based on statistical learning include SVM. The SVM classifier is one of the best text classifiers currently available. Its drawback is that there is little guidance for choosing the kernel function, so it is hard to pick the best kernel for a particular problem. In addition, SVM training speed is strongly affected by the size of the training set, and the computational cost is relatively high.
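The Bayesian decision rule mentioned above can be written compactly. The notation here (classes c, attribute values a_1, ..., a_n) is introduced only for illustration. The factorized second line assumes the naive Bayes conditional-independence assumption, which is one common reading of the "precondition" that is hard to satisfy; the text itself does not name it.

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c \mid a_1,\dots,a_n)
        \;=\; \arg\max_{c}\; P(c)\,P(a_1,\dots,a_n \mid c)

% Under the (assumed) naive independence of attributes given the class:
\hat{c} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(a_i \mid c)
```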
Each of these methods has its own advantages and its own drawbacks; the highest classification accuracy is about 80%, which does not yet meet the requirements of practical use.
The core idea of a voting algorithm is that an effective combination of the judgments of k experts (k being an integer greater than 1) should be better than the judgment of any single expert. There are two main kinds of voting algorithm: the Bagging algorithm and the Boosting algorithm.
The support vector machine is widely used as a classification tool in many fields. A support vector machine maps vectors into a higher-dimensional space and constructs a maximum-margin hyperplane in that space. Two mutually parallel hyperplanes are placed on either side of the hyperplane that separates the data, and the separating hyperplane is oriented so that the distance between the two parallel hyperplanes is maximized. The assumption is that the larger the distance, or gap, between the parallel hyperplanes, the smaller the total error of the classifier.
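For the linearly separable case, the margin described above is usually stated as follows. This is the standard textbook formulation; the symbols w, b, x_i, y_i and m are notation assumed here and do not appear in the original text.

```latex
% Separating hyperplane and the two parallel hyperplanes on either side of it
w^{\top}x + b = 0, \qquad w^{\top}x + b = +1, \qquad w^{\top}x + b = -1

% Distance between the two parallel hyperplanes (the margin)
\text{margin} = \frac{2}{\lVert w \rVert}

% Maximizing the margin is equivalent to
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad y_i\,(w^{\top}x_i + b) \ge 1, \qquad i = 1,\dots,m
```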
The current historical information is represented by the support vectors and the weights associated with them. Therefore, in each incremental update, the support vectors that describe the class boundary, together with the newly arrived data, are used as the new data set to update the support vector machine.
Techniques for incrementally updating a support vector machine include the error-driven technique (ED), the fixed-partition technique (FP), the exceeding-margin technique (EM), and the exceeding-margin+error technique (EM+E).
Summary of the invention
The main purpose of the present invention is to provide a mail classification and recognition method that classifies mail at multiple levels in several ways and obtains an accurate classification result through voting in a decision center; that incrementally updates the constructed classifiers to improve their adaptive ability; and that updates the preset facial expression and/or voice feature database with the final classification result to improve the efficiency of recognizing classification attributes. The method can solve the problem that prior-art mail classification methods have a low recognition rate and low efficiency.
To achieve this purpose, according to one aspect of the present invention, a mail classification and recognition method is provided, comprising the following steps:
Step 1: obtain facial expression and/or voice feature data while the user sends or receives mail, and obtain the classification attribute of that mail according to the facial expression and/or voice feature data, where the classification attribute is one of: normal mail, spam, or cannot be confirmed;
If the classification attribute obtained is normal mail or spam, classification ends; otherwise step 2 is performed.
Step 2: use multiple classifiers to classify the mail in turn.
Further, the following steps are included after step 2:
Step 3: send the classification result of each classifier to a decision center, where a voting algorithm votes on the results obtained by the multiple classifiers to produce the final classification result;
Step 4: incrementally update the multiple classifiers, and update the preset facial expression and/or voice feature database with the final classification result.
Further, after step 1 and before step 2, the method includes:
Preprocessing the mail by word segmentation, feature vector extraction, and weight calculation;
Wherein the feature vector extraction comprises mail header feature vector extraction, attachment feature vector extraction, and body text feature vector extraction; and
Storing the extracted feature vectors in a feature vector database as database fields.
Further, the classifiers may be built using a learning algorithm based on decision trees.
Further, incrementally updating the multiple classifiers comprises:
Each time a mail is sent or received, obtaining the feature vector of that mail;
Determining whether the feature vector lies within the classification margin of an already constructed classifier;
If it lies within the classification margin, temporarily storing the mail;
When the number of stored mails reaches a preset value, using the feature vectors of the stored mails together with the support vectors of the constructed classifier as a new training sample set to incrementally update the constructed classifier;
Deleting the temporarily stored mails.
Further, the multiple classifiers may include an SVM classifier, a KNN classifier, and a Bayes classifier.
Further, the facial expression feature data comprise: eye position information, eye shape information, eyebrow position information, eyebrow shape information, face position information, and face shape information;
The voice feature data comprise: tone information, speech rate information, and filter keywords.
Further, obtaining the classification attribute of the mail sent or received by the user according to the facial expression and/or voice feature data in step 1 comprises:
Searching the preset facial expression and/or voice feature database for preset facial expression and/or voice feature data that match the obtained facial expression and/or voice feature data;
When the obtained facial expression and/or voice feature data are found to match first preset facial expression and/or voice feature data, determining that the corresponding facial expression and/or voice data are first facial expression and/or voice data, and determining that the type of the mail sent or received by the user is a first type, wherein the first preset facial expression and/or voice feature data are any facial expression and/or voice feature data in the preset database, and the preset database also stores the correspondence between facial expression and/or voice feature data and mail types; and
When the obtained facial expression and/or voice feature data are found to match second preset facial expression and/or voice feature data, determining that the corresponding facial expression and/or voice data are second facial expression and/or voice data, and determining that the type of the mail sent or received by the user is a second type, wherein the second preset facial expression and/or voice feature data are any facial expression and/or voice feature data in the preset database and are different from the first preset facial expression and/or voice feature data.
Further, after determining that the type of the mail sent or received by the user is the second type, the method also includes:
Comparing the priority of the first facial expression and/or voice data with that of the second facial expression and/or voice data;
When the priority of the first facial expression and/or voice data is higher than that of the second facial expression and/or voice data, arranging mail of the first type before mail of the second type; and
When the priority of the first facial expression and/or voice data is lower than that of the second facial expression and/or voice data, arranging mail of the first type after mail of the second type.
Further, before comparing the priorities of the first and second facial expression and/or voice data, the method also includes:
Receiving a setting instruction from the user; and
Determining the priorities of the first facial expression and/or voice data and the second facial expression and/or voice data according to the setting instruction.
The mail classification and recognition method of the present invention can achieve the following beneficial effects:
First, facial expression and/or voice feature data are obtained while the user sends or receives mail, and the mail is classified according to that data.
Generally speaking, when a user handles mail, the user's mood often changes with the mail content, or the user is already in a particular mood. Different moods produce different facial expression feature data. By obtaining the facial expression feature data while the user sends or receives mail and classifying the mail based on that data, the mail can be given a quick preliminary classification, because users remember their own mood while handling a mail relatively well.
Meanwhile, some spam (for example, advertisements) often contains unfamiliar voices, or speech with many commercial marketing terms, sensitive words, or other fixed forms, or, because it is a formatted recording, has a relatively stable speech rate and intonation; such mail is often easier to classify.
Facial expression and/or speech recognition shortens the classification time and provides the preliminary classification of the mail.
Second, the decision center uses a voting algorithm to vote on the classification results obtained by the multiple classifiers, yielding an accurate classification result.
Third, the multiple classifiers can be incrementally updated to improve their adaptive ability, and the preset facial expression and/or voice feature database is updated with the final classification result, which improves the efficiency of recognizing classification attributes.
Brief description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the invention; the illustrative embodiments and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the mail classification and recognition method according to an embodiment of the present invention.
Embodiments
It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
An embodiment of the invention provides a mail classification and recognition method, which is described in detail below:
Fig. 1 is a flowchart of the mail classification and recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 1: obtain facial expression and/or voice feature data while the user sends or receives mail, and obtain the classification attribute of that mail according to the facial expression and/or voice feature data, where the classification attribute is one of: normal mail, spam, or cannot be confirmed;
If the classification attribute obtained is normal mail or spam, classification ends; otherwise step 2 is performed;
Step 2: use multiple classifiers to classify the mail in turn.
Preferably, the following steps are included after step 2:
Step 3: send the classification result of each classifier to a decision center, where a voting algorithm votes on the results obtained by the multiple classifiers to produce the final classification result;
Step 4: incrementally update the multiple classifiers, and update the preset facial expression and/or voice feature database with the final classification result.
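Steps 1 to 3 can be sketched in Python as a two-stage flow with a plain majority vote in the decision center. This is only an illustrative sketch: the label constants, the `prelabel` callable (standing in for the facial expression and/or voice lookup of step 1), and the tie-breaking rule are assumptions made here, and the text does not say which voting variant the decision center uses.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical label names for the three classification attributes.
NORMAL, SPAM, UNCONFIRMED = "normal", "spam", "cannot_confirm"

def majority_vote(labels: Sequence[str]) -> str:
    """Decision center: return the most frequent label.

    Assumption: a tie for first place (or no votes) yields UNCONFIRMED.
    """
    counts = Counter(labels).most_common()
    if not counts:
        return UNCONFIRMED
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return UNCONFIRMED
    return counts[0][0]

def classify_mail(mail: object,
                  prelabel: Callable[[object], str],
                  classifiers: Sequence[Callable[[object], str]]) -> str:
    # Step 1: preliminary label from expression/voice feature data.
    label = prelabel(mail)
    if label in (NORMAL, SPAM):
        return label  # classification ends here
    # Step 2: run each classifier in turn on the unconfirmed mail.
    votes = [clf(mail) for clf in classifiers]
    # Step 3: the decision center votes on the individual results.
    return majority_vote(votes)
```

With three classifiers of which two answer `SPAM`, `classify_mail` returns `SPAM` for a mail that step 1 could not confirm; a mail that step 1 already labels `NORMAL` never reaches the classifiers.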
In a preferred embodiment of the invention, facial expression and/or voice feature data are obtained while the user sends or receives mail, and the mail is given a preliminary classification according to that data.
Generally speaking, when a user handles mail, the user's mood often changes with the mail content, or the user is already in a particular mood. Different moods produce different facial expression and/or voice feature data. By obtaining the facial expression and/or voice feature data while the user sends or receives mail and classifying the mail based on that data, the mail can be given a quick preliminary classification, because users remember their own mood while handling a mail relatively well.
Meanwhile, some spam (for example, advertisements) often contains unfamiliar voices, or speech with many commercial marketing terms, sensitive words, or other fixed forms; such mail is often easier to classify.
Facial expression and/or speech recognition shortens the classification time.
The decision center uses a voting algorithm to vote on the classification results obtained by the multiple classifiers, yielding an accurate classification result;
The multiple classifiers can be incrementally updated, which improves their adaptive ability and makes the classification result more accurate; meanwhile, updating the preset facial expression and/or voice feature database with the final classification result improves the efficiency of recognizing classification attributes.
In a preferred embodiment of the invention, the facial expression feature data may comprise relatively easily recognized features such as eye position information, eye shape information, eyebrow position information, eyebrow shape information, face position information, and face shape information;
The voice feature data may comprise tone information, speech rate information, filter keywords, and the like.
Obtaining the classification attribute of the mail sent or received by the user according to the facial expression and/or voice feature data comprises:
After obtaining the user's facial expression and/or voice feature data, searching the preset facial expression and/or voice feature database for preset facial expression and/or voice feature data that match the obtained data, wherein the preset database stores the type information corresponding to each item of facial expression and/or voice feature data;
When the obtained facial expression and/or voice feature data are found to match first preset facial expression and/or voice feature data, determining that the corresponding facial expression and/or voice data are first facial expression and/or voice data, and determining that the type of the mail sent or received by the user is a first type, wherein the first preset facial expression and/or voice feature data are any facial expression and/or voice feature data in the preset database, and the preset database also stores the correspondence between facial expression and/or voice feature data and mail types; and
When the obtained facial expression and/or voice feature data are found to match second preset facial expression and/or voice feature data, determining that the corresponding facial expression and/or voice data are second facial expression and/or voice data, and determining that the type of the mail sent or received by the user is a second type, wherein the second preset facial expression and/or voice feature data are any facial expression and/or voice feature data in the preset database and are different from the first preset facial expression and/or voice feature data.
After determining that the type of the mail sent or received by the user is the second type, the method also includes:
Comparing the priority of the first facial expression and/or voice data with that of the second facial expression and/or voice data;
When the priority of the first facial expression and/or voice data is higher than that of the second facial expression and/or voice data, arranging mail of the first type before mail of the second type; and
When the priority of the first facial expression and/or voice data is lower than that of the second facial expression and/or voice data, arranging mail of the first type after mail of the second type.
Before comparing the priorities of the first and second facial expression and/or voice data, the method also includes:
Receiving a setting instruction from the user; and
Determining the priorities of the first facial expression and/or voice data and the second facial expression and/or voice data according to the setting instruction.
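The priority-based ordering can be illustrated with a short Python sketch. The data layout (a list of `(mail_id, expression_id)` pairs and a user-supplied priority table in which a smaller number means a higher priority) is an assumption made here for illustration; the text only states that higher-priority types are arranged first and that the user sets the priorities.

```python
from typing import Dict, List, Tuple

def order_by_priority(mails: List[Tuple[str, str]],
                      priority: Dict[str, int]) -> List[Tuple[str, str]]:
    """Arrange mails so that higher-priority expression types come first.

    mails:    (mail_id, expression_id) pairs, expression_id being the
              user-defined ID of the matched expression/voice data.
    priority: user setting; smaller number = higher priority (assumed).
    Mails whose expression ID has no configured priority go last.
    """
    lowest = max(priority.values(), default=0) + 1
    # sorted() is stable, so mails of equal priority keep their order.
    return sorted(mails, key=lambda m: priority.get(m[1], lowest))

# Usage: the user ranks "happy" above "sad".
inbox = [("a", "sad"), ("b", "happy"), ("c", "sad")]
print(order_by_priority(inbox, {"happy": 0, "sad": 1}))
```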
In a preferred embodiment of the invention, the user's facial expression feature data are matched mainly by means of existing face recognition technology (for example, a regional feature analysis algorithm): a previously built facial feature template is compared with the obtained facial expression feature data, the analysis yields a similarity value, and this value determines whether the data correspond to a particular user-defined expression.
In a preferred embodiment of the invention, the user's voice feature data are matched mainly by means of existing speech recognition technology: a previously built voice feature template is compared with the obtained voice feature data, the analysis yields a similarity value, and this value determines whether the data correspond to a particular user-defined voice. In addition, if the mail contains common filtered sensitive words, commercial advertising vocabulary, or other user-defined filter terms, it can be classified as spam.
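The template matching described above can be sketched as a similarity search with a threshold. Everything specific in this sketch is an assumption: the text names neither the similarity measure nor the threshold, so cosine similarity, the value 0.9, the template layout, and the function names are hypothetical stand-ins for whatever the face or speech recognition component actually provides.

```python
import math
from typing import Dict, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_template(features: Sequence[float],
                   templates: Dict[str, Tuple[Sequence[float], str]],
                   threshold: float = 0.9) -> str:
    """Return the mail type of the best-matching preset template.

    templates maps a user-defined expression/voice ID to a pair
    (template feature vector, mail type). If no template reaches the
    similarity threshold, the attribute is "cannot_confirm".
    """
    best_type, best_sim = "cannot_confirm", threshold
    for _expr_id, (vec, mail_type) in templates.items():
        sim = cosine(features, vec)
        if sim >= best_sim:
            best_type, best_sim = mail_type, sim
    return best_type
```

A feature vector close to one template yields that template's mail type; a vector that resembles no template closely enough falls through to "cannot_confirm", which corresponds to handing the mail to the classifiers.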
In a preferred embodiment of the invention, because the way each user defines and recognizes moods is complex and varies from person to person, the facial expression and/or voice that different people show may differ greatly from their actual mood. In the preferred embodiment, when defining custom facial expression and/or voice feature data, the user can capture the feature information of the current facial expression and/or voice through a camera or microphone and at the same time set the mail type to which that facial expression and/or voice corresponds, so that custom facial expression and/or voice feature data can be set quickly and conveniently. When guiding the user through this definition, the user can be prompted to assign a unique ID to each distinct item of facial expression and/or voice feature data, for example one unique ID each for the facial expression and/or voice feature data shown under moods such as happiness, sadness, excitement, disgust, and doubt.
In a preferred embodiment of the invention, the facial expression and/or voice feature data may be configured by the user in advance through custom settings, or may be set during the following process: when the user sends or receives mail, the user's current facial expression and/or voice feature data are obtained in real time, the preset facial expression and/or voice feature database is queried for the preset data corresponding to the obtained data, and the type of the mail being sent or received is then determined to be the type corresponding to the preset data found.
However, if no preset facial expression and/or voice feature data corresponding to the currently obtained data are found in the preset database, the user has not yet defined this facial expression and/or voice feature data, and the classification attribute in step 1 is "cannot be confirmed". That is, if the facial expression and/or voice feature classification step cannot determine whether the mail is normal mail or spam, classifiers must be built to continue classifying the mail that cannot be confirmed.
In a preferred embodiment of the invention, if the classification attribute obtained from the facial expression and/or voice feature data is "cannot be confirmed", then:
The mail may be preprocessed by word segmentation, feature vector extraction, and weight calculation;
Wherein the feature vector extraction comprises mail header feature vector extraction, attachment feature vector extraction, and body text feature vector extraction; and
The extracted feature vectors are stored in a feature vector database as database fields.
Then, multiple classifiers are used to classify the mail in turn.
Preferably, the classification result of each classifier is sent to the decision center, where a voting algorithm votes on the results obtained by the multiple classifiers to produce the final classification result;
Then, the multiple classifiers are incrementally updated, and the preset facial expression and/or voice feature database is updated with the final classification result.
In an alternative embodiment of the invention, incrementally updating the multiple classifiers comprises the following steps:
Each time a mail is sent or received, obtain the feature vector of that mail;
Determine whether the feature vector lies within the classification margin of an already constructed classifier;
If it lies within the classification margin, temporarily store the mail;
When the number of stored mails reaches a preset value, use the feature vectors of the stored mails together with the support vectors of the constructed classifier as a new training sample set to incrementally update the constructed classifier;
Delete the temporarily stored mails.
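The buffering logic of these steps can be sketched as follows. This is a minimal sketch under stated assumptions: "within the classification margin" is read as |f(x)| < 1 for an SVM-style decision function f, each stored mail carries a label in {-1, +1}, and `retrain` is a hypothetical callable that fits a new classifier on the combined sample set and returns its decision function and support vectors. None of these names come from the original text.

```python
from typing import Callable, List, Sequence, Tuple

# (feature vector, label in {-1, +1}); the label encoding is assumed.
Sample = Tuple[Sequence[float], int]

class MarginBuffer:
    """Buffers mails that fall inside the margin and retrains in batches."""

    def __init__(self,
                 decision: Callable[[Sequence[float]], float],
                 support_vectors: List[Sample],
                 retrain: Callable[[List[Sample]], tuple],
                 batch_size: int = 100):
        self.decision = decision              # current decision function f(x)
        self.support_vectors = list(support_vectors)
        self.retrain = retrain                # hypothetical training routine
        self.batch_size = batch_size          # the "preset value"
        self.buffer: List[Sample] = []

    def observe(self, x: Sequence[float], label: int) -> bool:
        """Process one mail; return True if an update was triggered."""
        # Store the mail only if it lies inside the margin.
        if abs(self.decision(x)) < 1.0:
            self.buffer.append((x, label))
        if len(self.buffer) < self.batch_size:
            return False
        # New training set = old support vectors + buffered mails.
        training_set = self.support_vectors + self.buffer
        self.decision, self.support_vectors = self.retrain(training_set)
        self.buffer = []                      # delete the stored mails
        return True
```

Keeping only the support vectors from the previous model keeps the retraining set small, which matches the earlier remark that the historical information is represented by the support vectors.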
In an alternative embodiment of the invention, the multiple classifiers may include an SVM classifier, a KNN classifier, a Bayes classifier, and the like.
In a preferred embodiment of the invention, the classifier may be built with a decision tree learning algorithm, which learns to classify the mail in a mail training sample database and extracts recognition rules for spam.
The mail samples consist of spam samples and normal mail samples. The data mining learning method distinguishes normal mail from spam by statistically analyzing information such as the structural features and text features of each. Therefore, to achieve the best mining effect, the proportion of normal mail to spam in the training samples should be as close as possible to the real-world proportion.
RFC822 defines the basic format in which e-mail is transmitted over a network, consisting of more than 20 commonly used fields, their values, and a body. RFC1341 extends RFC822 with the Multipurpose Internet Mail Extensions (MIME) protocol, and together the two define the mail format in wide use today.
E-mail is semi-structured text. Its field identifiers and values carry much information about the mail's path from sending through forwarding to final delivery, such as the sender address, recipient address, sending time, sending program, and encoding format. This information can help determine whether a mail is spam. To process it, this application uses the vector space model (VSM) and represents a sample mail as (w1, w2, ..., wn, C). The attributes w1, w2, ..., wn are n feature attributes that help distinguish normal mail from spam, and the attribute C is the classification attribute of the sample mail. The value of C is one of: normal mail, spam, or cannot be confirmed. In this application, a mail is discretized and then represented by several such feature attributes.
The feature vector extraction comprises mail header feature vector extraction, attachment feature vector extraction, and body text feature vector extraction. The extracted feature vectors are stored in a feature vector database as database fields.
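As an illustration of turning a mail into a vector (w1, ..., wn, C), the following Python sketch uses the standard-library `email` package to read header, attachment, and body information. The four features chosen here (recipient count, presence of a subject, attachment count, body token count) are hypothetical examples, since the text does not list the actual feature attributes, and whitespace splitting stands in for a real word segmenter.

```python
from email import message_from_string
from email.utils import getaddresses

def mail_to_vector(raw: str, label: str = "cannot_confirm") -> tuple:
    """Map a raw RFC 822 / MIME message to (w1, w2, w3, w4, C).

    The feature choice is illustrative only:
      w1: number of addresses in the To header   (header feature)
      w2: 1 if a Subject header is present       (header feature)
      w3: number of attachments                  (attachment feature)
      w4: number of whitespace-separated tokens  (body text feature)
    """
    msg = message_from_string(raw)
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    attachments = [p for p in leaves if p.get_filename()]
    # Payloads are used as-is; transfer decoding is omitted in this sketch.
    body = " ".join(
        p.get_payload() for p in leaves
        if p.get_content_type() == "text/plain" and not p.get_filename()
    )
    w1 = len(getaddresses(msg.get_all("To", [])))
    w2 = int(msg["Subject"] is not None)
    w3 = len(attachments)
    w4 = len(body.split())
    return (w1, w2, w3, w4, label)

raw = ("From: a@example.com\n"
       "To: b@example.com, c@example.com\n"
       "Subject: hi\n"
       "\n"
       "buy now cheap\n")
print(mail_to_vector(raw, "spam"))
```

Each element of the tuple could be stored as one database field, in line with the storage scheme described above.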
In a preferred embodiment of the invention, the mail classifier is built by learning a decision tree. The decision tree learning algorithm is a greedy algorithm that constructs the tree recursively from the top down.
The decision tree starts as a single node representing the training samples. If the samples all belong to the same class C, the node becomes a leaf and is labeled with class C. Otherwise, the algorithm uses an entropy-based measure called information gain as heuristic information to select the attribute that best separates the samples into classes; this attribute becomes the test attribute of the node. A branch is created for each known value of the test attribute, and the samples are partitioned accordingly. The algorithm applies the same process recursively to form a decision tree for the samples in each partition. Once an attribute has been used at a node, it need not be considered again in any descendant of that node. The recursive partitioning stops only when one of the following conditions holds:
(1) All samples at the given node belong to the same class.
(2) No attributes remain with which to partition the samples further. In this case, majority voting is used: the given node is converted into a leaf and labeled with the class to which most of its samples belong.
(3) A branch has no samples. In this case, a leaf is created and labeled with the majority class of the samples.
At the start, the algorithm has only an empty decision tree and does not know how to classify examples by their attributes. Its task is to construct, from the training example set, a decision tree that predicts how the whole instance space is partitioned by attribute. Decision tree learning is thus a process in which the tree gradually reduces the uncertainty of the partition.
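The greedy, top-down construction and its three stopping conditions can be sketched in Python as follows. This is an illustrative ID3-style sketch under stated assumptions, not the applicant's code: samples are tuples whose last element is the class, attributes are referred to by tuple index, and the names `build_tree`, `domains`, and `classify` are hypothetical. For an empty branch, the sketch uses the majority class of the parent node's samples.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy of the class labels (last tuple element)."""
    counts = Counter(s[-1] for s in samples)
    total = len(samples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(samples, attr):
    """Entropy reduction obtained by partitioning the samples on attribute index attr."""
    partitions = {}
    for s in samples:
        partitions.setdefault(s[attr], []).append(s)
    remainder = sum(len(p) / len(samples) * entropy(p) for p in partitions.values())
    return entropy(samples) - remainder

def majority_class(samples):
    return Counter(s[-1] for s in samples).most_common(1)[0][0]

def build_tree(samples, attrs, domains):
    classes = {s[-1] for s in samples}
    if len(classes) == 1:                 # condition (1): one class only
        return classes.pop()
    if not attrs:                         # condition (2): no attributes left
        return majority_class(samples)
    best = max(attrs, key=lambda a: information_gain(samples, a))
    remaining = [a for a in attrs if a != best]  # never reuse in descendants
    node = {"attr": best, "branches": {}}
    for value in domains[best]:
        subset = [s for s in samples if s[best] == value]
        if not subset:                    # condition (3): empty branch
            node["branches"][value] = majority_class(samples)
        else:
            node["branches"][value] = build_tree(subset, remaining, domains)
    return node

def classify(tree, sample):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["attr"]]]
    return tree

samples = [
    ("yes", "few", "normal"),
    ("yes", "few", "normal"),
    ("yes", "many", "spam"),
    ("no", "many", "spam"),
    ("no", "many", "spam"),
]
domains = {0: ["yes", "no"], 1: ["few", "many", "none"]}
tree = build_tree(samples, [0, 1], domains)
```

In this toy data set attribute 1 separates the classes perfectly, so it has the highest information gain and becomes the root test; the value "none" never occurs in the samples and therefore receives a majority-class leaf.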
By learning from the mail training sample database, the above algorithm yields a mail classification decision tree. This tree must then be tested and evaluated against the mail test sample database, and modified and optimized according to the results. During evaluation, a two-dimensional variable (K1, K2) is attached to every leaf node of the learned tree whose classification result is not "cannot confirm": K1 records the number of test mails classified correctly at that leaf, and K2 records the number of test mails classified incorrectly there. The classification error rate of the leaf is computed as error = K2 / (K1 + K2). For leaves whose classification result is "spam" and whose error rate exceeds the acceptable threshold, the classification result is changed to "normal mail".
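The leaf-level evaluation can be sketched as below. The function name `revise_leaves`, the leaf identifiers, and the threshold value 0.3 are assumptions for illustration; the source does not state a concrete threshold. Leaves that receive no test mail are left unchanged in this sketch.

```python
def revise_leaves(leaf_labels, test_results, threshold=0.3):
    """Relabel unreliable "spam" leaves as "normal".

    leaf_labels:  dict mapping a leaf id to its predicted class.
    test_results: iterable of (leaf id reached by a test mail, true class).
    threshold:    acceptable error rate (assumed value, not from the source).
    """
    # One [K1, K2] counter per leaf whose result is not "cannot confirm".
    counts = {leaf: [0, 0] for leaf, label in leaf_labels.items()
              if label != "cannot confirm"}
    for leaf, truth in test_results:
        if leaf in counts:
            if truth == leaf_labels[leaf]:
                counts[leaf][0] += 1      # K1: correctly classified
            else:
                counts[leaf][1] += 1      # K2: misclassified
    revised = dict(leaf_labels)
    for leaf, (k1, k2) in counts.items():
        if k1 + k2 == 0:
            continue                      # no test mail reached this leaf
        error = k2 / (k1 + k2)
        if leaf_labels[leaf] == "spam" and error > threshold:
            revised[leaf] = "normal"
    return revised

leaf_labels = {"A": "spam", "B": "spam", "C": "normal", "D": "cannot confirm"}
test_results = [
    ("A", "spam"), ("A", "normal"),                                # error 0.50
    ("B", "spam"), ("B", "spam"), ("B", "spam"), ("B", "normal"),  # error 0.25
    ("C", "spam"), ("C", "spam"),                                  # not a spam leaf
    ("D", "spam"),
]
revised = revise_leaves(leaf_labels, test_results)
```

Only "spam" leaves are ever relabeled, which biases the corrected tree toward letting doubtful mail through rather than discarding normal mail.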
When classifying, identifying, and filtering mail, the mail classifier converts the learned classification decision tree into classification rules and uses these rules to filter and identify spam. The rules extracted from the classification decision tree are expressed in if-then form.
During rule extraction, one rule is created for each path from the root of the tree to a leaf. Each rule is a set of conjuncts, and each attribute-value pair along the path corresponds to one conjunct of the rule. These conjuncts form the antecedent (the "if" part) of the rule. The leaf node reached at the end of each path gives the rule's prediction for the classification of the mail; the class prediction of the leaf node forms the consequent (the "then" part) of the rule.
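The path-to-rule conversion can be illustrated with a short Python sketch. The nested-dictionary tree layout, the attribute names `has_sender` and `recipients`, and the functions `extract_rules` and `format_rule` are hypothetical and chosen only for this example.

```python
def extract_rules(tree, path=()):
    """Return one (conjuncts, class) pair per root-to-leaf path."""
    if not isinstance(tree, dict):        # a leaf: the path so far is a complete rule
        return [(list(path), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        conjunct = (tree["attr"], value)  # one attribute-value pair per edge
        rules.extend(extract_rules(subtree, path + (conjunct,)))
    return rules

def format_rule(conjuncts, label):
    """Render a rule in if-then form; an empty antecedent is always true."""
    antecedent = " AND ".join(f"{a} = {v}" for a, v in conjuncts) or "TRUE"
    return f"IF {antecedent} THEN class = {label}"

tree = {
    "attr": "has_sender",
    "branches": {
        "no": "spam",
        "yes": {
            "attr": "recipients",
            "branches": {"few": "normal", "many": "spam"},
        },
    },
}
rules = extract_rules(tree)
text_rules = [format_rule(c, label) for c, label in rules]
```

The tree above has three leaves, so three rules are produced; the second one reads `IF has_sender = yes AND recipients = few THEN class = normal`.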
The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A mail classification and recognition method, characterized in that the method comprises the following steps:
Step 1: acquiring expression and/or voice feature data of a user while the user sends or receives mail, and obtaining, according to the expression and/or voice feature data, the category attribute of the mail sent or received by the user, the category attribute comprising: normal mail, spam, and cannot confirm;
if the obtained category attribute is normal mail or spam, ending the classification; otherwise, performing Step 2;
Step 2: classifying the mail with a plurality of classifiers in turn.
2. The mail classification and recognition method according to claim 1, characterized in that the method further comprises the following steps after Step 2:
Step 3: sending the classification result of each classifier to a decision center, and voting in the decision center, using a voting algorithm, on the classification results obtained by the plurality of classifiers to obtain a final classification result;
Step 4: incrementally updating the plurality of classifiers, and updating the preset expression and/or voice feature database with the final classification result.
3. The mail classification and recognition method according to claim 2, characterized in that, after Step 1 and before Step 2, the method comprises:
preprocessing the mail by word segmentation, feature vector extraction, and weight calculation;
wherein the feature vector extraction comprises: mail header feature vector extraction, attachment feature vector extraction, and body text feature vector extraction; and
storing the extracted feature vectors in a feature vector database as database fields.
4. The mail classification and recognition method according to claim 3, characterized in that the classifiers may be built with a decision-tree-based learning algorithm.
5. The mail classification and recognition method according to claim 4, characterized in that incrementally updating the plurality of classifiers comprises:
each time a mail is sent or received, obtaining the feature vector of the mail;
determining whether the feature vector lies within the classification margin of a built classifier;
if the feature vector lies within the classification margin, temporarily storing the mail;
when the number of stored mails reaches a preset value, using the feature vectors of the stored mails together with the support vectors of the built classifier as a new training sample set to incrementally update the built classifier; and
deleting the temporarily stored mails.
6. The mail classification and recognition method according to claim 5, characterized in that the plurality of classifiers may comprise: an SVM classifier, a KNN classifier, and a Bayes classifier.
7. The mail classification and recognition method according to any one of claims 1 to 6, characterized in that
the expression feature data comprises: eye position information, eye shape information, eyebrow position information, eyebrow shape information, face position information, and face shape information; and
the voice feature data comprises: tone information, speech rate information, and filtering keywords.
8. The mail classification and recognition method according to claim 7, characterized in that obtaining, in Step 1, the category attribute of the mail sent or received by the user according to the expression and/or voice feature data comprises:
searching a preset expression and/or voice feature database for preset expression and/or voice feature data that matches the expression and/or voice feature data;
when the expression and/or voice feature data is found to match first preset expression and/or voice feature data, determining that the expression and/or voice data corresponding to the expression and/or voice feature data is first expression and/or voice data, and determining that the type of the mail sent or received by the user is a first type, wherein the first preset expression and/or voice feature data is any expression and/or voice feature data in the preset expression and/or voice feature database, and the preset expression and/or voice feature database also stores the correspondence between expression and/or voice feature data and mail types; and
when the expression and/or voice feature data is found to match second preset expression and/or voice feature data, determining that the expression and/or voice data corresponding to the expression and/or voice feature data is second expression and/or voice data, and determining that the type of the mail sent or received by the user is a second type, wherein the second preset expression and/or voice feature data is any expression and/or voice feature data in the preset expression and/or voice feature database, and the second preset expression and/or voice feature data differs from the first preset expression and/or voice feature data.
9. The mail classification and recognition method according to claim 8, characterized in that,
after determining that the type of the mail sent or received by the user is the second type, the method further comprises:
comparing the priority of the first expression and/or voice data with the priority of the second expression and/or voice data;
when the priority of the first expression and/or voice data is higher than the priority of the second expression and/or voice data, arranging mail of the first type before mail of the second type; and
when the priority of the first expression and/or voice data is lower than the priority of the second expression and/or voice data, arranging mail of the first type after mail of the second type.
10. The mail classification and recognition method according to claim 9, characterized in that,
before comparing the priority of the first expression and/or voice data with the priority of the second expression and/or voice data, the method further comprises:
receiving a setting instruction from the user; and
determining the priorities of the first expression and/or voice data and the second expression and/or voice data according to the setting instruction.
CN201410547075.1A 2014-10-14 2014-10-14 Mail classification and recognition method Pending CN104361015A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410547075.1A CN104361015A (en) 2014-10-14 2014-10-14 Mail classification and recognition method

Publications (1)

Publication Number Publication Date
CN104361015A true CN104361015A (en) 2015-02-18

Family

ID=52528277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410547075.1A Pending CN104361015A (en) 2014-10-14 2014-10-14 Mail classification and recognition method

Country Status (1)

Country Link
CN (1) CN104361015A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004094583A (en) * 2002-08-30 2004-03-25 Ntt Advanced Technology Corp Method of classifying writings
CN1750030A (en) * 2005-10-25 2006-03-22 二六三网络通信股份有限公司 Method for filtering junk nails
CN101316246A (en) * 2008-07-18 2008-12-03 北京大学 Junk mail detection method and system based on dynamic update of categorizer
CN102902726A (en) * 2012-09-06 2013-01-30 北京天宇朗通通信设备股份有限公司 Method and device for sorting electronic mails

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106330680A (en) * 2016-08-30 2017-01-11 黑龙江八农垦大学 Electronic mail cleaning method
CN106330680B (en) * 2016-08-30 2018-06-12 黑龙江八一农垦大学 A kind of Email method for cleaning
CN106453033A (en) * 2016-08-31 2017-02-22 电子科技大学 Multilevel Email classification method based on Email content
CN106453033B (en) * 2016-08-31 2019-03-15 电子科技大学 Multi-level process for sorting mailings based on Mail Contents
CN106372237A (en) * 2016-09-13 2017-02-01 新浪(上海)企业管理有限公司 Fraudulent mail identification method and device
CN108021565A (en) * 2016-11-01 2018-05-11 中国移动通信有限公司研究院 A kind of analysis method and device of the user satisfaction based on linguistic level
CN108021565B (en) * 2016-11-01 2021-09-10 中国移动通信有限公司研究院 User satisfaction analysis method and device based on conversation
CN109033155A (en) * 2018-06-13 2018-12-18 中国电子科技集团公司电子科学研究院 Search mail content and method, device, terminal and storage medium
CN111310939A (en) * 2018-12-11 2020-06-19 王俊杰 Remote checking processing system for article recovery

Similar Documents

Publication Publication Date Title
CN104361015A (en) Mail classification and recognition method
CN107067025B (en) Text data automatic labeling method based on active learning
CN106453033B (en) Multi-level process for sorting mailings based on Mail Contents
CN110413780A (en) Text emotion analysis method, device, storage medium and electronic equipment
CN112989035B (en) Method, device and storage medium for identifying user intention based on text classification
Pong-Inwong et al. Improved sentiment analysis for teaching evaluation using feature selection and voting ensemble learning integration
CN103425777A (en) Intelligent short message classification and searching method based on improved Bayesian classification
CN108604228A (en) System and method for the language feature generation that multilayer word indicates
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN102640089A (en) System and method for inputting text into electronic devices
CN108199951A (en) A kind of rubbish mail filtering method based on more algorithm fusion models
WO2016177069A1 (en) Management method, device, spam short message monitoring system and computer storage medium
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN110442568A (en) Acquisition methods and device, storage medium, the electronic device of field label
CN109800852A (en) A kind of multi-modal spam filtering method
CN104731958A (en) User-demand-oriented cloud manufacturing service recommendation method
CN113326377A (en) Name disambiguation method and system based on enterprise incidence relation
CN109933648A (en) A kind of differentiating method and discriminating device of real user comment
CN110472057B (en) Topic label generation method and device
CN111079427A (en) Junk mail identification method and system
JP5098631B2 (en) Mail classification system, mail search system
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
CN103490974A (en) Junk mail detection method and device
Al-Alwani Improving email response in an email management system using natural language processing based probabilistic methods
CN110781297A (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150218
