CN107196844A - Exception mail recognition methods and device - Google Patents
- Publication number
- CN107196844A (application CN201611065946.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- characteristic value
- history
- exception
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Embodiments of the present application provide an abnormal mail recognition method and device. Feature extraction and data training are performed on a large amount of historical mail data to obtain a mail recognition model, and the model automatically recognizes whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely by the model. Because the model is trained on a large amount of historical mail data, its recognition coverage is more comprehensive, which prevents abnormal mails from going unrecognized because of loopholes in a sensitive-information inspection policy and thereby improves recognition accuracy.
Description
Technical field
The present application relates to the technical field of network management, and in particular to an abnormal mail recognition method and device.
Background
Email has become an important means of business and personal communication because it is easy to use, fast, timely, low in cost and rich in information. Emails often contain valuable information. To prevent confidential information, or files containing sensitive information, from leaking, many enterprises manage the emails that their employees send or receive. There are many specific management approaches, for example: prohibiting an email from being sent when attributes such as its recipient, sender, subject, attachment name or mail size contain sensitive information; restricting employees to sending email only from mailboxes designated by the enterprise; and requiring that a mail be copied to a specific person (such as a department manager) before it can be sent successfully.
The prior-art approaches to email management all have limitations. For example, the approach of checking for sensitive information and blocking abnormal mails can hardly cover every piece of sensitive information and every attribute that might contain it, so the inspection policy has loopholes and some abnormal emails go undetected. The approach of copying mails to a specific person, mentioned above, requires that person to check manually whether each mail is legitimate, which adds to the workload.
Therefore, a scheme is urgently needed that recognizes abnormal mails automatically and accurately, reducing the sending of abnormal mails without increasing the workload of the relevant personnel.
Summary of the invention
The present application provides an abnormal mail recognition method and a related device, so as to recognize abnormal mails automatically and accurately and reduce the sending of abnormal mails without increasing the workload of the relevant personnel.
To solve the above technical problem, the embodiments of the present application disclose the following technical solutions:
In a first aspect, an embodiment of the present application provides an abnormal mail recognition method, including:
obtaining historical mail data and annotation information corresponding to the historical mail data, where the annotation information marks the historical mail data as normal mail data or abnormal mail data;
performing feature extraction on the historical mail data to obtain a feature value set corresponding to the historical mail data;
establishing a mail recognition model according to the annotation information and the feature value set;
when a mail sending event is detected, recognizing a target mail using the mail recognition model to determine whether the target mail is an abnormal mail.
Optionally, performing feature extraction on the historical mail data includes:
extracting, from the historical mail data, an independent feature value corresponding to each feature;
and/or extracting, from the historical mail data, an associated feature value corresponding to multiple interrelated features.
Optionally, establishing the mail recognition model according to the annotation information and the feature value set includes:
dividing the feature value set into a normal mail sample set and an abnormal mail sample set according to the annotation information, where the normal mail sample set includes at least a first normal subset and a second normal subset that are disjoint from each other, and the abnormal mail sample set includes at least a first abnormal subset and a second abnormal subset that are disjoint from each other;
performing data training according to the first normal subset and/or the first abnormal subset to obtain an initial model;
verifying the initial model according to the second normal subset and/or the second abnormal subset to obtain the mail recognition model.
Optionally, establishing the mail recognition model according to the annotation information and the feature value set includes:
performing data training according to a first subset of the feature value set to obtain an initial model;
verifying the initial model according to a second subset of the feature value set to obtain the mail recognition model.
Optionally, establishing the mail recognition model according to the annotation information and the feature value set includes:
establishing the mail recognition model from the annotation information and the feature value set by means of a binary logistic regression algorithm.
Optionally, recognizing the target mail using the mail recognition model includes:
matching each feature value of the target mail against the feature value range recorded in the mail recognition model for the corresponding feature, to obtain a corresponding matching value;
computing a weighted sum of the matching values according to the weight recorded in the mail recognition model for each feature;
determining whether the target mail is an abnormal mail according to the weighted sum.
Optionally, the method further includes:
verifying the recognition result after the target mail has been recognized using the mail recognition model;
when verification shows that the recognition result is wrong, adding the mail data corresponding to the target mail to the historical mail data and re-establishing the mail recognition model.
Optionally, before the mail recognition model is established according to the annotation information and the feature value set, the method further includes:
performing a data cleaning operation on the feature value set;
and/or converting non-numeric feature values in the feature value set into numeric feature values.
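The optional preprocessing above can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names (`sender_domain`, `send_minute`), the cleaning rules (drop records with missing values and exact duplicates) and the integer-code mapping are hypothetical choices made for this example, since the application does not specify them.

```python
def clean(rows):
    """Drop rows with missing values and exact duplicates, preserving order."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # keys are unique, so only keys are compared
        if None in row.values() or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

def encode_column(rows, column):
    """Replace each distinct non-numeric value in `column` with an integer code."""
    codes = {}
    for row in rows:
        codes.setdefault(row[column], len(codes))
    encoded = [dict(row, **{column: codes[row[column]]}) for row in rows]
    return encoded, codes

# Hypothetical feature records, loosely modelled on Table 1.
rows = [
    {"sender_domain": "mail.163.com", "send_minute": 510},
    {"sender_domain": "mail.qq.com", "send_minute": 512},
    {"sender_domain": "mail.163.com", "send_minute": 510},  # duplicate
    {"sender_domain": None, "send_minute": 515},            # missing value
]
cleaned = clean(rows)
encoded, codes = encode_column(cleaned, "sender_domain")
print(encoded, codes)
```

Integer codes impose an artificial order on categories; a one-hot encoding may be a better fit for some models, but that choice is outside what the application describes.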
In a second aspect, an embodiment of the present application provides an abnormal mail recognition device, including:
a data acquisition unit, configured to obtain historical mail data and annotation information corresponding to the historical mail data, where the annotation information marks the historical mail data as normal mail data or abnormal mail data;
a data processing unit, configured to perform feature extraction on the historical mail data to obtain a feature value set corresponding to the historical mail data;
a modeling unit, configured to establish a mail recognition model according to the annotation information and the feature value set;
a recognition unit, configured to recognize a target mail using the mail recognition model when a mail sending event is detected, to determine whether the target mail is an abnormal mail.
Optionally, the device further includes:
a verification unit, configured to verify the recognition result obtained by the recognition unit and, when verification shows that the recognition result is wrong, to re-trigger the data acquisition unit, the data processing unit and the modeling unit, so that the mail data corresponding to the target mail is added to the historical mail data and the mail recognition model is re-established.
Optionally, the device further includes:
a data cleaning unit, configured to perform a data cleaning operation on the feature value set;
and/or a data conversion unit, configured to convert non-numeric feature values in the feature value set into numeric feature values.
It can be seen from the above technical solutions that the embodiments of the present application perform feature extraction and data training on a large amount of historical mail data to obtain a mail recognition model, and use the model to recognize automatically whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely by the model. Because the model is trained on a large amount of historical mail data, its recognition coverage is more comprehensive, which prevents abnormal mails from going unrecognized because of loopholes in a sensitive-information inspection policy and thereby improves recognition accuracy.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of an abnormal mail recognition method provided by an embodiment of the present application;
Fig. 2 is a flow chart of another abnormal mail recognition method provided by an embodiment of the present application;
Fig. 3 is a flow chart of another abnormal mail recognition method provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an abnormal mail recognition device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of another abnormal mail recognition device provided by an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, and to make the above objects, features and advantages of the embodiments clearer, the technical solutions are described in further detail below with reference to the drawings.
Fig. 1 is a flow chart of an abnormal mail recognition method provided by an embodiment of the present application.
Referring to Fig. 1, the abnormal mail recognition method includes the following steps:
S11: obtain historical mail data and the annotation information corresponding to the historical mail data.
The annotation information marks the historical mail data as normal mail data or abnormal mail data.
In the embodiments of the present application, the historical mail data can be extracted from an existing mail management database or from clients, and can cover all historical mails or only the historical mails of a period of time (such as 2 years or 1 year). To ensure the accuracy of abnormal mail recognition, as much historical mail data as possible should be obtained. The historical mail data can include various feature data related to each historical mail, such as the sending time, sender, sending IP address and sender mailbox web address shown in Table 1 below, as well as the recipient mailbox web address, the sent content, the sent attachments and information about the client that sent the mail.
Table 1: Historical mail data statistics
Sending time | Sender | Sending IP address | Sender mailbox web address |
2016/6/1 8:30 | User1 | 192.168.10.1 | mail.163.com |
2016/6/1 8:33 | User2 | 192.168.10.1 | mail.163.com |
2016/6/2 8:35 | User3 | 192.168.10.1 | mail.126.com |
2016/6/2 8:32 | User4 | 192.168.10.1 | mail.qq.com |
2016/6/3 8:32 | User5 | 192.168.10.2 | mail.qq.com |
2016/6/1 9:32 | User6 | 192.168.10.2 | mail.qq.com |
In the embodiments of the present application, the annotation information can indicate, for each kind of data related to each historical mail, whether that data is abnormal data, or can indicate whether each historical mail as a whole is an abnormal mail.
Optionally, the annotation information can be obtained by automatically annotating the historical mail data according to preset rules. For example, the sender mailbox web address of each historical mail is annotated by traversing the historical mail data and judging whether each sender mailbox web address is in a preset blacklist: first annotation information (indicating that the corresponding data is abnormal data) is added to mailbox web addresses that are in the preset blacklist, while mailbox web addresses that are not in the blacklist either receive no annotation information or receive second annotation information (indicating that the corresponding data is normal data).
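The blacklist-based automatic annotation just described might look like the following Python sketch. The record layout, the key name `sender_domain`, the label values and the blacklist contents are all assumptions made for illustration; the application only requires that blacklisted sender mailbox web addresses receive the first annotation and the rest receive none or the second.

```python
# Hypothetical label values: 1 stands for the first annotation (abnormal data),
# 0 stands for the second annotation (normal data).
ABNORMAL, NORMAL = 1, 0

def label_by_blacklist(history, blacklist):
    """Return one label per historical record, based only on the sender mailbox web address."""
    return [ABNORMAL if record["sender_domain"] in blacklist else NORMAL
            for record in history]

# A few records shaped after Table 1.
history = [
    {"sender": "User1", "sender_domain": "mail.163.com"},
    {"sender": "User3", "sender_domain": "mail.126.com"},
    {"sender": "User4", "sender_domain": "mail.qq.com"},
]
# Assumed blacklist, for illustration only.
labels = label_by_blacklist(history, blacklist={"mail.126.com"})
print(labels)
```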
Optionally, the annotation information can also be obtained from information input by a user. For example, a user can be asked to judge whether the body content of a historical mail is legitimate (for example, whether it is free of sensitive information); the judgment result input by the user is received, and the corresponding annotation information is generated from it. If the judgment result indicates that the body content contains sensitive information, the first annotation information is added; if it indicates that the body content does not contain sensitive information, either no annotation information is added or the second annotation information is added.
S12: perform feature extraction on the historical mail data to obtain the feature value set corresponding to the historical mail data.
S13: establish a mail recognition model according to the annotation information and the feature value set.
Different application scenarios have different application requirements and different abnormal mail characteristics, so any of the various features of an email may serve as a basis for judging whether the email is abnormal, including the sending time, sender, sending IP address, sender mailbox web address, recipient mailbox web address, sent content, sent attachments and information about the client that sent the mail described above. Therefore, in the embodiments of the present application, a feature extraction operation is performed on a large amount of historical mail data to obtain the feature values of the various features of a large number of historical mails. These feature values form the feature value set, that is, the sample data set, from which the mail recognition model is obtained through data training, as described in step S13.
Optionally, the data training process described in the embodiments of the present application specifically uses a supervised learning process in which the annotation information serves as the labels.
S14: when a mail sending event is detected, recognize the target mail using the mail recognition model to determine whether the target mail is an abnormal mail.
In one feasible implementation of the present application, step S13 can train on normal mail feature values only to obtain the mail recognition model. In this case, the annotation information can be used to remove the abnormal mail feature values from the feature value set.
In another feasible implementation of the present application, step S13 can train on both normal mail feature values and abnormal mail feature values. In this case, the feature value set is first divided into a normal mail sample set and an abnormal mail sample set according to the annotation information (feature values whose annotation information indicates abnormal data are placed in the abnormal mail sample set, and feature values whose annotation information indicates normal data are placed in the normal mail sample set), and the two sets are then used for training as positive samples and negative samples respectively.
It can be seen from the above technical solutions that the embodiments of the present application perform feature extraction and data training on a large amount of historical mail data to obtain a mail recognition model, and use the model to recognize automatically whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely by the model. Because the model is trained on a large amount of historical mail data, its recognition coverage is more comprehensive, which prevents abnormal mails from going unrecognized because of loopholes in a sensitive-information inspection policy and thereby improves recognition accuracy.
Optionally, in the embodiments of the present application, the feature extraction performed on the historical mail data in step S12 can specifically be: extracting from the historical mail data the independent feature value corresponding to each feature. It can also be: extracting from the historical mail data the associated feature value corresponding to multiple interrelated features.
For example, features such as the sending time, sender mailbox web address and recipient mailbox web address described above are relatively independent: whether such a feature is normal has no necessary connection with other features. The feature values of these features can therefore be extracted individually and recorded as independent feature values.
An associated feature that corresponds to multiple features covers mails of the following kind: each feature making up the associated feature is normal on its own, but the correspondence between them is unusual. Take the associated feature formed by the sender and the sending IP address as an example. Under an application requirement that forbids sending mail on behalf of others, suppose sender B sends an email from his or her own IP address using sender A's mailbox address. The sender is normal and the sending IP address is normal, but the correspondence between the two is abnormal. If the normality of the sender and of the sending IP address were each checked in isolation and their correspondence ignored, this proxy-sent mail could not be correctly recognized. Therefore, in addition to independent feature values, the embodiments of the present application also extract the associated feature values corresponding to associated features, which improves the accuracy of abnormal mail recognition and satisfies more complex mail management requirements.
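The distinction between independent and associated feature values can be illustrated with a small Python sketch. The record keys (`sent_at`, `sender`, `send_ip`, `sender_domain`) and the choice to encode the sending time as minutes since midnight are assumptions for this example, not details taken from the application.

```python
from datetime import datetime

def extract_features(mail):
    """Extract independent feature values plus one associated feature value."""
    sent = datetime.strptime(mail["sent_at"], "%Y/%m/%d %H:%M")
    return {
        # Independent features: each is judged on its own.
        "send_minute": sent.hour * 60 + sent.minute,
        "sender_domain": mail["sender_domain"],
        # Associated feature: the (sender, sending IP) pair is judged as a whole,
        # so a normal sender on a normal but mismatched IP can still be caught.
        "sender_ip_pair": (mail["sender"], mail["send_ip"]),
    }

# One record shaped after the first row of Table 1.
mail = {
    "sent_at": "2016/6/1 8:30",
    "sender": "User1",
    "send_ip": "192.168.10.1",
    "sender_domain": "mail.163.com",
}
features = extract_features(mail)
print(features)
```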
Based on the independent feature values and associated feature values above, step S13 can specifically collect all the feature values of each feature (including the independent features and associated features above) and determine the feature value range corresponding to that feature, which is then matched against the corresponding feature value of the target mail.
For example, aggregating all the extracted sending times gives an earliest time of 8:30 and a latest time of 17:30, so the feature value range for the sending time feature is 8:30 to 17:30. When anomaly recognition is performed on a target mail, if its sending time is outside the range 8:30 to 17:30, the sending time is abnormal and the target mail may be an abnormal mail. A matching value indicating that this feature is abnormal can therefore be generated from the matching result, raising the probability that the target mail is recognized as an abnormal mail.
As another example, aggregating the extracted associated feature values of sender and sending IP address gives the set of correspondences between sender and sending IP address in normal mails. When anomaly recognition is performed on a target mail, it is judged whether the correspondence between the sender and the sending IP address in the target mail (e.g. "User1": "192.168.10.2") is in this set of correspondences. If it is not, the sender or the sending IP address of the target mail is abnormal and the target mail may be an abnormal mail, so a matching value indicating that this feature is abnormal can likewise be generated from the matching result, raising the probability that the target mail is recognized as an abnormal mail.
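The two examples above (a time range and a set of sender/IP correspondences) can be sketched in Python as follows. The function names, the dictionary layout and the convention that a matching value of 1 means "outside the normal range" are assumptions chosen for illustration.

```python
def build_profile(normal_samples):
    """Derive per-feature value ranges from the feature values of normal mails."""
    minutes = [s["send_minute"] for s in normal_samples]
    return {
        "send_minute_range": (min(minutes), max(minutes)),
        "sender_ip_pairs": {s["sender_ip_pair"] for s in normal_samples},
    }

def match(profile, sample):
    """Return a matching value per feature: 0 if within the normal range, 1 if not."""
    low, high = profile["send_minute_range"]
    return {
        "send_minute": 0 if low <= sample["send_minute"] <= high else 1,
        "sender_ip_pair": 0 if sample["sender_ip_pair"] in profile["sender_ip_pairs"] else 1,
    }

# Hypothetical normal samples: earliest mail at 8:30, latest at 17:30.
normal_samples = [
    {"send_minute": 8 * 60 + 30, "sender_ip_pair": ("User1", "192.168.10.1")},
    {"send_minute": 17 * 60 + 30, "sender_ip_pair": ("User5", "192.168.10.2")},
]
profile = build_profile(normal_samples)

# Target mail sent at 9:00 by User1, but from an IP never paired with User1.
target = {"send_minute": 9 * 60, "sender_ip_pair": ("User1", "192.168.10.2")}
matching_values = match(profile, target)
print(matching_values)
```

In this example the sending time is within range while the sender/IP pair is not, which is exactly the proxy-sending case that independent features alone would miss.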
In one feasible implementation of the present application, recognizing the target mail using the mail recognition model in step S14 can specifically include the following steps:
matching each feature value of the target mail against the feature value range recorded in the mail recognition model for the corresponding feature, to obtain a corresponding matching value;
computing a weighted sum of the matching values according to the weight recorded in the mail recognition model for each feature;
determining whether the target mail is an abnormal mail according to the weighted sum.
Optionally, if step S13 trains on normal mail feature values only, the resulting mail recognition model can record the value ranges of all the features that normal mails have, together with the weight with which each feature affects mail normality. Accordingly, when step S14 performs anomaly recognition on a new mail (the target mail), the mail recognition model can match each feature value of the target mail against the corresponding feature value range recorded in the model and generate a matching value from each matching result. It then computes a weighted sum of these matching values using the per-feature weights recorded in the model, and whether the target mail is an abnormal mail is determined from the size of the weighted sum.
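A minimal Python sketch of this weighted-sum decision is given below. The weights, the threshold and the convention that a matching value of 1 marks a feature outside the normal range are hypothetical; the application leaves these to the trained model and the application requirements.

```python
def weighted_sum(matching_values, weights):
    """Sum of weight * matching value over all features."""
    return sum(weights[name] * value for name, value in matching_values.items())

# Assumed convention: 1 = feature value outside the normal range, 0 = inside it.
matching_values = {"send_minute": 0, "sender_ip_pair": 1}
# Assumed weights and threshold; in practice they would come from the trained model.
weights = {"send_minute": 0.4, "sender_ip_pair": 0.6}
THRESHOLD = 0.5

score = weighted_sum(matching_values, weights)
is_abnormal = score >= THRESHOLD
print(score, is_abnormal)
```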
Optionally, if step S13 trains on both normal mail feature values and abnormal mail feature values, the resulting mail recognition model records the feature value ranges of normal mails, the feature value ranges of abnormal mails, and the weight corresponding to each feature. Accordingly, when step S14 performs anomaly recognition on the target mail, the mail recognition model can match each feature value of the target mail against the normal mail feature value ranges and the abnormal mail feature value ranges respectively, obtaining two groups of matching values: one relative to normal mails and one relative to abnormal mails. A weighted sum is then computed for each group using the per-feature weights, and whether the target mail is an abnormal mail is determined by comparing the two weighted sums. The specific decision condition depends on the actual application requirements. For example, the target mail can be judged abnormal when the first weighted sum (relative to normal mails) is smaller than the second weighted sum (relative to abnormal mails), or when the ratio of the first weighted sum to the second weighted sum is smaller than a preset threshold.
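Both decision conditions named above can be sketched in Python as follows. Here a matching value of 1 is assumed to mean that the feature value falls inside the corresponding range (so a larger weighted sum means a closer match); this convention, the weights and the ratio threshold are illustrative assumptions.

```python
def weighted_sum(matching_values, weights):
    """Sum of weight * matching value over all features."""
    return sum(weights[name] * value for name, value in matching_values.items())

def is_abnormal(match_normal, match_abnormal, weights, ratio_threshold=None):
    """Compare the weighted sum against normal mails with the one against abnormal mails."""
    s_normal = weighted_sum(match_normal, weights)
    s_abnormal = weighted_sum(match_abnormal, weights)
    if ratio_threshold is None:
        # Condition 1: first weighted sum smaller than the second.
        return s_normal < s_abnormal
    # Condition 2: ratio of first to second weighted sum below a preset threshold.
    return s_abnormal > 0 and s_normal / s_abnormal < ratio_threshold

weights = {"send_minute": 0.4, "sender_ip_pair": 0.6}   # assumed weights
match_normal = {"send_minute": 1, "sender_ip_pair": 0}    # matches normal time range only
match_abnormal = {"send_minute": 0, "sender_ip_pair": 1}  # matches abnormal sender/IP pairs

by_comparison = is_abnormal(match_normal, match_abnormal, weights)
by_ratio = is_abnormal(match_normal, match_abnormal, weights, ratio_threshold=0.8)
print(by_comparison, by_ratio)
```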
It can be seen that the embodiments of the present application judge the legitimacy of the target mail comprehensively, based on the matching value and weight of each feature, which can improve the accuracy of mail recognition.
In one feasible implementation of the present application, the process of establishing the mail recognition model in step S13 can specifically include two steps:
S131: perform data training according to a first subset of the feature value set to obtain an initial model;
S132: verify the initial model according to a second subset of the feature value set to obtain the mail recognition model.
In the embodiments of the present application, the extracted feature value set is divided into at least two subsets: the first subset is used for data training to establish the initial model, and the second subset is used to verify (tune) the initial model. To avoid overfitting of the model, the two subsets should be disjoint. For example, 60% of the feature values in the feature value set can be placed in the first subset and the remaining 40% in the second subset.
After the model has been established from historical mail data, the embodiments of the present application further use historical mail data to verify and tune the model, which can improve the recognition accuracy of the mail recognition model.
Optionally, the first subset can include a first normal subset and a first abnormal subset, and the second subset can include a second normal subset and a second abnormal subset. The first normal subset and the second normal subset both belong to the normal mail sample set described above and are disjoint from each other; the first abnormal subset and the second abnormal subset both belong to the abnormal mail sample set described above and are disjoint from each other.
That is, in the embodiments of the present application, both the data training process and the initial model verification process use positive samples and negative samples at the same time. This avoids overfitting to a single kind of sample (only positive samples or only negative samples) and improves the recognition accuracy of the mail recognition model.
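The disjoint split described above, applied separately to the normal and abnormal sample sets so that both subsets contain both kinds of sample, can be sketched in Python as follows. The 60/40 ratio follows the example in the text; the shuffling, the fixed seed and the placeholder samples are assumptions for this illustration.

```python
import random

def split_disjoint(samples, train_percent=60, seed=0):
    """Shuffle a copy of the samples and cut it into two disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# Placeholder samples standing in for feature value records.
normal_set = [("normal", i) for i in range(10)]
abnormal_set = [("abnormal", i) for i in range(5)]

first_normal, second_normal = split_disjoint(normal_set)
first_abnormal, second_abnormal = split_disjoint(abnormal_set)

first_subset = first_normal + first_abnormal      # used for data training (S131)
second_subset = second_normal + second_abnormal   # used for verification (S132)
print(len(first_subset), len(second_subset))
```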
In one feasible implementation of the present application, the process of establishing the mail recognition model in step S13 (or the process of establishing the initial model in step S131) can specifically use the binary logistic regression algorithm.
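Because the abnormal mail recognition result in this embodiment has only two cases, namely that the target mail is an abnormal mail or that the target mail is a normal mail (not an abnormal mail), the model can be built with the binary logistic regression algorithm. The binary logistic regression algorithm can be expressed as the following conditional probability distributions:
$$P(Y=1\mid\chi)=\frac{\exp(\omega\cdot\chi+b)}{1+\exp(\omega\cdot\chi+b)}\qquad\text{(formula one)}$$
$$P(Y=0\mid\chi)=\frac{1}{1+\exp(\omega\cdot\chi+b)}\qquad\text{(formula two)}$$
where χ is the independent variable, i.e. the input data of the model, taking values in the n-dimensional real space $R^n$; Y is the dependent variable, i.e. the output data of the model, taking values in {0, 1}. The parameter ω is the weight vector, $\omega\in R^n$; the parameter b is the bias, $b\in R$ (R is the set of real numbers).
To simplify the computation, the weight vector ω and the input χ can be extended as follows:
$$\omega=(\omega^{(1)},\omega^{(2)},\ldots,\omega^{(n)},b)^T;\qquad \chi=(\chi^{(1)},\chi^{(2)},\ldots,\chi^{(n)},1)^T.$$
Formula one and formula two are then equivalent to:
$$P(Y=1\mid\chi)=\frac{\exp(\omega\cdot\chi)}{1+\exp(\omega\cdot\chi)}\qquad\text{(formula three)}$$
$$P(Y=0\mid\chi)=\frac{1}{1+\exp(\omega\cdot\chi)}\qquad\text{(formula four)}$$
Combining these with the log-odds of an event (the logit function), $\mathrm{logit}(p)=\log\frac{p}{1-p}$ (p is the probability that the event occurs), formula three and formula four can be converted into:
$$\log\frac{P(Y=1\mid\chi)}{1-P(Y=1\mid\chi)}=\omega\cdot\chi\qquad\text{(formula five)}$$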
According to formula five, in the binary logistic regression algorithm the log-odds of the output Y=1 is a linear function of the input χ. In other words, the larger the value of the linear function ω·χ, the larger the probability of output Y=1 (and the smaller the probability of output Y=0); conversely, the smaller the value of ω·χ, the smaller the probability of output Y=1 (and the larger the probability of output Y=0).
Therefore, for a given input χ, formula five can be used to judge the relative probabilities of the outputs Y=1 and Y=0. Applied to the embodiment of the present application, χ is the characteristic values extracted from the history mail data, and Y is the determination result for the target mail (for example, Y=1 may indicate that the target mail is an exception mail, and Y=0 may indicate that the target mail is a normal mail). Once the weight vector ω has been determined, any target mail can be taken as the input χ, and the binary logistic regression algorithm can determine the probability that the target mail is an exception mail. For example, the probability P(Y=1 | χ) that the target mail is an exception mail and the probability P(Y=0 | χ) that it is a normal mail can be calculated by formula three and formula four respectively; when P(Y=1 | χ) > P(Y=0 | χ), the output recognition result is that the target mail is an exception mail (Y=1). Alternatively, the log-odds that the target mail is an exception mail can be calculated by formula five; when the log-odds exceed a preset threshold, the output recognition result is that the target mail is an exception mail (Y=1).
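The two decision rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the example weights and the example characteristic values are all hypothetical, and the weight vector is assumed to be already trained and given in expanded form (bias last).

```python
import math

def log_odds(weights, features):
    """Formula five: the log-odds of Y=1 is the linear function w . x.

    `weights` is the expanded vector (w1, ..., wn, b); a constant 1 is
    appended to `features` so that the last weight acts as the bias.
    """
    expanded = list(features) + [1.0]
    return sum(w * x for w, x in zip(weights, expanded))

def prob_exception(weights, features):
    """Formula three: P(Y=1 | x) = exp(w . x) / (1 + exp(w . x))."""
    z = log_odds(weights, features)
    if z >= 0:  # algebraically equal form that avoids overflow for large z
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def is_exception(weights, features, log_odds_threshold=0.0):
    """Decide Y=1 when the log-odds exceed the preset threshold."""
    return log_odds(weights, features) > log_odds_threshold

weights = [2.0, -1.0, 0.5]  # hypothetical trained weights (last entry is b)
target = [1.0, 1.0]         # hypothetical characteristic values of a target mail
p1 = prob_exception(weights, target)
print(p1 > 1.0 - p1, is_exception(weights, target))
```

With a threshold of 0 the two rules agree, because P(Y=1 | χ) > P(Y=0 | χ) holds exactly when the log-odds are positive; a larger threshold makes the recognition stricter.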
It can be seen that, when the mail recognition model is established based on the binary logistic regression algorithm, the key is to determine the weight vector ω. In the embodiment of the present application, the purpose of training with sample data is precisely to determine the weight vector ω, and maximum likelihood estimation may specifically be used.
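The principle of determining the weight vector ω by training with maximum likelihood estimation is as follows:

For a given sample data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ (where n here denotes the number of samples), assume $P(Y=1 \mid \chi) = \pi(\chi)$ and $P(Y=0 \mid \chi) = 1 - \pi(\chi)$. The likelihood function is then:

$$\prod_{i=1}^{n} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}$$

The corresponding log-likelihood function is:

$$L(\omega) = \sum_{i=1}^{n} \left[ y_i \log \pi(x_i) + (1 - y_i) \log(1 - \pi(x_i)) \right] = \sum_{i=1}^{n} \left[ y_i (\omega \cdot x_i) - \log(1 + \exp(\omega \cdot x_i)) \right]$$

Maximizing L(ω) yields the estimate of ω.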
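The text does not state which numerical method is used to maximize L(ω). As one possible illustration, the sketch below uses plain gradient ascent, which is an assumption of this example; the learning rate, epoch count, helper names and toy samples are likewise hypothetical. Each sample's input is already in expanded form (constant 1 appended).

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def pi(w, x):
    """pi(x) = P(Y=1 | x) for expanded vectors w and x."""
    z = dot(w, x)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(w, samples):
    """L(w) = sum_i [ y_i * (w . x_i) - log(1 + exp(w . x_i)) ]."""
    total = 0.0
    for x, y in samples:
        z = dot(w, x)
        # log(1 + exp(z)) computed in an overflow-safe way
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit(samples, learning_rate=0.1, epochs=500):
    """Gradient ascent on L(w); dL/dw = sum_i (y_i - pi(x_i)) * x_i."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in samples:
            err = y - pi(w, x)
            for j in range(dim):
                grad[j] += err * x[j]
        # Step along the gradient, averaged over the samples for a stable step size.
        w = [wj + learning_rate * gj / len(samples) for wj, gj in zip(w, grad)]
    return w

# Toy data: one characteristic value plus the constant 1 of the expanded input.
samples = [([0.0, 1.0], 0), ([1.0, 1.0], 0), ([2.0, 1.0], 1), ([3.0, 1.0], 1)]
w = fit(samples)
print(log_likelihood(w, samples) > log_likelihood([0.0, 0.0], samples))
```

Because L(ω) is concave, a sufficiently small step size moves the estimate steadily toward the maximum; in practice a library optimizer would normally replace this hand-written loop.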
It can be seen from the above technical solution that the embodiment of the present application establishes the mail recognition model based on the binary logistic regression algorithm and determines the key parameter required by that algorithm through maximum likelihood estimation. This ensures that the final recognition result follows from the probability distribution rather than from manual judgment, avoiding the influence of human factors on the recognition result and thereby helping to ensure its accuracy.
Fig. 2 is a flowchart of another exception mail recognition method provided by an embodiment of the present application. Referring to Fig. 2, the method comprises the following steps:
S21: obtaining history mail data and the markup information corresponding to the history mail data;
S22: performing feature extraction on the history mail data to obtain the characteristic value set corresponding to the history mail data;
S23: establishing a mail recognition model according to the markup information and the characteristic value set;
S24: when a mail sending event is detected, recognizing the target mail using the mail recognition model to determine whether the target mail is an exception mail;
S25: verifying the recognition result, and, when verification finds that the recognition result is wrong, adding the mail data corresponding to the target mail to the history mail data and returning to step S21.
Compared with the embodiment shown in Fig. 1, the embodiment shown in Fig. 2 further verifies the recognition result after the target mail has been recognized. If verification finds that the recognition result is wrong, the mail recognition model has a defect (overfitting or underfitting). The mail data corresponding to the target mail is therefore also used as history mail data and the mail recognition model is re-established to eliminate the defect. As the number of model rebuilds increases, the mail recognition model becomes more complete and its recognition accuracy becomes higher.
Optionally, the verification of the recognition result in step S25 may specifically use a verification method based on human-computer interaction, i.e., the target mail and the recognition result are presented to a user, and the verification result entered by the user is received. As verification results based on human-computer interaction accumulate, the verification can be converted to fully automatic intelligent verification.
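The verify-and-rebuild loop of steps S24 and S25 can be sketched as follows. This is an illustration under stated assumptions: `build_model` is a deliberately simple stand-in (a threshold halfway between the class means of a single characteristic value) rather than the logistic regression model of the embodiment, and `verify` is a hypothetical callback standing in for the human-computer interaction step.

```python
def build_model(history):
    """Toy model builder (an assumption): threshold halfway between class means."""
    normal = [x[0] for x, y in history if y == 0]
    abnormal = [x[0] for x, y in history if y == 1]
    threshold = (sum(normal) / len(normal) + sum(abnormal) / len(abnormal)) / 2
    return lambda features: 1 if features[0] > threshold else 0

def recognize_with_feedback(model, history, features, verify):
    """Recognize one target mail (S24), verify the result, rebuild on error (S25)."""
    predicted = model(features)
    actual = verify(features, predicted)    # e.g. a label entered by a reviewer
    if predicted != actual:
        history.append((features, actual))  # add the target mail to the history data
        model = build_model(history)        # re-establish the recognition model
    return predicted, model

history = [([1.0], 0), ([2.0], 0), ([8.0], 1), ([9.0], 1)]
model = build_model(history)
# Hypothetical reviewer who marks this target mail as an exception mail (1).
predicted, model = recognize_with_feedback(model, history, [4.0], lambda f, p: 1)
print(predicted, len(history), model([4.5]))
```

Returning the possibly rebuilt model to the caller keeps the loop stateless, so the same function can be called once per mail sending event.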
Fig. 3 is a flowchart of another exception mail recognition method provided by an embodiment of the present application. Referring to Fig. 3, the method comprises the following steps:
S31: obtaining history mail data and the markup information corresponding to the history mail data;
S32: performing feature extraction on the history mail data to obtain the characteristic value set corresponding to the history mail data;
S33: performing a data cleansing operation on the characteristic value set, and/or converting the non-numeric characteristic values in the characteristic value set into numeric characteristic values;
The data cleansing mentioned above means performing operations such as de-duplication, filtering, completion and association on the large number of characteristic values in the characteristic value set, so as to improve the quality of the sample data.
The data types of the various characteristic values may differ: some are numeric, some are Boolean, and some are character strings. Because the training process of the embodiment of the present application is based on numerical values, non-numeric characteristic values cannot be used directly for training; they can therefore be converted into numeric characteristic values in advance through data conversion.
S34: establishing a mail recognition model according to the markup information and the characteristic value set;
S35: when a mail sending event is detected, recognizing the target mail using the mail recognition model to determine whether the target mail is an exception mail.
Through data cleansing and data conversion, the embodiment of the present application can improve the quality of the data in the characteristic value set, and thereby improve the accuracy of both model establishment and the recognition result.
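The cleansing and conversion described for step S33 can be sketched as follows. The concrete rules here are assumptions chosen for illustration: duplicates are dropped, records with no values at all are filtered out, missing values are completed with a default, Booleans become 0/1, and strings are mapped to integer codes per column. The function names and sample rows are hypothetical.

```python
def clean(rows):
    """De-duplicate records and filter out records that carry no values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if key in seen:                  # de-duplication
            continue
        seen.add(key)
        if all(v is None for v in row):  # filtering of empty records
            continue
        cleaned.append(list(row))
    return cleaned

def to_numeric(value, vocab):
    """Convert one characteristic value to a float."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):           # integer code in order of first appearance
        return float(vocab.setdefault(value, len(vocab)))
    raise TypeError(f"unsupported characteristic value: {value!r}")

def convert(rows, fill=0.0):
    """Complete missing values with `fill` and convert every column to numbers."""
    vocabs = {}                          # one string-to-code mapping per column
    result = []
    for row in rows:
        numeric = []
        for col, value in enumerate(row):
            if value is None:            # completion of missing values
                numeric.append(fill)
            else:
                numeric.append(to_numeric(value, vocabs.setdefault(col, {})))
        result.append(numeric)
    return result

rows = [[3, True, "pdf"], [3, True, "pdf"], [None, False, "zip"], [None, None, None]]
numeric_rows = convert(clean(rows))
print(numeric_rows)
```

Integer codes impose an artificial order on string values; a one-hot encoding would avoid that at the cost of more columns, and may suit a linear model better.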
Correspondingly, an embodiment of the present application also provides an exception mail recognition device. Referring to the structural diagram shown in Fig. 4, the exception mail recognition device at least includes:
Data acquisition unit 100, configured to obtain history mail data and the markup information corresponding to the history mail data; the markup information is used to mark the history mail data as normal mail data or exception mail data;
Data processing unit 200, configured to perform feature extraction on the history mail data to obtain the characteristic value set corresponding to the history mail data;
Modeling unit 300, configured to establish a mail recognition model according to the markup information and the characteristic value set;
Recognition unit 400, configured to, when a mail sending event is detected, recognize the target mail using the mail recognition model to determine whether the target mail is an exception mail.
It can be seen from the above technical solution that the embodiment of the present application obtains a mail recognition model by performing feature extraction and data training on a large amount of history mail data, and uses the mail recognition model to automatically recognize whether the target mail is abnormal. Compared with the prior art, the recognition process of the embodiment is performed entirely automatically by the model, and because the model is obtained by training on a large amount of history mail data, its recognition scope is more comprehensive. This avoids the situation where some exception mails cannot be recognized because of loopholes in the sensitive information inspection policy, thereby improving recognition accuracy.
Optionally, the data processing unit 200 may be specifically configured to:
extract, from the history mail data, the independent characteristic value corresponding to each feature respectively;
and/or extract, from the history mail data, the associated characteristic values corresponding to multiple interrelated features.
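The difference between independent and associated characteristic values can be illustrated with a small sketch. The particular features below (recipient count, attachment count, body length, and the combination of an external recipient with an attachment), the dictionary layout of a mail record and the `internal_domain` parameter are all hypothetical examples, not features named in this passage.

```python
def extract_features(mail, internal_domain="example.com"):
    """Return independent and associated characteristic values for one mail."""
    recipients = mail.get("recipients", [])
    attachments = mail.get("attachments", [])
    external = [r for r in recipients
                if not r.lower().endswith("@" + internal_domain)]

    # Independent characteristic values: each depends on a single feature.
    independent = {
        "recipient_count": len(recipients),
        "attachment_count": len(attachments),
        "body_length": len(mail.get("body", "")),
    }
    # Associated characteristic value: depends on several interrelated features,
    # here "sent outside the organization" together with "carries an attachment".
    associated = {
        "external_with_attachment": int(bool(external) and bool(attachments)),
    }
    return {**independent, **associated}

mail = {
    "recipients": ["a@example.com", "b@other.org"],
    "attachments": ["plan.pdf"],
    "body": "hello",
}
print(extract_features(mail))
```

An associated value of this kind lets a linear model react to a combination that neither feature signals on its own.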
Optionally, the modeling unit 300 may be specifically configured to:
classify the characteristic value set into a normal mail sample set or an exception mail sample set according to the markup information;
establish the mail recognition model according to the normal mail sample set and the exception mail sample set.
Optionally, the modeling unit 300 may be specifically configured to establish the mail recognition model by a binary logistic regression algorithm according to the markup information and the characteristic value set.
Optionally, the modeling unit 300 specifically includes:
an initial model establishing unit, configured to perform data training according to a first subset of the characteristic value set to obtain an initial model;
a model verification unit, configured to verify the initial model according to a second subset of the characteristic value set to obtain the mail recognition model.
After the model has been established based on history mail data, the embodiment of the present application further uses history mail data to verify and tune the model, which can improve the recognition accuracy of the mail recognition model.
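The train-then-verify arrangement can be sketched as follows: each labeled sample set is split into two disjoint subsets, the first is used for training and the second for verification. The split ratio, the fixed random seed, the accuracy measure and the toy threshold model are assumptions of this example rather than details given in the text.

```python
import random

def split_disjoint(samples, first_fraction=0.7, seed=0):
    """Split samples into two subsets that have no element in common."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

def train(first_normal, first_exception):
    """Toy initial model: threshold halfway between the two class means."""
    mean_n = sum(x[0] for x, _ in first_normal) / len(first_normal)
    mean_e = sum(x[0] for x, _ in first_exception) / len(first_exception)
    threshold = (mean_n + mean_e) / 2
    return lambda x: 1 if x[0] > threshold else 0

def accuracy(model, samples):
    """Fraction of verification samples that the model labels correctly."""
    return sum(1 for x, y in samples if model(x) == y) / len(samples)

# Hypothetical characteristic values: normal mails are small, exception mails large.
normal = [((float(i),), 0) for i in range(10)]
exception = [((float(i + 20),), 1) for i in range(10)]

first_n, second_n = split_disjoint(normal)
first_e, second_e = split_disjoint(exception)
initial_model = train(first_n, first_e)
print(accuracy(initial_model, second_n + second_e))
```

If the accuracy on the second subsets is too low, the initial model would be tuned or retrained before being used as the mail recognition model; how that tuning is done is not specified in this passage.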
Optionally, the recognition unit 400 may be specifically configured to:
match each characteristic value of the target mail against the characteristic value range recorded for the corresponding feature in the mail recognition model, to obtain a corresponding matching value;
compute a weighted sum of the matching values according to the weight recorded for each feature in the mail recognition model;
determine whether the target mail is an exception mail according to the result of the weighted sum.
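The three operations above can be sketched as follows. This passage does not say how a matching value is computed or how the weighted sum is turned into a decision, so the binary matching value (1 inside the recorded range, 0 outside), the decision threshold, the model layout and the feature names are all assumptions made for illustration.

```python
def match_value(value, low, high):
    """Assumed matching rule: 1.0 if the value lies in the recorded range."""
    return 1.0 if low <= value <= high else 0.0

def recognize(model, target, threshold):
    """Weighted sum of matching values, compared against a threshold."""
    score = sum(
        spec["weight"] * match_value(target[name], *spec["range"])
        for name, spec in model.items()
    )
    return score > threshold, score

# Hypothetical model: per-feature range associated with exception mails, and weight.
model = {
    "attachment_count": {"range": (3, 100), "weight": 0.6},
    "recipient_count": {"range": (20, 1000), "weight": 0.4},
}
target = {"attachment_count": 5, "recipient_count": 2}
is_exception_mail, score = recognize(model, target, threshold=0.5)
print(is_exception_mail, score)
```

A graded matching value (for example, the distance from the range boundary) could replace the binary one without changing the weighted-sum step.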
Referring to the structural diagram of the exception mail recognition device provided by another embodiment shown in Fig. 5, the exception mail recognition device further includes at least one of the following:
Verification unit 500, configured to verify the recognition result obtained by the recognition unit and, when verification finds that the recognition result is wrong, re-trigger the data acquisition unit, the data processing unit and the modeling unit, so that the mail data corresponding to the target mail is added to the history mail data and the mail recognition model is re-established.
Data optimization unit 600, configured to perform a data cleansing operation on the characteristic value set, and/or convert the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
After the target mail has been recognized, the embodiment of the present application further verifies the recognition result. If verification finds that the recognition result is wrong, the mail recognition model has a defect (overfitting or underfitting). The mail data corresponding to the target mail is therefore also used as history mail data and the mail recognition model is re-established to eliminate the defect. As the number of model rebuilds increases, the mail recognition model becomes more complete and its recognition accuracy becomes higher.
In addition, through data cleansing and data conversion, the embodiment of the present application can improve the quality of the data in the characteristic value set, and thereby improve the accuracy of both model establishment and the recognition result.
Regarding the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.
The embodiments in this specification are described in a progressive manner. For identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
It should be understood that the invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims (10)
1. An exception mail recognition method, characterized by comprising:
obtaining history mail data and the markup information corresponding to the history mail data, the markup information being used to mark the history mail data as normal mail data or exception mail data;
performing feature extraction on the history mail data to obtain the characteristic value set corresponding to the history mail data;
establishing a mail recognition model according to the markup information and the characteristic value set;
when a mail sending event is detected, recognizing the target mail using the mail recognition model to determine whether the target mail is an exception mail.
2. The method according to claim 1, characterized in that establishing a mail recognition model according to the markup information and the characteristic value set comprises:
classifying the characteristic value set into a normal mail sample set or an exception mail sample set according to the markup information, wherein the normal mail sample set includes at least a first normal subset and a second normal subset that have no intersection with each other, and the exception mail sample set includes at least a first exception subset and a second exception subset that have no intersection with each other;
performing data training according to the first normal subset and/or the first exception subset to obtain an initial model;
verifying the initial model according to the second normal subset and/or the second exception subset to obtain the mail recognition model.
3. The method according to claim 1, characterized in that establishing a mail recognition model according to the markup information and the characteristic value set comprises:
establishing the mail recognition model by a binary logistic regression algorithm according to the markup information and the characteristic value set.
4. The method according to claim 1, characterized in that recognizing the target mail using the mail recognition model comprises:
matching each characteristic value of the target mail against the characteristic value range recorded for the corresponding feature in the mail recognition model, to obtain a corresponding matching value;
computing a weighted sum of the matching values according to the weight recorded for each feature in the mail recognition model;
determining whether the target mail is an exception mail according to the result of the weighted sum.
5. The method according to claim 1, characterized in that performing feature extraction on the history mail data comprises:
extracting, from the history mail data, the independent characteristic value corresponding to each feature respectively;
and/or extracting, from the history mail data, the associated characteristic values corresponding to multiple interrelated features.
6. The method according to any one of claims 1 to 5, characterized by further comprising:
verifying the recognition result after the target mail has been recognized using the mail recognition model;
when verification finds that the recognition result is wrong, adding the mail data corresponding to the target mail to the history mail data, and re-establishing the mail recognition model.
7. The method according to any one of claims 1 to 5, characterized in that, before establishing a mail recognition model according to the markup information and the characteristic value set, the method further comprises:
performing a data cleansing operation on the characteristic value set;
and/or converting the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
8. An exception mail recognition device, characterized by comprising:
a data acquisition unit, configured to obtain history mail data and the markup information corresponding to the history mail data, the markup information being used to mark the history mail data as normal mail data or exception mail data;
a data processing unit, configured to perform feature extraction on the history mail data to obtain the characteristic value set corresponding to the history mail data;
a modeling unit, configured to establish a mail recognition model according to the markup information and the characteristic value set;
a recognition unit, configured to, when a mail sending event is detected, recognize the target mail using the mail recognition model to determine whether the target mail is an exception mail.
9. The device according to claim 8, characterized by further comprising:
a verification unit, configured to verify the recognition result obtained by the recognition unit and, when verification finds that the recognition result is wrong, re-trigger the data acquisition unit, the data processing unit and the modeling unit, so that the mail data corresponding to the target mail is added to the history mail data and the mail recognition model is re-established.
10. The device according to claim 8, characterized by further comprising:
a data cleansing unit, configured to perform a data cleansing operation on the characteristic value set;
and/or a data conversion unit, configured to convert the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611065946.1A CN107196844A (en) | 2016-11-28 | 2016-11-28 | Exception mail recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107196844A true CN107196844A (en) | 2017-09-22 |
Family
ID=59871650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611065946.1A Pending CN107196844A (en) | 2016-11-28 | 2016-11-28 | Exception mail recognition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107196844A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: Room 813, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080 Applicant after: BEIJING ULTRAPOWER INFORMATION SAFETY TECHNOLOGY Co.,Ltd. Address before: 100107 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building 6 storey block A room 604 Applicant before: BEIJING ULTRAPOWER INFORMATION SAFETY TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170922 |