CN107196844A - Exception mail recognition methods and device - Google Patents

Info

Publication number
CN107196844A
Authority
CN
China
Prior art keywords
mail
data
characteristic value
history
exception
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611065946.1A
Other languages
Chinese (zh)
Inventor
赵欢
王星亮
高峰
张建军
王秀娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ultrapower Information Safety Technology Co Ltd
Beijing Shenzhou Taiyue Software Co Ltd
Original Assignee
Beijing Ultrapower Information Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ultrapower Information Safety Technology Co Ltd
Priority to CN201611065946.1A
Publication of CN107196844A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the present application provide an abnormal mail recognition method and device. A mail recognition model is obtained by performing feature extraction and data training on a large amount of historical mail data, and the model automatically identifies whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely automatically by the model. Because the model is trained on a large amount of historical mail data, its recognition range is more comprehensive, which avoids the situation in which some abnormal mail goes unrecognized because of loopholes in a sensitive-information inspection policy, and thereby improves recognition accuracy.

Description

Abnormal mail recognition method and device
Technical field
The present application relates to the technical field of network management, and in particular to an abnormal mail recognition method and device.
Background
Because email is easy to use, fast, timely, low in cost, and able to carry a large amount of information, it has become an important means of communication for businesses and individuals. Emails usually contain much valuable information. To prevent confidential information or files containing sensitive information from leaking, many enterprises need to manage the emails that employees send or receive. There are many specific management approaches, for example: forbidding an email from being sent when attributes such as its recipient, sender, subject, attachment name, or mail size contain sensitive information; restricting employees to sending email only through the mailbox specified by the enterprise; and requiring that a mail be cc'd to a specific person (such as a department manager) before it can be sent successfully.
Prior-art approaches to email management all have certain limitations. For example, the approach described above of checking for sensitive information and blocking abnormal mail can hardly cover all sensitive information and all attributes that might contain it, so the inspection policy has loopholes and some abnormal emails go undetected. The approach described above of requiring a cc to a specific person requires that person to check manually whether each mail is legitimate, which increases workload.
Therefore, a scheme that can recognize abnormal mail automatically and accurately is urgently needed, so as to reduce the sending of abnormal mail without increasing the workload of the relevant personnel.
Summary of the invention
The present application provides an abnormal mail recognition method and a related device, so as to recognize abnormal mail automatically and accurately and reduce the sending of abnormal mail without increasing the workload of the relevant personnel.
To solve the above technical problem, the embodiments of the present application disclose the following technical solutions.
According to a first aspect of the embodiments of the present application, an abnormal mail recognition method is provided, including:
Obtaining historical mail data and label information corresponding to the historical mail data, where the label information marks the historical mail data as normal mail data or abnormal mail data;
Performing feature extraction on the historical mail data to obtain a feature value set corresponding to the historical mail data;
Establishing a mail recognition model according to the label information and the feature value set;
When a mail sending event is detected, identifying the target mail with the mail recognition model to determine whether the target mail is abnormal mail.
Optionally, performing feature extraction on the historical mail data includes:
Extracting, from the historical mail data, an independent feature value corresponding to each feature;
And/or extracting, from the historical mail data, an associated feature value corresponding to multiple interrelated features.
Optionally, establishing the mail recognition model according to the label information and the feature value set includes:
Dividing the feature value set into a normal mail sample set and an abnormal mail sample set according to the label information;
The normal mail sample set includes at least a first normal subset and a second normal subset that do not intersect, and the abnormal mail sample set includes at least a first abnormal subset and a second abnormal subset that do not intersect;
Performing data training according to the first normal subset and/or the first abnormal subset to obtain an initial model;
Verifying the initial model according to the second normal subset and/or the second abnormal subset to obtain the mail recognition model.
Optionally, establishing the mail recognition model according to the label information and the feature value set includes:
Performing data training on a first subset of the feature value set to obtain an initial model;
Verifying the initial model on a second subset of the feature value set to obtain the mail recognition model.
Optionally, establishing the mail recognition model according to the label information and the feature value set includes:
Establishing the mail recognition model from the label information and the feature value set by means of a binary logistic regression algorithm.
Optionally, identifying the target mail with the mail recognition model includes:
Matching each feature value of the target mail against the feature value range recorded in the mail recognition model for the corresponding feature, to obtain a corresponding match value;
Computing a weighted sum of the match values according to the weight recorded in the mail recognition model for each feature;
Determining whether the target mail is abnormal mail according to the weighted sum.
Optionally, the method further includes:
After the target mail is identified with the mail recognition model, verifying the recognition result;
When verification shows that the recognition result is wrong, adding the mail data corresponding to the target mail to the historical mail data and re-establishing the mail recognition model.
Optionally, before the mail recognition model is established according to the label information and the feature value set, the method further includes:
Performing a data cleaning operation on the feature value set;
And/or converting non-numeric feature values in the feature value set into numeric feature values.
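The two optional preprocessing steps can be illustrated with a minimal Python sketch. It assumes that cleaning means dropping records with missing values or exact duplicates, and that conversion means assigning an integer code to each distinct non-numeric value; the application does not specify either operation, and the function names `clean` and `encode_column` and the record layout are hypothetical.

```python
def clean(rows):
    """Drop rows with missing (None or empty) values and exact duplicate rows."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        has_missing = any(value in (None, "") for value in row.values())
        if has_missing or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def encode_column(rows, column):
    """Replace each distinct non-numeric value in a column with an integer code."""
    codes = {}
    for row in rows:
        codes.setdefault(row[column], len(codes))
    encoded = [dict(row, **{column: codes[row[column]]}) for row in rows]
    return encoded, codes


rows = [
    {"sender": "User1", "sender_mailbox": "mail.163.com"},
    {"sender": "User2", "sender_mailbox": None},            # missing value
    {"sender": "User1", "sender_mailbox": "mail.163.com"},  # duplicate
    {"sender": "User4", "sender_mailbox": "mail.qq.com"},
]
cleaned = clean(rows)
encoded, codes = encode_column(cleaned, "sender_mailbox")
```

Integer coding is only one possible conversion; it is used here because it keeps the sketch short, not because the application prescribes it.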
According to a second aspect of the embodiments of the present application, an abnormal mail recognition device is provided, including:
A data acquisition unit, configured to obtain historical mail data and label information corresponding to the historical mail data, where the label information marks the historical mail data as normal mail data or abnormal mail data;
A data processing unit, configured to perform feature extraction on the historical mail data to obtain a feature value set corresponding to the historical mail data;
A modeling unit, configured to establish a mail recognition model according to the label information and the feature value set;
A recognition unit, configured to identify the target mail with the mail recognition model when a mail sending event is detected, to determine whether the target mail is abnormal mail.
Optionally, the device further includes:
A verification unit, configured to verify the recognition result obtained by the recognition unit and, when verification shows that the recognition result is wrong, to re-trigger the data acquisition unit, the data processing unit, and the modeling unit, so that the mail data corresponding to the target mail is added to the historical mail data and the mail recognition model is re-established.
Optionally, the device further includes:
A data cleaning unit, configured to perform a data cleaning operation on the feature value set;
And/or a data conversion unit, configured to convert non-numeric feature values in the feature value set into numeric feature values.
As can be seen from the above technical solutions, the embodiments of the present application obtain a mail recognition model by performing feature extraction and data training on a large amount of historical mail data, and use the model to identify automatically whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely automatically by the model. Because the model is trained on a large amount of historical mail data, its recognition range is more comprehensive, which avoids the situation in which some abnormal mail goes unrecognized because of loopholes in a sensitive-information inspection policy, and thereby improves recognition accuracy.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an abnormal mail recognition method provided by an embodiment of the present application;
Fig. 2 is a flowchart of another abnormal mail recognition method provided by an embodiment of the present application;
Fig. 3 is a flowchart of yet another abnormal mail recognition method provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an abnormal mail recognition device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of another abnormal mail recognition device provided by an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, and to make the above objectives, features, and advantages of the embodiments clearer, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an abnormal mail recognition method provided by an embodiment of the present application.
Referring to Fig. 1, the abnormal mail recognition method includes the following steps:
S11: Obtain historical mail data and label information corresponding to the historical mail data.
The label information marks the historical mail data as normal mail data or abnormal mail data.
In the embodiments of the present application, the historical mail data can be extracted from an existing mail management database or from clients, and can be all historical mail or the historical mail from a period of time (such as 2 years or 1 year). To ensure the accuracy of abnormal mail recognition, as much historical mail data as possible should be obtained. The historical mail data can include various feature data related to each historical mail, such as the sending time, sender, sending IP address, and sender mailbox web address shown in Table 1 below, as well as the recipient mailbox web address, the content sent, the attachments sent, and information about the client that sent the mail.
Table 1: Historical mail data statistics
Sending time      Sender   Sending IP address   Sender mailbox web address
2016/6/1 8:30     User1    192.168.10.1         mail.163.com
2016/6/1 8:33     User2    192.168.10.1         mail.163.com
2016/6/2 8:35     User3    192.168.10.1         mail.126.com
2016/6/2 8:32     User4    192.168.10.1         mail.qq.com
2016/6/3 8:32     User5    192.168.10.2         mail.qq.com
2016/6/1 9:32     User6    192.168.10.2         mail.qq.com
In the embodiments of the present application, the label information can indicate, for each kind of data related to a historical mail, whether that data is abnormal data, or it can indicate whether a historical mail as a whole is abnormal mail.
Optionally, the label information can be obtained by automatically labeling the historical mail data according to preset rules. For example, the sender mailbox web address of each historical mail can be labeled by traversing the historical mail data and judging whether each sender mailbox web address is in a preset blacklist: first label information (indicating that the corresponding data is abnormal data) is added to mailbox web addresses in the preset blacklist, and mailbox web addresses not in the preset blacklist are either left unlabeled or given second label information (indicating that the corresponding data is normal data).
Optionally, the label information can also be obtained from user input. For example, the user can be asked to judge whether the body content of a historical mail is legitimate (for example, whether it is free of sensitive information); the judgment result entered by the user is received, and the corresponding label information is generated from it. If the judgment result indicates that the body content contains sensitive information, the first label information described above is added; if it indicates that the body content contains no sensitive information, the mail is either left unlabeled or given the second label information described above.
S12: Perform feature extraction on the historical mail data to obtain the feature value set corresponding to the historical mail data.
S13: Establish a mail recognition model according to the label information and the feature value set.
Different application scenarios have different application requirements and different abnormal-mail characteristics, so any of the various features of an email may serve as a basis for judging whether it is abnormal mail, including the sending time, sender, sending IP address, sender mailbox web address, recipient mailbox web address, content sent, attachments sent, and information about the client that sent the mail. Therefore, in the embodiments of the present application, a feature extraction operation is performed on a large amount of historical mail data to obtain the feature values of the various features of a large number of historical mails. These feature values make up the feature value set, that is, the sample data set, from which the mail recognition model is obtained through data training, as described in step S13.
Optionally, the data training process described in the embodiments of the present application is specifically a supervised learning process that uses the label information as labels.
S14: When a mail sending event is detected, identify the target mail with the mail recognition model to determine whether the target mail is abnormal mail.
In one feasible embodiment of the present application, step S13 can train on normal mail feature values only to obtain the mail recognition model. In this process, the label information can be used to remove abnormal mail feature values from the feature value set.
In another feasible embodiment of the present application, step S13 can also train on normal mail feature values and abnormal mail feature values at the same time. In this process, the feature value set can first be divided into a normal mail sample set and an abnormal mail sample set according to the label information (feature values whose label information indicates abnormal data are placed in the abnormal mail sample set, and feature values whose label information indicates normal data are placed in the normal mail sample set), and the two sets are then used for training as positive samples and negative samples respectively.
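Dividing the feature value set by label can be sketched in Python as follows. The label encoding, the record layout, and the function name `partition_by_label` are assumptions made for illustration only.

```python
ABNORMAL, NORMAL = 1, 0  # assumed label encoding


def partition_by_label(feature_values, labels):
    """Split feature-value records into (normal, abnormal) sample sets by label."""
    normal_set, abnormal_set = [], []
    for record, label in zip(feature_values, labels):
        target = abnormal_set if label == ABNORMAL else normal_set
        target.append(record)
    return normal_set, abnormal_set


feature_values = [
    {"sender": "User1", "send_ip": "192.168.10.1"},
    {"sender": "User2", "send_ip": "192.168.10.1"},
    {"sender": "User5", "send_ip": "192.168.10.2"},
]
labels = [NORMAL, ABNORMAL, NORMAL]
normal_set, abnormal_set = partition_by_label(feature_values, labels)
```

For the embodiment that trains on normal mail only, the same partition applies and `abnormal_set` is simply discarded.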
As can be seen from the above technical solutions, the embodiments of the present application obtain a mail recognition model by performing feature extraction and data training on a large amount of historical mail data, and use the model to identify automatically whether a target mail is abnormal. Compared with the prior art, the recognition process is performed entirely automatically by the model. Because the model is trained on a large amount of historical mail data, its recognition range is more comprehensive, which avoids the situation in which some abnormal mail goes unrecognized because of loopholes in a sensitive-information inspection policy, and thereby improves recognition accuracy.
Optionally, in the embodiments of the present application, performing feature extraction on the historical mail data in step S12 can specifically be: extracting, from the historical mail data, an independent feature value corresponding to each feature; it can also be: extracting, from the historical mail data, an associated feature value corresponding to multiple interrelated features.
For example, features such as the sending time, sender mailbox web address, and recipient mailbox web address described earlier are relatively independent: whether such a feature is normal has no necessary connection with other features. Their feature values can therefore be extracted individually and recorded as independent feature values.
For an associated feature made up of multiple features, a mail of the following kind can occur: each constituent feature is normal on its own, but the correspondence among them is unusual. Take the associated feature formed by the sender and the sending IP address as an example. Where the application requirement forbids sending mail on someone else's behalf, suppose sender B sends an email from B's own IP address using sender A's mailbox address. The sender is normal and the sending IP address is also normal, but the correspondence between the two is abnormal. If only the normality of the sender and of the sending IP address were checked individually, and the correspondence between them ignored, this mail sent on another's behalf could not be correctly recognized. Therefore, in addition to independent feature values, the embodiments of the present application also extract the associated feature values corresponding to associated features, which improves the abnormal mail recognition accuracy and meets more complex mail management requirements.
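The difference between independent and associated feature values can be sketched in Python as follows. The field names and the function name `extract_features` are hypothetical; the only point illustrated is that the sender and sending IP address are kept together as one pair so that their correspondence can be checked later.

```python
def extract_features(mail):
    """Return independent feature values plus one associated (combined) feature value."""
    return {
        # Independent features: each one is judged on its own.
        "send_time": mail["send_time"],
        "sender_mailbox": mail["sender_mailbox"],
        "recipient_mailbox": mail["recipient_mailbox"],
        # Associated feature: the (sender, sending IP) pair is judged as a unit.
        "sender_ip": (mail["sender"], mail["send_ip"]),
    }


mail = {
    "send_time": "2016/6/1 8:30",
    "sender": "User1",
    "send_ip": "192.168.10.1",
    "sender_mailbox": "mail.163.com",
    "recipient_mailbox": "mail.qq.com",
}
feature_values = extract_features(mail)
```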
Based on the independent feature values and associated feature values described above, step S13 can specifically collect all the feature values corresponding to each feature (including the independent features and associated features) and determine the feature value range corresponding to that feature, which is then used for matching against the corresponding feature value of the target mail.
For example, aggregating all the extracted sending times gives an earliest time of 8:30 and a latest time of 17:30, so the feature value range corresponding to the sending time feature is 8:30 to 17:30. When a target mail is checked for abnormality, a sending time outside the range 8:30 to 17:30 indicates that the sending time is abnormal and that the target mail may be abnormal mail. A match value indicating that this feature is abnormal can therefore be generated from the matching result, which raises the probability that the target mail is identified as abnormal mail.
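The sending-time example can be sketched in Python as follows. The match-value convention (0 for a value inside the learned range, 1 for an abnormal value) is an assumption, as are the function names; the application only says that a match value representing the abnormality is generated.

```python
from datetime import time


def learn_time_range(send_times):
    """Feature value range of the sending time: earliest to latest observed time."""
    return min(send_times), max(send_times)


def match_time(send_time, time_range):
    """Match value: 0 if the time lies inside the learned range, 1 if it is abnormal."""
    earliest, latest = time_range
    return 0 if earliest <= send_time <= latest else 1


normal_times = [time(8, 30), time(9, 32), time(17, 30)]
time_range = learn_time_range(normal_times)
in_range = match_time(time(12, 0), time_range)
out_of_range = match_time(time(23, 15), time_range)
```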
As another example, aggregating the extracted associated feature values of sender and sending IP address gives the set of sender-to-sending-IP-address correspondences found in normal mail. When a target mail is checked for abnormality, it is determined whether the correspondence between its sender and sending IP address (for example, "User1": "192.168.10.2") is in that set. If it is not, the sender or the sending IP address of the target mail is abnormal and the target mail may be abnormal mail, so a match value indicating that this feature is abnormal can likewise be generated from the matching result, which raises the probability that the target mail is identified as abnormal mail.
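The correspondence check can be sketched in Python with a set of pairs. The function names and the 0/1 match-value convention are assumptions; the sample pairs follow Table 1.

```python
def learn_sender_ip_pairs(normal_mails):
    """Collect the (sender, sending IP) pairs observed in normal mail."""
    return {(mail["sender"], mail["send_ip"]) for mail in normal_mails}


def match_sender_ip(mail, known_pairs):
    """Match value: 0 if the pair was seen in normal mail, 1 if it is abnormal."""
    return 0 if (mail["sender"], mail["send_ip"]) in known_pairs else 1


normal_mails = [
    {"sender": "User1", "send_ip": "192.168.10.1"},
    {"sender": "User5", "send_ip": "192.168.10.2"},
]
known_pairs = learn_sender_ip_pairs(normal_mails)

# User1 and 192.168.10.2 are each known, but never appeared together.
target = {"sender": "User1", "send_ip": "192.168.10.2"}
match_value = match_sender_ip(target, known_pairs)
```

Checking the sender and the IP address separately would pass this target mail, which is exactly the case the associated feature is meant to catch.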
In one feasible embodiment of the present application, identifying the target mail with the mail recognition model in step S14 can specifically include the following steps:
Matching each feature value of the target mail against the feature value range recorded in the mail recognition model for the corresponding feature, to obtain a corresponding match value;
Computing a weighted sum of the match values according to the weight recorded in the mail recognition model for each feature;
Determining whether the target mail is abnormal mail according to the weighted sum.
Optionally, if step S13 trains on normal mail feature values only, the resulting mail recognition model can record the value range of every feature that normal mail has, together with the weight of each feature's influence on mail normality. Accordingly, when a new mail (the target mail) is checked for abnormality in step S14, the mail recognition model can match each feature value of the target mail against the corresponding feature value range recorded in the model and generate a match value from the matching result. It then computes a weighted sum of the match values of the features using the weights recorded in the model, and whether the target mail is abnormal mail can be determined from the size of the weighted sum.
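The weighted-sum decision for a model trained on normal mail only can be sketched in Python as follows. The weights, the threshold, the 0/1 match-value convention (1 meaning the feature looks abnormal), and the rule "abnormal when the sum exceeds the threshold" are all assumptions; the application says only that the decision depends on the size of the weighted sum.

```python
def weighted_score(match_values, weights):
    """Weighted sum of per-feature match values (1 = the feature looks abnormal)."""
    return sum(weights[name] * value for name, value in match_values.items())


def is_abnormal(match_values, weights, threshold):
    """Flag the mail when the weighted sum exceeds the (assumed) threshold."""
    return weighted_score(match_values, weights) > threshold


# Hypothetical weights recorded in the model for three features.
weights = {"send_time": 0.2, "sender_ip": 0.5, "sender_mailbox": 0.3}
# Match values of one target mail: normal sending time, abnormal pair and mailbox.
match_values = {"send_time": 0, "sender_ip": 1, "sender_mailbox": 1}

score = weighted_score(match_values, weights)
flagged = is_abnormal(match_values, weights, threshold=0.5)
```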
Optionally, if step S13 trains on normal mail feature values and abnormal mail feature values at the same time, the resulting mail recognition model records the feature value ranges of normal mail, the feature value ranges of abnormal mail, and the weight of each feature. Accordingly, when the target mail is checked for abnormality in step S14, the model can match each feature value of the target mail against both the normal-mail ranges and the abnormal-mail ranges, which yields two groups of match values: one relative to normal mail and one relative to abnormal mail. A weighted sum is then computed for each group using the feature weights, and whether the target mail is abnormal mail can be determined by comparing the two weighted sums. The specific decision condition depends on the practical application requirements. For example, the target mail can be judged abnormal when the first weighted sum (relative to normal mail) is smaller than the second weighted sum (relative to abnormal mail), or when the ratio of the first weighted sum to the second weighted sum is below a preset threshold.
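Both decision conditions can be sketched in Python as follows. Here a match value of 1 is assumed to mean that the feature value falls inside the corresponding group's range; the weights, the threshold, the handling of a zero denominator, and the function names are hypothetical.

```python
def weighted_sum(match_values, weights):
    """Weighted sum of match values (1 = the value falls inside that group's range)."""
    return sum(weights[name] * match_values[name] for name in weights)


def classify_two_sided(normal_matches, abnormal_matches, weights, ratio_threshold=None):
    """Return True when the target mail is judged abnormal."""
    s_normal = weighted_sum(normal_matches, weights)
    s_abnormal = weighted_sum(abnormal_matches, weights)
    if ratio_threshold is None:
        # Condition 1: the normal-side sum is smaller than the abnormal-side sum.
        return s_normal < s_abnormal
    if s_abnormal == 0:
        # Nothing matched the abnormal ranges; treated here as not abnormal.
        return False
    # Condition 2: the ratio of the two sums is below a preset threshold.
    return s_normal / s_abnormal < ratio_threshold


weights = {"send_time": 0.2, "sender_ip": 0.5, "sender_mailbox": 0.3}
normal_matches = {"send_time": 1, "sender_ip": 0, "sender_mailbox": 0}
abnormal_matches = {"send_time": 0, "sender_ip": 1, "sender_mailbox": 1}

by_comparison = classify_two_sided(normal_matches, abnormal_matches, weights)
by_ratio = classify_two_sided(normal_matches, abnormal_matches, weights, ratio_threshold=0.5)
```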
It can be seen that the embodiments of the present application judge the legitimacy of the target mail comprehensively, based on the match value and weight of each feature, which can improve the accuracy of mail recognition.
In one feasible embodiment of the present application, the process of establishing the mail recognition model in step S13 can specifically include two steps:
S131: Perform data training on a first subset of the feature value set to obtain an initial model;
S132: Verify the initial model on a second subset of the feature value set to obtain the mail recognition model.
In the embodiments of the present application, the extracted feature value set is divided into at least two subsets. The first subset is used for data training to establish the initial model, and the second subset is used to verify (tune) the initial model. To avoid overfitting the model, the two subsets should not intersect. For example, 60% of the feature values in the feature value set can be placed in the first subset and the remaining 40% in the second subset.
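The 60/40 division into two disjoint subsets can be sketched in Python as follows. Shuffling before the cut and the fixed seed are assumptions added so that the split is random yet reproducible; the function name `split_feature_set` is hypothetical.

```python
import random


def split_feature_set(samples, train_fraction=0.6, seed=0):
    """Shuffle the samples and cut them into two disjoint subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))  # stand-ins for ten feature-value records
first_subset, second_subset = split_feature_set(samples)
```

Because both subsets are slices of the same shuffled list, no record can appear in both, which is the no-intersection property the text requires.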
After the model has been established from historical mail data, the embodiments of the present application further use historical mail data to verify and tune the model, which can improve the recognition accuracy of the mail recognition model.
Optionally, the first subset can include a first normal subset and a first abnormal subset, and the second subset can include a second normal subset and a second abnormal subset. The first normal subset and the second normal subset both belong to the normal mail sample set described earlier and do not intersect; the first abnormal subset and the second abnormal subset both belong to the abnormal mail sample set described earlier and do not intersect.
That is, in the embodiments of the present application, both the data training process and the initial model verification process use positive samples and negative samples at the same time, which avoids overfitting to a single kind of sample (positive only or negative only) and improves the recognition accuracy of the mail recognition model.
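Splitting the normal and abnormal sample sets separately, so that both the training and the verification subsets contain positive and negative samples, can be sketched in Python as follows. The 60% fraction reuses the earlier example; the function names, the seed, and the sample identifiers are hypothetical.

```python
import random


def split(samples, fraction, rng):
    """Shuffle one sample set and cut it into two disjoint parts."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def stratified_split(normal_samples, abnormal_samples, fraction=0.6, seed=0):
    """Split each class separately so both subsets contain both kinds of sample."""
    rng = random.Random(seed)
    normal_1, normal_2 = split(normal_samples, fraction, rng)
    abnormal_1, abnormal_2 = split(abnormal_samples, fraction, rng)
    return (normal_1, abnormal_1), (normal_2, abnormal_2)


normal_samples = [f"n{i}" for i in range(10)]
abnormal_samples = [f"a{i}" for i in range(5)]
(normal_1, abnormal_1), (normal_2, abnormal_2) = stratified_split(
    normal_samples, abnormal_samples
)
```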
In one feasible embodiment of the present application, the process of establishing the mail recognition model in step S13 (or the process of establishing the initial model in step S131) can specifically use a binary logistic regression algorithm.
Because the abnormal mail recognition result in this embodiment has only two possible outcomes, namely that the target mail is abnormal mail or that it is normal mail (not abnormal mail), a binary logistic regression algorithm can be used for modeling. The binary logistic regression model can be expressed as the following conditional probability distributions:

$$P(Y=1 \mid \chi) = \frac{\exp(\omega \cdot \chi + b)}{1 + \exp(\omega \cdot \chi + b)} \tag{1}$$

$$P(Y=0 \mid \chi) = \frac{1}{1 + \exp(\omega \cdot \chi + b)} \tag{2}$$

where χ is the independent variable, that is, the input data of the model, with values in the n-dimensional real space R^n; Y is the dependent variable, that is, the output data of the model, with values in {0, 1}. The parameter ω is the weight vector, ω ∈ R^n; the parameter b is the bias, b ∈ R (R is the set of real numbers).
To simplify the derivation, the weight vector ω and the input χ can be augmented as follows:

$$\omega = (\omega^{(1)}, \omega^{(2)}, \ldots, \omega^{(n)}, b)^{T}; \quad \chi = (\chi^{(1)}, \chi^{(2)}, \ldots, \chi^{(n)}, 1)^{T}$$

Formula 1 and Formula 2 are then equivalent to:

$$P(Y=1 \mid \chi) = \frac{\exp(\omega \cdot \chi)}{1 + \exp(\omega \cdot \chi)} \tag{3}$$

$$P(Y=0 \mid \chi) = \frac{1}{1 + \exp(\omega \cdot \chi)} \tag{4}$$

Combining these with the log-odds (logit) function of an event, $\mathrm{logit}(p) = \log\frac{p}{1-p}$, where p is the probability that the event occurs, Formula 3 and Formula 4 can be converted to:

$$\log\frac{P(Y=1 \mid \chi)}{1 - P(Y=1 \mid \chi)} = \omega \cdot \chi \tag{5}$$
As Formula Five shows, in the binary logistic regression algorithm the log-odds of the output Y=1 is a linear function of the input χ. In other words, the larger the value of the linear function ω·χ, the larger the probability of the output Y=1 (and the smaller the probability of Y=0); conversely, the smaller the value of ω·χ, the smaller the probability of Y=1 (and the larger the probability of Y=0).
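As a worked check of this relationship, the short derivation below assumes the standard logistic form for the two probabilities, with ω and χ in augmented form (bias folded into ω, a constant 1 appended to χ):

```latex
\begin{aligned}
\frac{P(Y=1\mid\chi)}{P(Y=0\mid\chi)}
  &= \frac{\exp(\omega\cdot\chi)}{1+\exp(\omega\cdot\chi)}
     \Big/ \frac{1}{1+\exp(\omega\cdot\chi)}
   = \exp(\omega\cdot\chi), \\
\log\frac{P(Y=1\mid\chi)}{1-P(Y=1\mid\chi)}
  &= \log\exp(\omega\cdot\chi)
   = \omega\cdot\chi .
\end{aligned}
```

Because the exponential function is increasing, P(Y=1|χ) > P(Y=0|χ) holds exactly when ω·χ > 0, so comparing the two probabilities is the same as comparing the log-odds with zero.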
Therefore, for a given input χ, Formula Five can be used to judge the relative probabilities of the outputs Y=1 and Y=0. Applied to this embodiment, χ is the characteristic values extracted from the historical mail data and Y is the determination result for the target mail (for example, Y=1 may indicate that the target mail is an abnormal mail and Y=0 that it is a normal mail). Once the weight vector ω has been determined, any target mail can be taken as the input χ, and the binary logistic regression algorithm can be used to determine the probability that the target mail is an abnormal mail. For example, the probability P(Y=1|χ) that the target mail is an abnormal mail and the probability P(Y=0|χ) that it is a normal mail can be calculated from Formula Three and Formula Four respectively; when P(Y=1|χ) > P(Y=0|χ), the output recognition result is that the target mail is an abnormal mail (Y=1). Alternatively, the log-odds that the target mail is an abnormal mail can be calculated from Formula Five; when the log-odds exceed a preset threshold, the output recognition result is that the target mail is an abnormal mail (Y=1).
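The two decision rules just described can be sketched as follows. The weights, bias, feature values and the default threshold of 0 are illustrative assumptions; with a threshold of 0 the log-odds rule coincides with comparing P(Y=1|χ) against P(Y=0|χ).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def recognize(weights, bias, features, log_odds_threshold=0.0):
    """Return (is_abnormal, P(Y=1|x), P(Y=0|x)) for one target mail."""
    # Log-odds of Y=1 is the linear function w . x + b.
    log_odds = sum(w * x for w, x in zip(weights, features)) + bias
    p_abnormal = sigmoid(log_odds)
    p_normal = 1.0 - p_abnormal
    return log_odds > log_odds_threshold, p_abnormal, p_normal

# Hypothetical trained parameters and one target mail's characteristic values.
is_abnormal, p_abnormal, p_normal = recognize([2.0, -1.0], -0.5, [1.0, 0.5])
```

Raising `log_odds_threshold` above 0 makes the rule more conservative, flagging a mail only when the model is more confident that it is abnormal.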
It can be seen that the key to establishing the mail recognition model based on the binary logistic regression algorithm is determining the weight vector ω. In this embodiment, the purpose of training on the sample data is precisely to determine ω, and the maximum likelihood estimation method may specifically be used.
The principle of determining the weight vector ω by maximum likelihood estimation is as follows:
For a given sample data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where N is the number of samples, assume P(Y=1|χ) = π(χ) and P(Y=0|χ) = 1 − π(χ). The likelihood function is then:
$$\prod_{i=1}^{N} [\pi(x_i)]^{y_i} \, [1 - \pi(x_i)]^{1 - y_i}$$
The corresponding log-likelihood function is:
$$L(\omega) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (1 - y_i) \log\bigl(1 - \pi(x_i)\bigr) \right] = \sum_{i=1}^{N} \left[ y_i (\omega \cdot x_i) - \log\bigl(1 + \exp(\omega \cdot x_i)\bigr) \right]$$
Maximizing L(ω) yields the estimate of ω.
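The application does not say how the maximization is carried out. One common choice is gradient ascent on the log-likelihood, sketched below under that assumption; the learning rate, epoch count and toy data are likewise illustrative.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def log_likelihood(w, samples, labels):
    """L(w) = sum_i [y_i (w . x_i) - log(1 + exp(w . x_i))], x_i augmented with 1."""
    total = 0.0
    for x, y in zip(samples, labels):
        z = dot(w, list(x) + [1.0])
        # log(1 + exp(z)) computed without overflow
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit(samples, labels, learning_rate=0.1, epochs=2000):
    """Gradient ascent on L(w); the last entry of w plays the role of the bias b."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(samples, labels):
            x_aug = list(x) + [1.0]
            residual = y - sigmoid(dot(w, x_aug))  # dL/dw = sum_i (y_i - pi(x_i)) x_i
            for j, xj in enumerate(x_aug):
                grad[j] += residual * xj
        w = [wi + learning_rate * gj / len(samples) for wi, gj in zip(w, grad)]
    return w

# Toy one-feature data set (not linearly separable, so the estimate is finite).
samples = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 1, 0, 1, 1]
w_hat = fit(samples, labels)
```

The gradient used here follows from differentiating the second form of L(ω) term by term; in practice a library solver would normally replace this hand-written loop.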
As can be seen from the above technical solution, this embodiment establishes the mail recognition model based on the binary logistic regression algorithm and determines the key parameters required by that algorithm through maximum likelihood estimation. This ensures that the final recognition result follows the underlying probability distribution and avoids the influence of human factors on the recognition result, thereby helping to ensure its accuracy.
Fig. 2 is a flowchart of another abnormal mail recognition method provided by an embodiment of the present application. Referring to Fig. 2, the method comprises the following steps:
S21: obtain historical mail data and the annotation information corresponding to the historical mail data;
S22: perform feature extraction on the historical mail data to obtain the characteristic value set corresponding to the historical mail data;
S23: establish a mail recognition model according to the annotation information and the characteristic value set;
S24: when a mail sending event is detected, recognize the target mail using the mail recognition model to determine whether the target mail is an abnormal mail;
S25: verify the recognition result; when verification shows that the recognition result is wrong, add the mail data corresponding to the target mail to the historical mail data and return to step S21.
Compared with the embodiment shown in Fig. 1, the embodiment shown in Fig. 2 further verifies the recognition result after completing the recognition step for the target mail. If verification shows that the recognition result is wrong, the mail recognition model is defective (overfitted or underfitted). The mail data corresponding to the target mail is therefore also used as historical mail data, and the mail recognition model is re-established to eliminate the defect. As the number of model rebuilds increases, the mail recognition model becomes more complete and its recognition accuracy becomes higher.
Optionally, the verification of the recognition result in step S25 may specifically use a verification method based on human-computer interaction: the target mail and the recognition result are displayed to the user, and the verification result input by the user is received. As verification results based on human-computer interaction accumulate, the verification can be converted to fully automatic intelligent verification.
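The verify-and-rebuild loop of steps S21 to S25 can be sketched as follows. To keep the example short, the model is a deliberately simple stand-in (a midpoint threshold on a single numeric characteristic value) rather than the logistic regression model; `train_midpoint`, `recognize_with_feedback` and the data are hypothetical, and `verified_label` stands for the result entered by the user during verification.

```python
def train_midpoint(history):
    """Toy model builder: threshold halfway between the two class means."""
    normal = [x for x, y in history if y == 0]
    abnormal = [x for x, y in history if y == 1]
    threshold = (sum(normal) / len(normal) + sum(abnormal) / len(abnormal)) / 2
    return lambda x: 1 if x > threshold else 0

def recognize_with_feedback(history, target, verified_label, train=train_midpoint):
    model = train(history)                        # S21 to S23: build the model
    predicted = model(target)                     # S24: recognize the target mail
    if predicted != verified_label:               # S25: verification found an error
        history.append((target, verified_label))  # target joins the historical data
        model = train(history)                    # rebuild the model
    return predicted, model

# (characteristic value, label) pairs; label 1 marks abnormal mail.
history = [(1.0, 0), (2.0, 0), (9.0, 1), (10.0, 1)]
predicted, model = recognize_with_feedback(history, 5.0, verified_label=1)
```

In this toy run the first model misclassifies the target, the corrected sample is appended to the history, and the rebuilt model then classifies the same target as abnormal.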
Fig. 3 is a flowchart of another abnormal mail recognition method provided by an embodiment of the present application. Referring to Fig. 3, the method comprises the following steps:
S31: obtain historical mail data and the annotation information corresponding to the historical mail data;
S32: perform feature extraction on the historical mail data to obtain the characteristic value set corresponding to the historical mail data;
S33: perform a data cleaning operation on the characteristic value set, and/or convert the non-numeric characteristic values in the characteristic value set into numeric characteristic values;
Data cleaning here means performing operations such as de-duplication, filtering, completion and association on the large number of characteristic values in the characteristic value set, so as to improve the quality of the sample data.
The data types of the various characteristic values may differ: some are numeric, some are Boolean and some are character strings. Because the training process of this embodiment is based on numeric values, non-numeric characteristic values cannot be used for training directly; they can therefore be converted into numeric characteristic values in advance through data conversion.
S34: establish a mail recognition model according to the annotation information and the characteristic value set;
S35: when a mail sending event is detected, recognize the target mail using the mail recognition model to determine whether the target mail is an abnormal mail.
Through data cleaning and data conversion, this embodiment can improve the quality of the data in the characteristic value set, and thus the accuracy of both model establishment and the recognition result.
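Step S33 can be illustrated with a small sketch. The field names (`attachment_count`, `has_external_recipient`, `send_period`), the default values and the one-hot encoding of the string field are all assumptions chosen for illustration; the application does not prescribe particular features or a particular conversion scheme.

```python
def clean(records, defaults):
    """De-duplicate records and fill missing values from `defaults` (completion)."""
    seen, cleaned = set(), []
    for record in records:
        completed = {
            field: default if record.get(field) is None else record[field]
            for field, default in defaults.items()
        }
        key = tuple(completed[field] for field in defaults)
        if key not in seen:  # drop exact duplicates
            seen.add(key)
            cleaned.append(completed)
    return cleaned

def to_numeric(records, periods):
    """Convert Boolean and string characteristic values to numbers."""
    rows = []
    for record in records:
        row = [
            float(record["attachment_count"]),                # already numeric
            1.0 if record["has_external_recipient"] else 0.0,  # Boolean -> 0/1
        ]
        # String -> one-hot vector over the known categories.
        row.extend(1.0 if record["send_period"] == p else 0.0 for p in periods)
        rows.append(row)
    return rows

raw = [
    {"attachment_count": 2, "has_external_recipient": True, "send_period": "night"},
    {"attachment_count": 2, "has_external_recipient": True, "send_period": "night"},
    {"attachment_count": None, "has_external_recipient": False, "send_period": "work"},
]
defaults = {"attachment_count": 0, "has_external_recipient": False, "send_period": "work"}
matrix = to_numeric(clean(raw, defaults), periods=["work", "night"])
```

After this step every row is a purely numeric vector, which is the form the training process described above requires.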
Correspondingly, an embodiment of the present application also provides an abnormal mail recognition device. Referring to the structural diagram shown in Fig. 4, the abnormal mail recognition device includes at least:
a data acquisition unit 100, configured to obtain historical mail data and the annotation information corresponding to the historical mail data, the annotation information being used to mark the historical mail data as normal mail data or abnormal mail data;
a data processing unit 200, configured to perform feature extraction on the historical mail data to obtain the characteristic value set corresponding to the historical mail data;
a modeling unit 300, configured to establish a mail recognition model according to the annotation information and the characteristic value set;
a recognition unit 400, configured to recognize the target mail using the mail recognition model when a mail sending event is detected, so as to determine whether the target mail is an abnormal mail.
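One way to picture how the four units cooperate is the sketch below. The class and function names, the single attachment-count feature and the midpoint-threshold model are hypothetical stand-ins chosen to keep the example self-contained; they are not the structure defined by the application.

```python
class AbnormalMailRecognizer:
    """Wires together stand-ins for units 100 to 400."""

    def __init__(self, extract, build_model):
        self.extract = extract          # role of data processing unit 200
        self.build_model = build_model  # role of modeling unit 300
        self.model = None

    def acquire(self, store):
        """Role of data acquisition unit 100: mails plus their annotations."""
        return [mail for mail, _ in store], [label for _, label in store]

    def fit(self, store):
        mails, labels = self.acquire(store)
        features = [self.extract(mail) for mail in mails]
        self.model = self.build_model(features, labels)

    def on_mail_sent(self, mail):
        """Role of recognition unit 400: called when a send event is detected."""
        if self.model is None:
            raise RuntimeError("model has not been established yet")
        return self.model(self.extract(mail)) == 1

def extract_attachment_count(mail):
    return [float(len(mail["attachments"]))]

def build_midpoint_model(features, labels):
    normal = [f[0] for f, y in zip(features, labels) if y == 0]
    abnormal = [f[0] for f, y in zip(features, labels) if y == 1]
    threshold = (sum(normal) / len(normal) + sum(abnormal) / len(abnormal)) / 2
    return lambda f: 1 if f[0] > threshold else 0

store = [
    ({"attachments": []}, 0),
    ({"attachments": ["a"]}, 0),
    ({"attachments": ["a"] * 7}, 1),
    ({"attachments": ["a"] * 9}, 1),
]
recognizer = AbnormalMailRecognizer(extract_attachment_count, build_midpoint_model)
recognizer.fit(store)
```

Passing the extractor and model builder in as callables keeps the units replaceable, so a logistic regression builder could be substituted without changing the recognition path.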
As can be seen from the above technical solution, this embodiment performs feature extraction and data training on a large amount of historical mail data to obtain a mail recognition model, and uses the model to automatically recognize whether a target mail is abnormal. Compared with the prior art, the recognition process of this embodiment is performed entirely automatically by the model. Because the model is trained on a large amount of historical mail data, its recognition coverage is more comprehensive, which avoids abnormal mails going unrecognized because of loopholes in a sensitive-information inspection policy, and thereby improves recognition accuracy.
Optionally, the data processing unit 200 may be specifically configured to:
extract, from the historical mail data, the independent characteristic value corresponding to each feature;
and/or extract, from the historical mail data, the associated characteristic value corresponding to a plurality of interrelated features.
Optionally, the modeling unit 300 may be specifically configured to:
divide the characteristic value set into a normal mail sample set and an abnormal mail sample set according to the annotation information;
establish the mail recognition model according to the normal mail sample set and the abnormal mail sample set.
Optionally, the modeling unit 300 may be specifically configured to establish the mail recognition model by a binary logistic regression algorithm according to the annotation information and the characteristic value set.
Optionally, the modeling unit 300 specifically includes:
an initial model establishing unit, configured to perform data training according to a first subset of the characteristic value set to obtain an initial model;
a model verification unit, configured to verify the initial model according to a second subset of the characteristic value set to obtain the mail recognition model.
After the model has been built from historical mail data, this embodiment further uses historical mail data to verify and tune the model, which can improve the recognition accuracy of the mail recognition model.
Optionally, the recognition unit 400 may be specifically configured to:
match each characteristic value of the target mail against the characteristic value range recorded for the corresponding feature in the mail recognition model, to obtain a corresponding matching value;
compute a weighted sum of the matching values according to the weight recorded for each feature in the mail recognition model;
determine whether the target mail is an abnormal mail according to the result of the weighted sum.
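The three recognition steps can be sketched as follows. The application does not define how a matching value is computed, so this sketch makes several assumptions: the recorded ranges describe abnormal behaviour, a matching value is 1 when the characteristic value falls inside its range and 0 otherwise, and the mail is judged abnormal when the weighted sum exceeds a hypothetical threshold. The feature names, ranges and weights are illustrative.

```python
def recognize_by_weighted_match(model, target, threshold):
    """Return (is_abnormal, weighted_sum) for one target mail."""
    weighted_sum = 0.0
    for feature, (low, high, weight) in model.items():
        # Step 1: match the characteristic value against the recorded range.
        matching_value = 1.0 if low <= target[feature] <= high else 0.0
        # Step 2: accumulate the weighted sum of the matching values.
        weighted_sum += weight * matching_value
    # Step 3: decide from the weighted sum.
    return weighted_sum > threshold, weighted_sum

# feature -> (range low, range high, weight); all values are hypothetical.
model = {
    "attachment_count": (5, 100, 0.5),
    "recipient_count": (20, 1000, 0.25),
    "sent_at_night": (1, 1, 0.25),
}
target = {"attachment_count": 8, "recipient_count": 3, "sent_at_night": 1}
is_abnormal, weighted_sum = recognize_by_weighted_match(model, target, threshold=0.5)
```

Here two of the three features fall inside their ranges, giving a weighted sum of 0.75, which exceeds the threshold of 0.5.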
Referring to the structural diagram, shown in Fig. 5, of an abnormal mail recognition device provided by another embodiment, the abnormal mail recognition device further includes at least one of the following:
a verification unit 500, configured to verify the recognition result obtained by the recognition unit and, when verification shows that the recognition result is wrong, to retrigger the data acquisition unit, the data processing unit and the modeling unit, so that the mail data corresponding to the target mail is added to the historical mail data and the mail recognition model is re-established;
a data optimization unit 600, configured to perform a data cleaning operation on the characteristic value set, and/or to convert the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
This embodiment further verifies the recognition result after completing the recognition step for the target mail. If verification shows that the recognition result is wrong, the mail recognition model is defective (overfitted or underfitted). The mail data corresponding to the target mail is therefore also used as historical mail data, and the mail recognition model is re-established to eliminate the defect. As the number of model rebuilds increases, the mail recognition model becomes more complete and its recognition accuracy becomes higher.
In addition, through data cleaning and data conversion, this embodiment can improve the quality of the data in the characteristic value set, and thus the accuracy of both model establishment and the recognition result.
With regard to the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another; each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the description of the method embodiment.
It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. An abnormal mail recognition method, characterized by comprising:
obtaining historical mail data and annotation information corresponding to the historical mail data, the annotation information being used to mark the historical mail data as normal mail data or abnormal mail data;
performing feature extraction on the historical mail data to obtain a characteristic value set corresponding to the historical mail data;
establishing a mail recognition model according to the annotation information and the characteristic value set;
when a mail sending event is detected, recognizing a target mail using the mail recognition model to determine whether the target mail is an abnormal mail.
2. The method according to claim 1, characterized in that establishing the mail recognition model according to the annotation information and the characteristic value set comprises:
dividing the characteristic value set into a normal mail sample set and an abnormal mail sample set according to the annotation information, wherein the normal mail sample set includes at least a first normal subset and a second normal subset that have no intersection with each other, and the abnormal mail sample set includes at least a first abnormal subset and a second abnormal subset that have no intersection with each other;
performing data training according to the first normal subset and/or the first abnormal subset to obtain an initial model;
verifying the initial model according to the second normal subset and/or the second abnormal subset to obtain the mail recognition model.
3. The method according to claim 1, characterized in that establishing the mail recognition model according to the annotation information and the characteristic value set comprises:
establishing the mail recognition model by a binary logistic regression algorithm according to the annotation information and the characteristic value set.
4. The method according to claim 1, characterized in that recognizing the target mail using the mail recognition model comprises:
matching each characteristic value of the target mail against the characteristic value range recorded for the corresponding feature in the mail recognition model, to obtain a corresponding matching value;
computing a weighted sum of the matching values according to the weight recorded for each feature in the mail recognition model;
determining whether the target mail is an abnormal mail according to the result of the weighted sum.
5. The method according to claim 1, characterized in that performing feature extraction on the historical mail data comprises:
extracting, from the historical mail data, the independent characteristic value corresponding to each feature;
and/or extracting, from the historical mail data, the associated characteristic value corresponding to a plurality of interrelated features.
6. The method according to any one of claims 1 to 5, characterized by further comprising:
verifying the recognition result after the target mail has been recognized using the mail recognition model;
when verification shows that the recognition result is wrong, adding the mail data corresponding to the target mail to the historical mail data and re-establishing the mail recognition model.
7. The method according to any one of claims 1 to 5, characterized in that, before establishing the mail recognition model according to the annotation information and the characteristic value set, the method further comprises:
performing a data cleaning operation on the characteristic value set;
and/or converting the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
8. An abnormal mail recognition device, characterized by comprising:
a data acquisition unit, configured to obtain historical mail data and annotation information corresponding to the historical mail data, the annotation information being used to mark the historical mail data as normal mail data or abnormal mail data;
a data processing unit, configured to perform feature extraction on the historical mail data to obtain a characteristic value set corresponding to the historical mail data;
a modeling unit, configured to establish a mail recognition model according to the annotation information and the characteristic value set;
a recognition unit, configured to recognize a target mail using the mail recognition model when a mail sending event is detected, so as to determine whether the target mail is an abnormal mail.
9. The device according to claim 8, characterized by further comprising:
a verification unit, configured to verify the recognition result obtained by the recognition unit and, when verification shows that the recognition result is wrong, to retrigger the data acquisition unit, the data processing unit and the modeling unit, so that the mail data corresponding to the target mail is added to the historical mail data and the mail recognition model is re-established.
10. The device according to claim 8, characterized by further comprising:
a data cleaning unit, configured to perform a data cleaning operation on the characteristic value set;
and/or a data conversion unit, configured to convert the non-numeric characteristic values in the characteristic value set into numeric characteristic values.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611065946.1A CN107196844A (en) 2016-11-28 2016-11-28 Exception mail recognition methods and device

Publications (1)

Publication Number Publication Date
CN107196844A true CN107196844A (en) 2017-09-22

Family

ID=59871650

Country Status (1)

Country Link
CN (1) CN107196844A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 813, 8 / F, 34 Haidian Street, Haidian District, Beijing 100080

Applicant after: BEIJING ULTRAPOWER INFORMATION SAFETY TECHNOLOGY Co.,Ltd.

Address before: 100107 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building 6 storey block A room 604

Applicant before: BEIJING ULTRAPOWER INFORMATION SAFETY TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170922
