Summary of the invention
An object of the invention is to provide a method for generating a sender reputation value based on user behavior analysis, so as to filter spam more effectively.
This object is achieved by a method for generating a reputation value based on user behavior analysis, comprising the following steps:
A) Initialize the system, load the configuration file, extract the sender's feature values from the log information, and connect to the feature database, comprising:
(a1) as preparation before running, load and analyze the massive log files, and extract from the log information the mail body size, the number of successful sends, the number of failed sends, the total number of sends, the number of recipient replies, the mail content, the sender's domain name, and the numbers of successful and failed sends per IP;
(a2) save the feature values extracted from the logs in the feature database;
B) Sender reputation value generation phase: this phase generates the corresponding reputation value mainly by analyzing the user's historical send count, the sending success rate, the number of mails sent on the same day, whether recipients reply, and the mail content;
C) Sender reputation value storage phase, with the following specific steps:
(c1) if the sender's feature values match the reputation value described in step B), save the generated reputation value in the database;
(c2) if the sender's feature values do not match the reputation value described in step B), save the feature values in the database so that they can be analyzed again next time.
Compared with common sender reputation generation methods, the beneficial effect of the invention is that user behavior analysis is performed on massive logs, and the sender reputation value is generated by jointly considering key characteristics of spam senders such as the total number of sends, the number of sends on the same day, the sending success rate, the mail size, the mail content, and sending to trusted domains. This avoids misjudging spam and improves spam filtering capability.
Specific embodiment
As shown in Figure 1, the present invention relates to a method for generating a sender reputation value based on user behavior analysis, comprising the following steps:
A) Initialize the system, load the configuration file, extract the sender's feature values from the log information, and connect to the feature database:
(1) as preparation before running, load and analyze the massive log files, and extract from the log information the mail body size, the number of successful sends, the number of failed sends, the total number of sends, the number of recipient replies, the mail content, the sender's domain name, the numbers of successful and failed sends per IP, and similar information;
(2) save the feature values extracted from the logs in the feature database.
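The feature extraction in step A) can be illustrated with a minimal sketch. It assumes the log lines have already been parsed into dictionaries; the keys `sender`, `size`, `delivered` and `replied`, the `SenderFeatures` class, and the `extract_features` function are hypothetical names chosen for illustration and do not come from the specification, which does not describe the log format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SenderFeatures:
    """Per-sender feature values aggregated from the logs (illustrative subset)."""
    total_sent: int = 0
    success_count: int = 0
    failure_count: int = 0
    reply_count: int = 0
    max_body_size: int = 0                      # largest mail body seen, in bytes
    domains: set = field(default_factory=set)   # sender domain names seen


def extract_features(records):
    """Aggregate feature values per sender.

    Each record is a dict with the assumed keys:
    sender (str), size (int), delivered (bool), replied (bool, optional).
    """
    features = defaultdict(SenderFeatures)
    for rec in records:
        f = features[rec["sender"]]
        f.total_sent += 1
        if rec["delivered"]:
            f.success_count += 1
        else:
            f.failure_count += 1
        if rec.get("replied"):
            f.reply_count += 1
        f.max_body_size = max(f.max_body_size, rec["size"])
        # Take everything after the last "@" as the sender's domain name.
        f.domains.add(rec["sender"].rsplit("@", 1)[-1])
    return dict(features)


records = [
    {"sender": "a@example.com", "size": 2048, "delivered": True, "replied": True},
    {"sender": "a@example.com", "size": 600000, "delivered": False},
    {"sender": "b@example.org", "size": 1024, "delivered": True},
]
feats = extract_features(records)
```

In this sketch the aggregated `SenderFeatures` objects stand in for the rows that step (2) would save in the feature database.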
B) Sender reputation value generation phase: this phase generates the corresponding reputation value mainly by analyzing the user's historical send count, the sending success rate, the number of mails sent on the same day, whether recipients reply, and the mail content. The main steps are as follows:
(1) after the feature values are fetched from the database, a check is made: if the sender's historical send total is less than 3 mails, there is too little data to generate a reputation value, and the flow ends directly;
(2) when the historical send count is more than 3 mails and the sending success rate is less than 0.76, the reputation value is set to 30 points;
(3) when the sending success rate is 100%, the success rate of the IP sending records is 100%, and any one of the following conditions is met: the recipient has replied, the mail content matches a trusted keyword, a mail size exceeds 500k, or mail has been sent to a trusted domain, the reputation value is set to 40 points;
(4) when the send count is more than 5 mails, the number of failed sends is 0, the recipients are identical and total more than 3, and the mail contains a trusted keyword, the reputation value is set to 80 points;
(5) when the send count is more than 5 mails, the number of failed sends is 0, more than 1 mail was sent on the same day, and any one of the following conditions is met: the mail matches more than 2 trusted keywords, mail has been sent to a trusted domain, the recipient has replied, or more than 2 of the mails sent exceed 500k in size, the reputation value is set to 80 points;
(6) when the send count is more than 5 mails, the number of failed sends is greater than 0 and at most 2, mail has been sent to a trusted domain, and more than 1 mail was sent on the same day, the reputation value is set to 70 points;
(7) when the send count is more than 5 mails, the number of failed sends is greater than 0 and at most 2, a recipient has replied, and more than 1 mail was sent on the same day, the reputation value is set to 70 points;
(8) when the send count is more than 5 mails, the number of failed sends is greater than 0 and at most 2, the mail content matches more than 2 trusted keywords, and more than 1 mail was sent on the same day, the reputation value is set to 70 points;
(9) when the send count is more than 5 mails, the number of failed sends is greater than 0 and at most 2, the mail content contains a trusted keyword, and at least 1 of the mails sent exceeds 500k in size, the reputation value is set to 70 points;
(10) when the send count is more than 5 mails, the number of failed sends is greater than 0 and at most 2, the mail content contains a trusted keyword, and the recipients are identical and total more than 3, the reputation value is set to 70 points;
(11) when the send count is more than 5 mails, the number of failed sends is greater than 2 and at most 9, the number of failed sends is 3, and fewer than 3 mails were sent on the same day, the reputation value is set to 30 points;
(12) when the send count is more than 5 mails, the number of failed sends is greater than 2 and at most 9, the send count is more than 20 mails, the mail content matches more than 4 trusted keywords, and the recipients total more than 12 with more than 4 of them identical, the reputation value is set to 70 points;
(13) when the send count is more than 5 mails, the number of failed sends is greater than 2 and at most 9, the send count is more than 20 mails, the mail content matches more than 4 trusted keywords, and more than 4 mails were sent on the same day, the reputation value is set to 70 points;
(14) when the send count is less than 5 mails, the number of failed sends is greater than 0 and at most 2, at least 1 of the mails sent exceeds 500k in size, and the mail content contains a trusted keyword, the reputation value is set to 70 points.
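A few of the rules above can be sketched as a small rule function. This is an illustration under stated assumptions, not the patented implementation: it covers only rules (1), (2), (4) and (6), it assumes the rules are tried in the listed order with the first match winning (the specification does not state a precedence), and the `Features` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Features:
    """Hypothetical per-sender feature values used by the rules."""
    total_sent: int
    failures: int
    sent_today: int = 0
    same_recipient_total: int = 0     # mails addressed to the same recipient
    trusted_keyword_hits: int = 0     # trusted keywords matched in the content
    sent_to_trusted_domain: bool = False

    @property
    def success_rate(self) -> float:
        if self.total_sent == 0:
            return 0.0
        return (self.total_sent - self.failures) / self.total_sent


def reputation(f: Features) -> Optional[int]:
    """Return a reputation value, or None when no covered rule matches.

    Assumption: rules are evaluated in the order listed, first match wins.
    """
    # Rule (1): too little history, no value can be generated.
    if f.total_sent < 3:
        return None
    # Rule (2): more than 3 mails sent and success rate below 0.76.
    if f.total_sent > 3 and f.success_rate < 0.76:
        return 30
    # Rule (4): no failures, repeated recipient, trusted keyword present.
    if (f.total_sent > 5 and f.failures == 0
            and f.same_recipient_total > 3 and f.trusted_keyword_hits >= 1):
        return 80
    # Rule (6): 1 to 2 failures, trusted domain, more than 1 mail today.
    if (f.total_sent > 5 and 0 < f.failures <= 2
            and f.sent_to_trusted_domain and f.sent_today > 1):
        return 70
    return None
```

Returning `None` both for rule (1) and for the no-match case keeps the caller's decision simple: a sender without a value has only its feature values retained for a later analysis pass.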
C) Sender reputation value storage phase, with the following specific steps:
(1) if the sender's feature values match one of the rules above, save the generated reputation value in the database.
(2) if the sender's feature values do not match any rule, save the feature values in the database so that they can be analyzed again next time.
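The two branches of the storage phase in step C) can be sketched with Python's standard `sqlite3` module. The specification does not name a database or schema, so the table names `reputation` and `pending_features`, the JSON encoding of the feature values, and the helper functions `open_store` and `store_result` are all assumptions made for illustration.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a database with two hypothetical tables: scores and pending features."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reputation "
        "(sender TEXT PRIMARY KEY, score INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_features "
        "(sender TEXT PRIMARY KEY, features TEXT)"
    )
    return conn


def store_result(conn, sender, features, score):
    """Save the score if a rule matched; otherwise keep the features for next time."""
    if score is not None:
        # Step (1): a rule matched, so the generated value is saved.
        conn.execute(
            "INSERT OR REPLACE INTO reputation (sender, score) VALUES (?, ?)",
            (sender, score),
        )
    else:
        # Step (2): no rule matched, so the feature values are kept for re-analysis.
        conn.execute(
            "INSERT OR REPLACE INTO pending_features (sender, features) VALUES (?, ?)",
            (sender, json.dumps(features)),
        )
    conn.commit()


conn = open_store()
store_result(conn, "a@example.com", {"total_sent": 10}, 80)
store_result(conn, "b@example.com", {"total_sent": 4}, None)
```

Keeping unmatched senders in a separate table is one possible design; it lets a later run pick them up again once more log data has accumulated.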
Analysis of users' long-term sending behavior shows that a user's historical sending behavior is predictive of whether a future email is spam. For example, if a sender has sent spam in the past, the probability that a mail sent later by that sender is spam is very high. Analysis of users' sending behavior by an intelligent algorithm shows that spam has the following characteristics:
1) The mail size is not too large, because an overly large mail slows down spam delivery.
2) The sending success rate is not high, because some of the mails are intercepted by anti-spam systems.
3) The sending volume is large, and the mails are usually sent with bulk-mailing tools.
4) Recipients do not reply.
5) The mail content is mostly advertising, political, or pornographic.
6) The sending domain names are mostly unfamiliar domain names.
By the method for machine learning, the massive logs producing on line are analyzed, choose mail body size, transmit and successfully count, transmit and unsuccessfully count, transmit sum, addressee replys number, Mail Contents, sender's domain name, i () ip transmits successfully and unsuccessfully multiple characteristic dimension such as number, by massive logs, characteristic model is trained, (ii) these eigenvalues are generated with an overall reputation score storehouse, mail mates this feature prestige storehouse in real time, (iii) the specific credit value of sender is generated to the sender meeting condition, improve the accuracy of credit value.
Generating the sender reputation value intelligently from the above characteristics is a good method for filtering spam; it has proved highly effective in practice, with a very low misjudgment rate.
In summary, the beneficial effect of the invention is that user behavior analysis is performed on massive logs, and the sender reputation value is generated by jointly considering key characteristics of spam senders such as the total number of sends, the number of sends on the same day, the sending success rate, the mail size, the mail content, and sending to trusted domains. This prevents any single feature from skewing the reputation value and causing spam to be misjudged, and it improves spam filtering capability.