CN102880952A

CN102880952A - Method for collecting and classifying E-mails

Info

Publication number: CN102880952A
Application number: CN2012103276245A
Authority: CN
Inventors: 林延中; 潘庆峰
Original assignee: MAIMAILTECH (BEIJING) CO Ltd
Current assignee: MAIMAILTECH (BEIJING) CO Ltd
Priority date: 2012-09-07
Filing date: 2012-09-07
Publication date: 2013-01-16
Also published as: WO2014036788A1

Abstract

The invention discloses a method for collecting and classifying e-mails, comprising the following steps: scanning all reported e-mails on a server and extracting target e-mails that have been reported n or more times, where n is a preset value and the reported e-mails include both e-mails reported as normal and e-mails reported as spam; calculating the confidence of each target e-mail to obtain a calculation result; and judging, according to the calculation result, whether the target e-mail is spam or normal, and storing it in a database. With the invention, no dedicated staff are needed to classify and label large numbers of e-mails; instead, a computer directly collects users' feedback. This reduces the manual workload, safeguards classification accuracy, and, because no person reads the e-mails, protects users' privacy.

Description

A method for collecting and classifying e-mails

Technical field

The present invention relates to the field of communication technology, and in particular to a method for collecting and classifying e-mails.

Background technology

At present, text classification relies on artificial-intelligence classification algorithms. These algorithms must first learn from training samples and construct a corresponding discrimination model before they can classify text, so training samples have to be obtained first. The current way of obtaining them is for people to label a batch of samples directly, marking each e-mail as spam or non-spam.

Because a classification algorithm needs a sufficient amount of training information, at least tens of thousands of training samples must be learned before a reliable model can be built. Dedicated staff therefore have to classify and label tens of thousands of e-mails, which is a huge workload. People who perform such repetitive work for long periods are prone to mistakes, which raises the sample error rate and degrades the final learning result of the classification algorithm. In addition, labeling e-mails requires people to read users' mail, which invades users' privacy.

Summary of the invention

The technical problem to be solved by the embodiments of the present invention is to provide a method for collecting and classifying e-mails that does not require dedicated staff to classify and label large numbers of e-mails. Instead, a computer directly collects users' feedback, which reduces the manual workload and safeguards classification accuracy; since no person needs to read the e-mails, users' privacy is also protected.

To solve the above technical problem, an embodiment of the present invention provides a method for collecting and classifying e-mails, comprising: scanning all reported e-mails on a server and extracting target e-mails that have been reported n or more times, where n is a preset value and the reported e-mails include e-mails reported as normal and e-mails reported as spam; calculating the confidence of the target e-mail to obtain a calculation result; and judging, according to the calculation result, whether the target e-mail is spam or normal, and storing it in a database.

As an improvement of the above scheme, the step of calculating the confidence of the target e-mail comprises: adding up the confidences of all reporters who reported the target e-mail as normal to obtain a total normal-mail confidence X; adding up the confidences of all reporters who reported the target e-mail as spam to obtain a total spam confidence Y; and calculating the absolute value |X-Y| of the difference between X and Y to obtain the calculation result.

As an improvement of the above scheme, the step of judging, according to the calculation result, whether the target e-mail is spam or normal comprises: comparing |X-Y| with a threshold T and determining whether |X-Y| is less than T. If so, the e-mail is not judged for the time being. If not, X and Y are compared: when X is greater than Y, the e-mail is judged to be normal; when X is less than Y, it is judged to be spam.

As an improvement of the above scheme, before the step of calculating the confidence of the target e-mail, the method further comprises: presetting the initial confidence of a reporter who reports an e-mail for the first time to 1.

As an improvement of the above scheme, the method further comprises: updating the reporters' confidences by increasing the confidence of reporters whose reports agree with the final judgment and decreasing the confidence of reporters whose reports disagree with it.

As an improvement of the above scheme, the confidence increases more slowly than it decreases.

As an improvement of the above scheme, the confidence has a maximum value and a minimum value: once it rises to the maximum it no longer increases, and once it falls to the minimum it no longer decreases.

The beneficial effects of implementing the present invention are as follows. A computer scans all reported e-mails on the server, extracts the target e-mails that have been reported at least the preset number of times, performs a confidence calculation on each target e-mail, judges from the calculation result whether the reported e-mail is spam or normal, and collects it into the corresponding database. In this process the computer processes user feedback directly on the basis of confidence, which reduces manual effort and workload, safeguards classification accuracy, and, because no person needs to read the e-mails, protects users' privacy.

Description of drawings

Fig. 1 is a schematic flowchart of a first embodiment of the e-mail collection and classification method of the present invention;

Fig. 2 is a schematic flowchart of a second embodiment of the e-mail collection and classification method of the present invention;

Fig. 3 is a schematic flowchart of a third embodiment of the e-mail collection and classification method of the present invention;

Fig. 4 is a schematic flowchart of a fourth embodiment of the e-mail collection and classification method of the present invention.

Embodiment

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a first embodiment of the e-mail collection and classification method of the present invention, comprising:

S100: scan all reported e-mails on the server and extract the target e-mails that have been reported n or more times.

Here n is a preset value, and the reported e-mails include e-mails reported as normal and e-mails reported as spam.

It should be noted that a computer automatically scans all reported e-mails on the server, and it scans the server once at regular intervals. The tolerance value n can be set according to the circumstances; preferably, n is 3.

S101: calculate the confidence of the target e-mail to obtain a calculation result.

S102: judge, according to the calculation result, whether the target e-mail is spam or normal, and store it in a database.

It should be noted that e-mails judged to be spam are stored in a spam database, and e-mails judged to be normal are stored in a normal e-mail database.
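The extraction in step S100 can be sketched as follows. This is only an illustrative sketch: the record layout `(mail_id, reporter_id, label)`, the function name `extract_targets`, and the label strings `"normal"` and `"spam"` are assumptions made for the example and do not appear in the original text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical report record: (mail_id, reporter_id, label),
# where label is either "normal" or "spam".
Report = Tuple[str, str, str]


def extract_targets(reports: Iterable[Report], n: int = 3) -> Dict[str, List[Report]]:
    """Group reports by e-mail and keep only e-mails reported at least n times."""
    by_mail: Dict[str, List[Report]] = defaultdict(list)
    for report in reports:
        by_mail[report[0]].append(report)
    # An e-mail becomes a target only when its report count reaches n.
    return {mail_id: rs for mail_id, rs in by_mail.items() if len(rs) >= n}


reports = [
    ("M", "A", "normal"), ("M", "B", "normal"),
    ("M", "C", "spam"), ("M", "D", "spam"),
    ("K", "A", "spam"),  # reported only once, so it is not extracted
]
targets = extract_targets(reports, n=3)
```

With n set to 3, only e-mail `M` (four reports) is extracted; `K` stays on the server until later scans find enough reports.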

Fig. 2 is a schematic flowchart of a second embodiment of the e-mail collection and classification method of the present invention, comprising:

S200: scan all reported e-mails on the server and extract the target e-mails that have been reported n or more times.

S201: add up the confidences of all reporters who reported the target e-mail as normal to obtain the total normal-mail confidence X.

S202: add up the confidences of all reporters who reported the target e-mail as spam to obtain the total spam confidence Y.

It should be noted that steps S201 and S202 have no required order and may be performed simultaneously.

S203: calculate the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.

S204: judge, according to the calculation result, whether the target e-mail is spam or normal, and store it in a database.

For example, a scan finds that e-mail M has been reported 4 times, which exceeds the preset value of 3 (the default), so it is extracted as a target e-mail. Reporters A and B reported M as normal, and reporters C and D reported M as spam. The confidence of reporter A is 5, that of reporter B is 10, that of reporter C is 3, and that of reporter D is 8. The total normal-mail confidence X is then 5+10=15, the total spam confidence Y is 3+8=11, and the absolute value of their difference is |X-Y| = |15-11| = 4.
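The sums in steps S201 to S203 can be sketched in a few lines. The dictionaries `votes` and `confidence` and the function name `total_confidences` are hypothetical names chosen for this illustration; the numbers are those of the example above.

```python
from typing import Dict, Tuple


def total_confidences(
    votes: Dict[str, str], confidence: Dict[str, float]
) -> Tuple[float, float, float]:
    """Return (X, Y, |X - Y|) for one target e-mail.

    votes maps a reporter to "normal" or "spam";
    confidence maps a reporter to that reporter's current confidence.
    """
    x = sum(confidence[r] for r, label in votes.items() if label == "normal")
    y = sum(confidence[r] for r, label in votes.items() if label == "spam")
    return x, y, abs(x - y)


# Values taken from the worked example for e-mail M.
votes = {"A": "normal", "B": "normal", "C": "spam", "D": "spam"}
confidence = {"A": 5, "B": 10, "C": 3, "D": 8}
x, y, diff = total_confidences(votes, confidence)
```

Because the two sums are independent of each other, they can be computed in either order, which matches the note that S201 and S202 have no required sequence.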

Fig. 3 is a schematic flowchart of a third embodiment of the e-mail collection and classification method of the present invention, comprising:

S300: scan all reported e-mails on the server and extract the target e-mails that have been reported n or more times.

S301: add up the confidences of all reporters who reported the target e-mail as normal to obtain the total normal-mail confidence X.

S302: add up the confidences of all reporters who reported the target e-mail as spam to obtain the total spam confidence Y.

It should be noted that steps S301 and S302 have no required order and may be performed simultaneously.

S303: calculate the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.

S304: compare |X-Y| with a threshold T and determine whether |X-Y| is less than T.

It should be noted that the threshold T can be preset according to the circumstances. T is usually higher than the initial confidence; preferably, T is 3.

If so, the e-mail is not judged for the time being.

It should be noted that a target e-mail that is not judged for the time being remains temporarily stored on the server and is left for judgment in a subsequent scan.

If not, X and Y are compared: when X is greater than Y, the e-mail is judged to be normal; when X is less than Y, the e-mail is judged to be spam.

For example, a scan finds that e-mail m has been reported 4 times, which exceeds the preset value of 3 (the default), so it is extracted as a target e-mail. Reporters a and b reported m as normal, and reporters c and d reported m as spam. The confidence of reporter a is 5, that of reporter b is 10, that of reporter c is 5, and that of reporter d is 8. The total normal-mail confidence X is then 5+10=15, the total spam confidence Y is 5+8=13, and |X-Y| = |15-13| = 2. With the threshold T preset to 3, |X-Y| < T, so e-mail m is not judged for the time being; it remains temporarily stored on the server and is left for judgment in a subsequent scan.

As another example, a scan finds that e-mail M has been reported 4 times, which exceeds the preset value of 3, so it is extracted as a target e-mail. Reporters A and B reported M as normal, and reporters C and D reported M as spam. Suppose the confidence of reporter A is 5, that of reporter B is 10, that of reporter C is 3, and that of reporter D is 8. The total normal-mail confidence X is then 5+10=15, the total spam confidence Y is 3+8=11, and |X-Y| = |15-11| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y must be compared: X=15, Y=11, X > Y. E-mail M is therefore judged to be normal and is collected into the normal e-mail database.

Suppose instead that the confidence of reporter A is 3, that of reporter B is 8, that of reporter C is 5, and that of reporter D is 10. The total normal-mail confidence X is then 3+8=11, the total spam confidence Y is 5+10=15, and |X-Y| = |11-15| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y must be compared: X=11, Y=15, X < Y. E-mail M is therefore judged to be spam and is collected into the spam database.
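The threshold decision of this embodiment can be sketched as a small function. The name `judge` and the use of `None` to represent "not judged for the time being" are assumptions of this sketch. The text does not say what happens when X equals Y; the sketch treats that case as a deferral too, which is also an assumption.

```python
from typing import Optional


def judge(x: float, y: float, t: float = 3) -> Optional[str]:
    """Return "normal", "spam", or None when the e-mail is deferred."""
    if abs(x - y) < t:
        return None  # margin below threshold T: leave for a later scan
    if x > y:
        return "normal"
    if x < y:
        return "spam"
    return None  # X == Y is not covered by the text; deferred here by assumption


# The three cases from the worked examples, with T = 3.
deferred = judge(15, 13)   # |15 - 13| = 2 < 3
normal = judge(15, 11)     # |15 - 11| = 4, X > Y
spam = judge(11, 15)       # |11 - 15| = 4, X < Y
```

A deferred e-mail simply stays among the reported e-mails, so the next scan re-evaluates it once more reports or updated confidences are available.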

Fig. 4 is a schematic flowchart of a fourth embodiment of the e-mail collection and classification method of the present invention, comprising:

S400: scan all reported e-mails on the server and extract the target e-mails that have been reported n or more times.

S401: preset the initial confidence of a reporter who reports an e-mail for the first time to 1.

S402: add up the confidences of all reporters who reported the target e-mail as normal to obtain the total normal-mail confidence X.

It should be noted that steps S401 and S402 have no required order and may be performed simultaneously.

S403: add up the confidences of all reporters who reported the target e-mail as spam to obtain the total spam confidence Y.

S404: calculate the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.

S405: compare |X-Y| with a threshold T and determine whether |X-Y| is less than T.

If so, the e-mail is not judged for the time being.

S406: update the reporters' confidences, increasing the confidence of reporters whose reports agree with the final judgment and decreasing the confidence of reporters whose reports disagree with it.

It should be noted that the amounts by which the confidence is increased and decreased can be preset as required. Preferably, the increase is +1, and the decrease is either 10% of the current confidence or 1, whichever is larger.

More preferably, the confidence increases more slowly than it decreases.

It should be noted that because the confidence increases more slowly than it decreases, a reporter who holds a high confidence is more credible and more expert, which makes the final judgment more accurate.

More preferably, the confidence has a maximum value and a minimum value: once it rises to the maximum it no longer increases, and once it falls to the minimum it no longer decreases.

It should be noted that the maximum and minimum values can be preset as required; preferably, the maximum is 50 and the minimum is 0.

As another example, a scan finds that e-mail M has been reported 4 times, which exceeds the preset value of 3, so it is extracted as a target e-mail. Reporters A and B reported M as normal, and reporters C and D reported M as spam. Suppose reporter A is reporting for the first time and is therefore given the initial confidence of 1, the confidence of reporter B is 14, that of reporter C is 3, and that of reporter D is 8. The total normal-mail confidence X is then 1+14=15, the total spam confidence Y is 3+8=11, and |X-Y| = |15-11| = 4. With the threshold T preset to 3, |X-Y| > T, so X and Y must be compared: X=15, Y=11, X > Y. E-mail M is therefore judged to be normal and is collected into the normal e-mail database. At the same time, the reporters' confidences are updated. Reporters A and B agree with the judgment, so their confidences increase by 1: the confidence of A becomes 2 and that of B becomes 15. Reporters C and D disagree with the judgment, so their confidences decrease by 10% or by 1, whichever is larger. The original confidence of C is 3; 10% of it is less than 1, so it decreases by 1 to 2. The original confidence of D is 8; 10% of it is less than 1, so it decreases by 1 to 7.

Suppose instead that the confidence of reporter A is 3, that of reporter B is 15, that of reporter C is 5, and that of reporter D is 20. The total normal-mail confidence X is then 3+15=18, the total spam confidence Y is 5+20=25, and |X-Y| = |18-25| = 7. With the threshold T preset to 3, |X-Y| > T, so X and Y must be compared: X=18, Y=25, X < Y. E-mail M is therefore judged to be spam and is collected into the spam database. At the same time, the reporters' confidences are updated. Reporters C and D agree with the judgment, so their confidences increase by 1: the confidence of C becomes 6 and that of D becomes 21. Reporters A and B disagree with the judgment, so their confidences decrease by 10% or by 1, whichever is larger. The original confidence of A is 3; 10% of it is less than 1, so it decreases by 1 to 2. The original confidence of B is 15; 10% of it is greater than 1, so it decreases by 1.5 to 13.5.
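The confidence update of step S406 can be sketched as below, using the preferred values stated in the text (initial confidence 1, increase of 1, decrease of 10% or 1 whichever is larger, maximum 50, minimum 0). The constant and function names are hypothetical and chosen for this illustration only.

```python
from typing import Dict

# Preferred values named in the text; all of them are configurable presets.
INITIAL_CONF = 1.0
MAX_CONF = 50.0
MIN_CONF = 0.0


def updated_confidence(current: float, agreed: bool) -> float:
    """Raise by 1 on agreement; otherwise lower by the larger of 10% and 1."""
    if agreed:
        new = current + 1.0
    else:
        new = current - max(current / 10, 1.0)
    # Clamp so the value never leaves the [MIN_CONF, MAX_CONF] range.
    return min(MAX_CONF, max(MIN_CONF, new))


def update_reporters(
    votes: Dict[str, str], confidence: Dict[str, float], verdict: str
) -> Dict[str, float]:
    """Return new confidences for every reporter of one judged e-mail."""
    return {
        reporter: updated_confidence(
            confidence.get(reporter, INITIAL_CONF), label == verdict
        )
        for reporter, label in votes.items()
    }


# Second worked example: e-mail M is judged to be spam.
votes = {"A": "normal", "B": "normal", "C": "spam", "D": "spam"}
confidence = {"A": 3, "B": 15, "C": 5, "D": 20}
updated = update_reporters(votes, confidence, verdict="spam")
```

Under these presets a reporter gains at most 1 per correct report but loses at least 1 per incorrect one, and proportionally more once the confidence exceeds 10, which is one way to realize the "increases more slowly than it decreases" property.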

As can be seen from the above, a computer scans all reported e-mails on the server, extracts the target e-mails that have been reported at least the preset number of times, performs a confidence calculation on each target e-mail, judges from the calculation result whether the reported e-mail is spam or normal, and collects it into the corresponding database. In this process the computer processes user feedback directly on the basis of confidence, which reduces manual effort and workload, safeguards classification accuracy, and, because no person needs to read the e-mails, protects users' privacy.

The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and refinements without departing from the principle of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the present invention.

Claims

1. A method for collecting and classifying e-mails, characterized by comprising:

scanning all reported e-mails on a server and extracting target e-mails that have been reported n or more times, where n is a preset value and the reported e-mails include e-mails reported as normal and e-mails reported as spam;

calculating the confidence of the target e-mail to obtain a calculation result;

judging, according to the calculation result, whether the target e-mail is spam or normal, and storing it in a database.

2. The method for collecting and classifying e-mails according to claim 1, characterized in that the step of calculating the confidence of the target e-mail comprises:

adding up the confidences of all reporters who reported the target e-mail as normal to obtain a total normal-mail confidence X;

adding up the confidences of all reporters who reported the target e-mail as spam to obtain a total spam confidence Y;

calculating the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y to obtain the calculation result.

3. The method for collecting and classifying e-mails according to claim 2, characterized in that the step of judging, according to the calculation result, whether the target e-mail is spam or normal comprises:

comparing the absolute value |X-Y| of the difference between the total normal-mail confidence X and the total spam confidence Y with a threshold T, and determining whether |X-Y| is less than T;

if so, not judging the e-mail for the time being,

4. The method for collecting and classifying e-mails according to claim 2, characterized in that, before the step of calculating the confidence of the target e-mail, the method further comprises:

presetting the initial confidence of a reporter who reports an e-mail for the first time to 1.

5. The method for collecting and classifying e-mails according to claim 1, characterized by further comprising:

updating the reporters' confidences by increasing the confidence of reporters whose reports agree with the final judgment and decreasing the confidence of reporters whose reports disagree with it.

6. The method for collecting and classifying e-mails according to claim 5, characterized in that the confidence increases more slowly than it decreases.

7. The method for collecting and classifying e-mails according to claim 5, characterized in that the confidence has a maximum value and a minimum value; once the confidence rises to the maximum value it no longer increases, and once it falls to the minimum value it no longer decreases.