CN103902673B - Anti-spam filtering rule upgrade method and device - Google Patents


Info

Publication number
CN103902673B
Authority
CN
China
Prior art keywords
user
word
mail
report
report mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410102982.5A
Other languages
Chinese (zh)
Other versions
CN103902673A (en
Inventor
戴明洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd filed Critical Sina Technology China Co Ltd
Priority to CN201410102982.5A priority Critical patent/CN103902673B/en
Publication of CN103902673A publication Critical patent/CN103902673A/en
Application granted granted Critical
Publication of CN103902673B publication Critical patent/CN103902673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an anti-spam filtering rule upgrade method and device. The method includes: performing denoising and word segmentation on the mail text of a currently obtained user-reported email to obtain the word set of that email; determining the words in the word set that match words in an IDF dictionary to be the effective words of the email; for each effective word of the email, counting the term frequency (TF) value of the effective word in the mail text of the email and calculating the weight of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary; sorting the effective words by weight in descending order, taking a set number of the top-ranked effective words as the signature words of the email, and upgrading the rule words in the anti-spam filtering rules according to the obtained signature words. The invention saves manpower and improves both the efficiency of upgrading anti-spam filtering rules and the effectiveness of those rules.

Description

Anti-spam filtering rule upgrade method and device
Technical field
The present invention relates to the Internet field, and in particular to an anti-spam filtering rule upgrade method and device.
Background
In the current Internet information age, communicating by email is increasingly common. Email transmits information over the network in a store-and-forward manner and is characterized by fast delivery, a broad range of recipients, and low cost. Some businesses or organizations take advantage of this to send emails containing advertising or malicious, false content, and such emails cause considerable disturbance to people.
At present, a user can use the report function provided by the mailbox and report a received email by clicking the report button in the mailbox; the user report data system then records the reported email in a user behavior log. For ease of description, an email reported by a user is referred to as a user-reported email. Operations and maintenance personnel can use shell commands or scripts to extract keywords from the user behavior log recorded by the user report data system, review user-reported emails of the relevant types, and summarize their characteristic data in order to set or upgrade anti-spam filtering rules. The characteristic data of a user-reported email is a set of words that reflect the features of the email. Operations personnel can set these words as rule words in the anti-spam filtering rules, so that emails containing these rule words are identified or intercepted as spam.
However, when determining the rule words of the anti-spam filtering rules in the prior art, operations personnel extract user-reported emails directly from a large volume of user behavior logs, manually analyze their characteristic data, and determine the rule words from it. This approach is time-consuming and labor-intensive, involves a large workload, and makes upgrading the anti-spam filtering rules inefficient. Moreover, it is difficult for operations personnel to check every user-reported email in all the user behavior logs one by one, so data is easily missed. The anti-spam filtering rules derived in this way may therefore be less effective, and some user-reported emails will still not be filtered out by the rules that have been set, which hinders anti-spam work.
Therefore, there is a need for an anti-spam filtering rule upgrade method that both saves manpower and improves the upgrade efficiency and the effectiveness of the anti-spam filtering rules.
Summary of the invention
In view of the above defects of the prior art, the present invention provides an anti-spam filtering rule upgrade method and device that save manpower and improve both the efficiency of upgrading anti-spam filtering rules and the effectiveness of those rules.
According to one aspect of the present invention, an anti-spam filtering rule upgrade method is provided, including:
performing denoising and word segmentation on the mail text of a currently obtained user-reported email to obtain the word set of the email;
determining the words in the word set that match words in an inverse document frequency (IDF) dictionary to be the effective words of the email;
for each effective word of the email, counting the term frequency (TF) value of the effective word in the mail text of the email, and calculating the weight of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
sorting the calculated weights of the effective words in descending order, and taking a set number of the top-ranked effective words as the signature words of the email;
upgrading the rule words in the anti-spam filtering rules according to the obtained signature words.
The IDF dictionary is determined in advance, and the method for determining the IDF dictionary includes:
obtaining the user-reported emails within a set time period to form a user-reported email set;
for each user-reported email in the set, performing denoising and word segmentation on the mail text of the email to obtain the word set of the email, and tagging the part of speech of each word in the word set; after removing the stop words from the word set, determining the words whose part of speech matches a part of speech recorded in a part-of-speech table to be the retained words of the email;
then, for each user-reported email in the set, counting the frequency with which each retained word of the email occurs across the user-reported emails in the set;
for each retained word whose counted frequency is below a set threshold, calculating the IDF value of the retained word over the user-reported email set, and recording the retained word together with its IDF value in the IDF dictionary.
Preferably, obtaining the user-reported emails within the set time period to form the user-reported email set is specifically:
each time an update cycle is reached, obtaining the user-reported emails within the set time period to form the user-reported email set.
Preferably, after the set number of top-ranked effective words are taken as the signature words of the user-reported email, the method further includes:
sorting the obtained signature words by initial letter to form the signature word vector of the email;
comparing the newly formed signature word vector with the signature word vectors of previously obtained user-reported emails recorded in a cache; if it differs from all of them, recording the newly formed signature word vector in the cache; and
upgrading the rule words in the anti-spam filtering rules according to the obtained signature words specifically includes:
when the number of signature word vectors recorded in the cache reaches a set quantity, or each time an upgrade cycle is reached, upgrading the rule words in the anti-spam filtering rules according to the signature word vectors recorded in the cache, and then emptying the cache.
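The caching behaviour described above can be illustrated with a minimal Python sketch. The class name `SignatureCache`, its methods, and the `max_vectors` parameter are hypothetical and do not come from the patent; plain lexicographic sorting stands in for "sorting by initial letter", and the time-based upgrade cycle is left out.

```python
from typing import Iterable, List, Set, Tuple


class SignatureCache:
    """Holds distinct signature word vectors until a rule upgrade is triggered."""

    def __init__(self, max_vectors: int) -> None:
        # Hypothetical name for the "set quantity" that triggers an upgrade.
        self.max_vectors = max_vectors
        self._vectors: Set[Tuple[str, ...]] = set()

    def add(self, signature_words: Iterable[str]) -> bool:
        """Sort the words into a vector and record it only if it is new."""
        vector = tuple(sorted(signature_words))  # assumption: lexicographic order
        if vector in self._vectors:
            return False
        self._vectors.add(vector)
        return True

    def is_full(self) -> bool:
        """True once the number of cached vectors reaches the set quantity."""
        return len(self._vectors) >= self.max_vectors

    def flush(self) -> List[Tuple[str, ...]]:
        """Return all cached vectors for the rule upgrade, then empty the cache."""
        vectors = sorted(self._vectors)
        self._vectors.clear()
        return vectors
```

Sorting the signature words before comparison means that two emails with the same signature words in a different order produce the same vector, so only one copy is cached.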
Preferably, performing denoising and word segmentation on the mail text of the currently obtained user-reported email specifically includes:
removing the special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported email, and performing word segmentation on the mail text using a conditional random field (CRF) algorithm.
According to another aspect of the present invention, an anti-spam filtering rule upgrade device is also provided, including:
a word set determining module, configured to perform denoising and word segmentation on the mail text of a currently obtained user-reported email to obtain the word set of the email;
an effective word determining module, configured to determine the words in the word set that match words in an inverse document frequency (IDF) dictionary to be the effective words of the email;
a weight calculation module, configured to, for each effective word of the email, count the term frequency (TF) value of the effective word in the mail text of the email, and calculate the weight of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
a signature word determining module, configured to sort the weights of the effective words calculated by the weight calculation module in descending order, and take a set number of the top-ranked effective words as the signature words of the email;
a rule upgrade module, configured to upgrade the rule words in the anti-spam filtering rules according to the signature words obtained by the signature word determining module.
Further, the anti-spam filtering rule upgrade device also includes:
an IDF dictionary determining module, configured to: obtain the user-reported emails within a set time period to form a user-reported email set; for each user-reported email in the set, perform denoising and word segmentation on the mail text of the email to obtain the word set of the email, and tag the part of speech of each word in the word set; after removing the stop words from the word set, determine the words whose part of speech matches a part of speech recorded in a part-of-speech table to be the retained words of the email; then, for each user-reported email in the set, count the frequency with which each retained word of the email occurs across the user-reported emails in the set; and, for each retained word whose counted frequency is below a set threshold, calculate the IDF value of the retained word over the set and record the retained word together with its IDF value in the IDF dictionary.
Preferably, the IDF dictionary determining module specifically includes:
a user-reported email acquiring unit, configured to obtain, each time an update cycle is reached, the user-reported emails within the set time period to form the user-reported email set;
a retained word determining unit, configured to, for each user-reported email in the set, perform denoising and word segmentation on the mail text of the email to obtain the word set of the email, and tag the part of speech of each word in the word set; and, after removing the stop words from the word set, determine the words whose part of speech matches a part of speech recorded in the part-of-speech table to be the retained words of the email;
a frequency counting unit, configured to, for each user-reported email in the set, count the frequency with which each retained word of the email occurs across the user-reported emails in the set;
an IDF dictionary determining unit, configured to, for each retained word whose frequency counted by the frequency counting unit is below the set threshold, calculate the IDF value of the retained word over the user-reported email set and record the retained word together with its IDF value in the IDF dictionary.
Preferably, the signature word determining module is further configured to, after taking the set number of top-ranked effective words as the signature words of the user-reported email, sort the obtained signature words by initial letter to form the signature word vector of the email; compare the newly formed signature word vector with the signature word vectors of previously obtained user-reported emails recorded in a cache; and, if it differs from all of them, record the newly formed signature word vector in the cache; and
the rule upgrade module is specifically configured to, upon determining that the number of signature word vectors recorded in the cache has reached a set quantity, or each time an upgrade cycle is reached, upgrade the rule words in the anti-spam filtering rules according to the signature word vectors recorded in the cache, and then empty the cache.
Preferably, the word set determining module is specifically configured to remove the special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported email, and to perform word segmentation on the mail text using a conditional random field (CRF) algorithm to obtain the word set of the email.
In the technical solution of the present invention, denoising and word segmentation are performed on the mail text of the currently obtained user-reported email to obtain its word set. After the effective words of the email are determined according to the IDF dictionary, the TF value of each effective word in the mail text is counted, and the weight of each effective word is obtained from the TF value and the IDF value recorded in the IDF dictionary. According to these weights, the signature words of the email are determined from among the effective words and used to select the rule words of the anti-spam filtering rules. Because the invention automatically determines the signature words of user-reported emails and upgrades the rule words accordingly, operations personnel no longer need to search a large volume of user behavior logs for user-reported emails or review the full text of each retrieved email one by one. This reduces the manual workload, saves manpower, and improves the efficiency of upgrading the anti-spam filtering rules. Moreover, the automatic upgrade can extract signature words from every user-reported email in the user behavior log, so data is unlikely to be missed, the effectiveness of the anti-spam filtering rules is improved as far as possible, and anti-spam work is better supported.
Brief description of the drawings
Fig. 1a is a flowchart of the method for determining the IDF dictionary according to an embodiment of the present invention;
Fig. 1b is a flowchart of the method for obtaining the retained words of a user-reported email according to an embodiment of the present invention;
Fig. 2 is a flowchart of the anti-spam filtering rule upgrade method according to an embodiment of the present invention;
Fig. 3 is a block diagram of the internal structure of the anti-spam filtering rule upgrade device according to an embodiment of the present invention;
Fig. 4 is a block diagram of the internal structure of the IDF dictionary determining module according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Terms such as "module" and "system" used in this application are intended to cover computer-related entities, such as, but not limited to, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself can be modules. One or more modules may reside within a single process and/or thread of execution, and a module may be located on one computer and/or distributed across two or more computers.
The main idea of the present invention is as follows. For a user-reported email obtained from the user behavior log, several words that reflect the substantive content of the email (that is, its characteristic data) are automatically extracted from the mail text on the basis of the TF-IDF (term frequency-inverse document frequency) model and taken as the signature words of the email. The anti-spam filtering rules are then upgraded according to the obtained signature words. Because the user-reported emails in the user behavior log are analyzed and the anti-spam filtering rules are upgraded automatically, operations personnel no longer need to search a large volume of user behavior logs for user-reported emails or review the full text of each retrieved email one by one. This reduces the manual workload, saves manpower, and improves the efficiency of upgrading the anti-spam filtering rules. At the same time, the automatic upgrade can extract signature words from every user-reported email in the user behavior log, so data is unlikely to be missed, the effectiveness of the anti-spam filtering rules is improved as far as possible, and anti-spam work is better supported.
Based on this idea, in the technical solution of the present invention, for a user-reported email obtained from the user behavior log, denoising and word segmentation are performed on the mail text to obtain the word set of the email. The words in the word set are matched against the words in a previously obtained IDF dictionary to determine the effective words of the email. The TF value of each effective word in the mail text is then counted, and the weight of each effective word is obtained from the TF value and the IDF value of the effective word recorded in the IDF dictionary. The signature words of the email are determined from among the effective words according to their weights, and the anti-spam filtering rules are upgraded accordingly. This saves manpower, improves the efficiency of upgrading the anti-spam filtering rules, and improves their effectiveness as far as possible.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings. In the specific embodiments of the invention, an IDF dictionary may be determined in advance, before the signature words of a user-reported email are determined. Specifically, the user-reported emails within a set time period are obtained from the user behavior log to form a user-reported email set; each email in the set is processed to obtain its retained words; some of the retained words are selected according to a set rule; and the selected retained words and their IDF values are recorded in the IDF dictionary. The flow of the method for determining the IDF dictionary is shown in Fig. 1a and includes the following steps:
S101: Obtain the user-reported emails within a set time period to form a user-reported email set.
Specifically, the user-reported emails within the set time period can be obtained from the user behavior log and placed into a user-reported email set as its elements. The set time period is chosen by a person skilled in the art.
Further, the user-reported emails within the set time period can also be obtained periodically so that the IDF dictionary is updated periodically. That is, each time an update cycle is reached, the user-reported emails within the set time period are obtained from the user behavior log to form the user-reported email set for the current update cycle. The update cycle is chosen by a person skilled in the art and may be the same as or different from the set time period. For example, suppose the update cycle is two weeks and the set time period is one week. If, on Monday of the current week, the user-reported emails from Monday to Sunday of last week (one week) are obtained and the IDF dictionary is updated, then on Monday of the week after next, the user-reported emails from Monday to Sunday of next week (one week) can be obtained and the IDF dictionary updated again, so that the IDF dictionary is updated periodically.
S102: For each user-reported email in the user-reported email set, determine the retained words of the email.
Specifically, for each user-reported email in the set, denoising and word segmentation are performed on the mail text of the email to obtain the word set of the email, and the part of speech of each word in the word set is tagged. After the stop words are removed from the word set, the words whose part of speech matches a part of speech recorded in the part-of-speech table are taken as the retained words of the email.
The method for determining the retained words of any one user-reported email in the set is described in detail below with reference to Fig. 1b.
S103: For each user-reported email in the user-reported email set, count the frequency with which each retained word of the email occurs across the user-reported emails in the set.
Specifically, the user-reported email set contains a number of user-reported emails. A retained word of one email may occur more than once in that email and may also occur in other emails. The numbers of occurrences of the retained word in each email of the set are therefore added together to obtain the frequency with which the retained word occurs across the user-reported emails in the set.
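The accumulation in step S103 can be sketched in a few lines of Python. This is only an illustration: the function name `count_corpus_frequency` and the input layout (one list of retained words per email, with repeats preserved) are assumptions, not part of the patent.

```python
from collections import Counter
from typing import Dict, List


def count_corpus_frequency(retained_words_per_mail: List[List[str]]) -> Dict[str, int]:
    """Sum the occurrences of every retained word over all emails in the set."""
    totals: Counter = Counter()
    for words in retained_words_per_mail:
        # Repeated words inside one email are counted each time they occur.
        totals.update(words)
    return dict(totals)
```

For example, a word that occurs twice in one email and once in another ends up with a frequency of 3.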
S104: For each retained word whose counted frequency is below a set threshold, calculate the IDF value of the retained word over the user-reported email set, and record the retained word together with its IDF value in the IDF dictionary.
Specifically, in the anti-spam field, a retained word that occurs with high frequency across the user-reported emails in the set is generally considered unrepresentative, whereas retained words with lower frequency are very likely to become candidate rule words for the anti-spam filtering rules and help operations personnel to discover and refine those rules. Therefore, in this step, the retained words whose counted frequency is below the set threshold are recorded in the IDF dictionary. For each such retained word, the IDF value of the word over the user-reported email set is also calculated, and the retained word and its IDF value are recorded together in the IDF dictionary.
The IDF value of a retained word over the user-reported email set is calculated as follows: the total number of user-reported emails in the set is divided by the number of user-reported emails that contain the retained word, and the logarithm of the quotient is taken as the IDF value of the retained word.
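As a hedged sketch of step S104, the following Python function filters retained words by frequency and computes IDF as log(N / n), where N is the number of emails in the set and n is the number of emails containing the word. The function name, the parameter `frequency_threshold`, and the choice of the natural logarithm are assumptions; the text does not state the base of the logarithm.

```python
import math
from collections import Counter
from typing import Dict, List


def build_idf_dictionary(
    retained_words_per_mail: List[List[str]],
    frequency_threshold: int,
) -> Dict[str, float]:
    """Map each sufficiently rare retained word to its IDF over the email set."""
    total_mails = len(retained_words_per_mail)
    frequency: Counter = Counter()         # total occurrences across the set
    mails_containing: Counter = Counter()  # number of emails containing the word
    for words in retained_words_per_mail:
        frequency.update(words)
        mails_containing.update(set(words))
    return {
        word: math.log(total_mails / mails_containing[word])  # assumed natural log
        for word, count in frequency.items()
        if count < frequency_threshold  # keep only words below the threshold
    }
```

With four emails and a threshold of 3, a word occurring four times in total is dropped, while a word found in a single email receives an IDF of log(4).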
In practice, for any user-reported email A in the user-reported email set, the flow of the method for determining the retained words of email A is shown in Fig. 1b and specifically includes the following steps:
S111: Perform denoising on the mail text of user-reported email A.
Specifically, the mail text of a user-reported email often differs from ordinary text or short text, because its sender usually adds various kinds of noise to it in a variety of ways, for example special characters such as "☆" and "◆". Therefore, in this step, the special characters, spaces, punctuation marks, and the like are removed from the mail text of user-reported email A to obtain a mail text that contains only words, that is, the denoised mail text of user-reported email A. The mail text of user-reported email A may include both its subject text and its body text.
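A minimal Python sketch of the denoising step follows. It assumes that "noise" can be approximated as every character that is not a Unicode letter or digit; the function name `denoise` and this regular expression are illustrative choices, not the patent's definition. Removing spaces suits Chinese text, which is segmented afterwards; for space-delimited languages it would fuse adjacent words.

```python
import re

# Assumption: anything that is not a Unicode letter or digit counts as noise
# (special characters, punctuation marks, whitespace, underscores).
_NOISE = re.compile(r"[\W_]+")


def denoise(subject: str, body: str) -> str:
    """Join subject and body text and strip all noise characters."""
    return _NOISE.sub("", subject + body)
```

For example, `denoise("☆特价◆", "发票, 代开!")` leaves only the Chinese characters of the subject and body.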
S112: Perform word segmentation on the denoised mail text of user-reported email A to obtain the word set of email A, and tag the part of speech of each word in the word set.
Specifically, the intelligent word segmentation algorithm CRF (Conditional Random Field) is used to segment the denoised mail text of user-reported email A, that is, to divide the continuous character sequence in the denoised mail text into individual words.
Moreover, the part of speech of each word in the word set of user-reported email A can be tagged on the basis of the CRF algorithm. For example, the part of speech of "discount" is tagged as noun.
S113: After removing the stop words from the word set of user-reported email A, take the words in the word set whose part of speech matches a part of speech recorded in the part-of-speech table as the retained words of the email.
Specifically, the stop words in the word set of user-reported email A are removed according to the stop words recorded in a stop word list; that is, words that carry no substantive meaning in the mail text of email A are deleted.
Moreover, a person skilled in the art can, on the basis of experience, record in the part-of-speech table the parts of speech that are helpful for upgrading the anti-spam filtering rules, such as nouns, verbs, adverbial idioms, and distinguishing-word idioms. In this step, for each word remaining in the word set of user-reported email A after the stop words are removed, if the part of speech of the word is recorded in the part-of-speech table, the word is determined to be a retained word of email A. For example, if the word set of email A still contains the word "discount" after the stop words are removed, its part of speech is noun, and noun is recorded in the part-of-speech table, then "discount" is determined to be a retained word of email A.
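Step S113 can be sketched in Python as a simple filter over part-of-speech-tagged words. The function name, the `(word, tag)` tuple layout, and the short tag codes used in the usage example are assumptions; the segmentation and tagging themselves (done with CRF in the text) are taken as already completed.

```python
from typing import Iterable, List, Set, Tuple


def select_retained_words(
    tagged_words: Iterable[Tuple[str, str]],
    stop_words: Set[str],
    allowed_pos: Set[str],
) -> List[str]:
    """Drop stop words, then keep words whose tag is in the part-of-speech table."""
    return [
        word
        for word, pos in tagged_words
        if word not in stop_words and pos in allowed_pos
    ]
```

With `allowed_pos = {"n", "v"}` (hypothetical codes for noun and verb), the pair `("discount", "n")` is kept, while stop words and adverbs are discarded.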
Based on the above-mentioned IDF dictionaries predefined out, regular word in anti-spam filtering rule provided in an embodiment of the present invention Determination method flow, as shown in Fig. 2 specifically comprising the following steps:
S201:The mail text of user's report mail to currently obtaining carries out denoising, word segmentation processing, obtains user act Report the set of words of mail.
Specifically, the additional character in the mail text for the user's report mail that removal currently obtains, punctuation mark, space Deng, and carry out word segmentation processing with the mail text of user's report mail of the condition random field CRF algorithms to currently obtaining.
S202: Determine the words in the word set of the currently obtained user-reported mail that match words in the IDF dictionary as the effective words of the user-reported mail.
Specifically, for each word in the word set of the currently obtained user-reported mail, the IDF dictionary is searched to see whether the word is recorded there; if so, the word is taken as an effective word of the user-reported mail; otherwise, the word is taken as an invalid word of the user-reported mail.
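Steps S201 and S202 can be sketched as follows. This is only an illustration under stated assumptions: a trained CRF segmenter is not reproduced here, so the hypothetical `segment` function simply splits on whitespace, and for that reason `denoise` collapses noise into single spaces instead of deleting spaces outright as the text describes. The function names and the sample IDF values are invented for the example.

```python
import re


def denoise(text):
    """Replace runs of punctuation, special characters, and whitespace with one space.

    Note: \\W keeps letters, digits, and the underscore (Unicode-aware in Python 3).
    """
    return re.sub(r"\W+", " ", text).strip()


def segment(text):
    """Stand-in for CRF word segmentation (assumption): split on whitespace."""
    return text.split()


def effective_words(mail_text, idf_dict):
    """Keep the words of the mail that are recorded in the IDF dictionary."""
    words = segment(denoise(mail_text))
    return [w for w in words if w in idf_dict]


idf_dict = {"discount": 2.3, "invoice": 1.7}  # hypothetical dictionary entries
print(effective_words("Big discount!!! Get your invoice, discount now.", idf_dict))
# ['discount', 'invoice', 'discount']
```

Words absent from the dictionary ("Big", "Get", "your", "now") correspond to the invalid words of the mail.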
S203: For each effective word of the currently obtained user-reported mail, calculate the weight value of the effective word.
Specifically, for each effective word of the currently obtained user-reported mail, the TF value of the effective word in the mail text of the user-reported mail is counted, and the weight value of the effective word is calculated from the counted TF value and the IDF value of the effective word in the IDF dictionary. The weight value is typically the product of the counted TF value and the IDF value of the effective word in the IDF dictionary.
Here, the TF value of an effective word in the mail text of a user-reported mail is the ratio of the number of times the effective word occurs in the mail text of the user-reported mail to the total number of effective words in that mail text.
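The weight calculation of S203 can be sketched in a few lines of Python. The function name `effective_word_weights` and the sample IDF values are assumptions made for illustration; the TF definition follows the text above (occurrences of the word divided by the total number of effective words in the mail).

```python
from collections import Counter


def effective_word_weights(effective, idf_dict):
    """Return {word: TF * IDF} for the effective words of one mail.

    `effective` lists every effective-word occurrence in the mail text,
    so its length is the total number of effective words.
    """
    counts = Counter(effective)
    total = len(effective)
    return {word: (count / total) * idf_dict[word] for word, count in counts.items()}


weights = effective_word_weights(
    ["discount", "invoice", "discount", "discount"],
    {"discount": 2.0, "invoice": 4.0},  # hypothetical IDF values
)
print(weights)  # {'discount': 1.5, 'invoice': 1.0}
```

In this example "discount" has TF 3/4 and IDF 2.0, giving weight 1.5, while "invoice" has TF 1/4 and IDF 4.0, giving weight 1.0.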
S204: Sort the calculated weight values of the effective words in descending order, and take the set quantity of top-ranked effective words as the signature words of the currently obtained user-reported mail.
For an effective word, a larger TF value in the mail text of the currently obtained user-reported mail indicates that the effective word occurs more frequently in that mail, and a larger IDF value in the user-reported mail set indicates that the effective word occurs less frequently across the user-reported mail set. Therefore, a larger weight value (that is, the product of the TF value and the IDF value) indicates that the effective word better reflects the features of the currently obtained user-reported mail and, correspondingly, better reflects its substantive content.
The set quantity is chosen by those skilled in the art and may be, for example, 5 or 10.
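Selecting the signature words in S204 amounts to a top-N selection by weight. A minimal sketch, assuming the weights are held in a dictionary and using the hypothetical name `signature_words`:

```python
def signature_words(weights, n=5):
    """Return the n effective words with the largest weight values, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


weights = {"discount": 1.5, "invoice": 1.0, "loan": 0.2}  # illustrative values
print(signature_words(weights, n=2))  # ['discount', 'invoice']
```

If a mail has fewer than `n` effective words, all of them are returned.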
S205: Upgrade the rule words in the anti-spam filtering rule according to the obtained signature words.
Specifically, the signature words of the currently obtained user-reported mail that differ from the rule words in the original anti-spam filtering rule can be added as new rule words to upgrade the anti-spam filtering rule.
In practice, operation and maintenance personnel can also view the signature words of a user-reported mail directly to learn its substantive content, and manually select rule words for the anti-spam filtering rule from those signature words to upgrade the rule. Even in this manual mode, the workload of the operation and maintenance personnel is greatly reduced, because they no longer need to read the full text of each user-reported mail, which saves considerable manpower. Moreover, because the present invention obtains the signature words of user-reported mails automatically, different operation and maintenance personnel can directly view the data they need, which facilitates data sharing.
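The automatic variant of S205 can be sketched as a set update. Representing the rule words as a Python `set` and the name `upgrade_rule_words` are assumptions for illustration; the patent does not specify how the filtering rule is stored.

```python
def upgrade_rule_words(rule_words, signature):
    """Add signature words not yet in the rule set; return the newly added words."""
    new_words = [word for word in signature if word not in rule_words]
    rule_words.update(new_words)
    return new_words


rules = {"lottery", "discount"}  # hypothetical existing rule words
added = upgrade_rule_words(rules, ["discount", "invoice"])
print(added)           # ['invoice']
print(sorted(rules))   # ['discount', 'invoice', 'lottery']
```

Returning the list of added words gives operation and maintenance personnel a simple record of what each upgrade changed.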
Preferably, after obtaining the signature words of the currently obtained user-reported mail, the present invention can also sort the obtained signature words by initial letter to form the signature word vector of the user-reported mail, and compare the newly formed signature word vector with the signature word vectors of previously obtained user-reported mails recorded in a cache. If it differs from all of them, the newly formed signature word vector is recorded in the cache; if it is identical to at least one of the signature word vectors recorded in the cache, it is not recorded. In fact, each signature word vector obtained before the current time was likewise recorded in the cache after it was obtained.
Correspondingly, when it is determined that the number of signature word vectors recorded in the cache reaches a set quantity, or each time an upgrade cycle arrives, the present invention can upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache; that is, the rule words to be upgraded are determined from the vector elements of the signature word vectors recorded in the cache. The cache is then emptied. The set quantity and the upgrade cycle can be chosen by those skilled in the art.
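The cache behaviour described above can be sketched as a small class. Everything named here (`SignatureCache`, `flush_threshold`, the method names) is hypothetical; full lexicographic sorting stands in for "sorting by initial letter", and the time-based upgrade cycle is left out, so only the quantity trigger is shown.

```python
class SignatureCache:
    """Cache of distinct signature word vectors awaiting a rule upgrade."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # the "set quantity" (assumed name)
        self._vectors = set()

    def record(self, signature):
        """Sort the signature words into a vector; store it only if it is new."""
        vector = tuple(sorted(signature))
        if vector in self._vectors:
            return False  # identical vector already cached, so not recorded
        self._vectors.add(vector)
        return True

    def ready(self):
        """True once the number of cached vectors reaches the set quantity."""
        return len(self._vectors) >= self.flush_threshold

    def flush(self, rule_words):
        """Add every vector element to the rule words, then empty the cache."""
        for vector in self._vectors:
            rule_words.update(vector)
        self._vectors.clear()


cache = SignatureCache(flush_threshold=2)
cache.record(["invoice", "discount"])
cache.record(["discount", "invoice"])  # same vector after sorting, skipped
cache.record(["loan"])
rules = set()
if cache.ready():
    cache.flush(rules)
print(sorted(rules))  # ['discount', 'invoice', 'loan']
```

Using a tuple of sorted words makes two mails with the same signature words compare equal regardless of the order in which the words were ranked.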
Further, user-reported mails with identical signature word vectors can be grouped into one class, and the mails of each class can be stored in association with their signature word vector. In this way, when operation and maintenance personnel manually upgrade the rule words in the anti-spam filtering rule, they need not check the signature words of each user-reported mail one by one; for a class of user-reported mails sharing the same signature word vector, they need only review that one vector when considering rule-word upgrades, which further improves upgrade efficiency.
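Grouping mails by signature word vector can be sketched with a dictionary keyed by the vector. The mail identifiers and the function name `group_by_signature` are invented for the example, and lexicographic sorting again stands in for sorting by initial letter.

```python
from collections import defaultdict


def group_by_signature(mail_signatures):
    """Map each signature word vector to the ids of the mails that share it."""
    groups = defaultdict(list)
    for mail_id, signature in mail_signatures.items():
        groups[tuple(sorted(signature))].append(mail_id)
    return dict(groups)


# Hypothetical mail ids and signature words.
mails = {
    "mail-1": ["invoice", "discount"],
    "mail-2": ["discount", "invoice"],
    "mail-3": ["loan"],
}
for vector, ids in group_by_signature(mails).items():
    print(vector, ids)
```

A reviewer then inspects one vector per class instead of one signature per mail.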
Based on the above method for determining rule words in the anti-spam filtering rule, the internal structure of the device for determining rule words in the anti-spam filtering rule provided by an embodiment of the present invention is shown in the block diagram of Fig. 3, and specifically includes: a word set determining module 301, an effective word determining module 302, a weight value calculation module 303, a signature word determining module 304, and a rule upgrade module 305.
The word set determining module 301 is configured to perform denoising and word segmentation on the mail text of the currently obtained user-reported mail to obtain the word set of the user-reported mail. Specifically, the word set determining module 301 removes special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and segments the mail text into words with the CRF algorithm.
The effective word determining module 302 is configured to determine the words in the word set of the currently obtained user-reported mail that match words in the IDF dictionary as the effective words of the user-reported mail.
The weight value calculation module 303 is configured to, for each effective word of the currently obtained user-reported mail determined by the effective word determining module 302, count the term frequency (TF) value of the effective word in the mail text of the user-reported mail, and calculate the weight value of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary.
The signature word determining module 304 is configured to sort the weight values of the effective words calculated by the weight value calculation module 303 in descending order, and to take the set quantity of top-ranked effective words as the signature words of the user-reported mail.
The rule upgrade module 305 is configured to upgrade the rule words in the anti-spam filtering rule according to the signature words obtained by the signature word determining module 304.
Further, the signature word determining module 304 is also configured to, after taking the set quantity of top-ranked effective words as the signature words of the user-reported mail, sort the obtained signature words by initial letter to form the signature word vector of the user-reported mail; compare the newly formed signature word vector with the signature word vectors of previously obtained user-reported mails recorded in the cache; if it differs, record it in the cache; and if it is identical, not record it.
Correspondingly, the rule upgrade module 305 is specifically configured to, when it is determined that the number of signature word vectors recorded in the cache reaches the set quantity or each time an upgrade cycle arrives, upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then empty the cache.
Further, the above device for determining rule words in the anti-spam filtering rule may also include an IDF dictionary determining module 306.
The IDF dictionary determining module 306 is configured to obtain the user-reported mails within a set time period to obtain a user-reported mail set; for each user-reported mail in the set, perform denoising and word segmentation on its mail text to obtain its word set, and label the part of speech of each word in the word set; after removing the stop words from the word set, take the words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the user-reported mail; then, for each user-reported mail in the set, count the frequency with which each retained word of the mail occurs in the user-reported mails of the set; and, for each retained word whose counted frequency is less than a set threshold, calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
Specifically, the internal structure of the IDF dictionary determining module 306 is shown in the block diagram of Fig. 4, and includes: a user-reported mail acquiring unit 401, a retained word determining unit 402, a frequency statistics unit 403, and an IDF dictionary determining unit 404.
The user-reported mail acquiring unit 401 is configured to, each time an update cycle arrives, obtain the user-reported mails within the set time period to obtain the user-reported mail set.
The retained word determining unit 402 is configured to, for each user-reported mail in the user-reported mail set, perform denoising and word segmentation on the mail text of the user-reported mail to obtain its word set, and label the part of speech of each word in the word set; and, after removing the stop words from the word set, take the words in the word set whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the user-reported mail.
The frequency statistics unit 403 is configured to, for each user-reported mail in the user-reported mail set, count the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the set.
The IDF dictionary determining unit 404 is configured to, for each retained word whose frequency counted by the frequency statistics unit 403 is less than the set threshold, calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
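The work of units 403 and 404 can be sketched as below, under two assumptions that this passage does not confirm: the "frequency" of a retained word is read as the number of mails in the set that contain it, and the IDF value uses the common textbook form log(N / df), where N is the number of mails. The names `build_idf_dict` and `max_doc_freq` are hypothetical.

```python
import math
from collections import Counter


def build_idf_dict(retained_per_mail, max_doc_freq):
    """Build {retained word: IDF} from the retained words of each mail in the set.

    Only words whose document frequency is below max_doc_freq (the assumed
    "set threshold") are recorded.
    """
    n_mails = len(retained_per_mail)
    doc_freq = Counter()
    for words in retained_per_mail:
        doc_freq.update(set(words))  # count each word at most once per mail
    return {
        word: math.log(n_mails / df)  # assumed IDF formula: log(N / df)
        for word, df in doc_freq.items()
        if df < max_doc_freq
    }


mails = [["discount", "invoice"], ["discount"], ["discount", "loan"], ["loan"]]
idf = build_idf_dict(mails, max_doc_freq=3)
print(sorted(idf))  # ['invoice', 'loan']
```

In this example "discount" appears in three of the four mails, reaches the threshold, and is therefore left out of the dictionary.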
In the technical solution of the present invention, denoising and word segmentation are performed on the mail text of the currently obtained user-reported mail to obtain its word set; after the effective words of the user-reported mail are determined according to the IDF dictionary, the TF value of each effective word in the mail text is counted, and the weight value of each effective word is obtained from the IDF values recorded in the IDF dictionary; the signature words of the user-reported mail are then determined from the effective words according to their weight values, for use in selecting rule words for the anti-spam filtering rule. Because the present invention can automatically determine the signature words of a user-reported mail and upgrade the rule words in the anti-spam filtering rule according to them, operation and maintenance personnel need not query user-reported mails in large volumes of user behavior logs, nor review the full text of each queried mail one by one. This reduces manual workload, saves manpower, and improves the efficiency of upgrading the anti-spam filtering rule. Moreover, the automatic upgrade approach can extract signature words from every user-reported mail in the user behavior log, so data is unlikely to be missed, which improves the effectiveness of the anti-spam filtering rule as far as possible and better supports anti-spam work.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (8)

  1. An anti-spam filtering rule upgrade method, characterized by comprising:
    performing denoising and word segmentation on the mail text of a currently obtained user-reported mail to obtain the word set of the user-reported mail;
    determining the words in the word set that match words in an inverse document frequency (IDF) dictionary as the effective words of the user-reported mail;
    for each effective word of the user-reported mail, counting the term frequency (TF) value of the effective word in the mail text of the user-reported mail, and calculating the weight value of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
    sorting the calculated weight values of the effective words in descending order, and taking the set quantity of top-ranked effective words as the signature words of the user-reported mail;
    upgrading the rule words in the anti-spam filtering rule according to the obtained signature words;
    wherein the IDF dictionary is predetermined, and the method for determining the IDF dictionary comprises:
    obtaining the user-reported mails within a set time period to obtain a user-reported mail set;
    for each user-reported mail in the user-reported mail set, performing denoising and word segmentation on the mail text of the user-reported mail to obtain the word set of the user-reported mail, and labeling the part of speech of each word in the word set; after removing the stop words from the word set of the user-reported mail, determining the words in the word set whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the retained words of the user-reported mail;
    then, for each user-reported mail in the user-reported mail set, counting the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set;
    for each retained word whose counted frequency is less than a set threshold, calculating the IDF value of the retained word in the user-reported mail set, and recording the retained word together with its IDF value in the IDF dictionary.
  2. The method of claim 1, characterized in that obtaining the user-reported mails within the set time period to obtain the user-reported mail set specifically comprises:
    each time an update cycle arrives, obtaining the user-reported mails within the set time period to obtain the user-reported mail set.
  3. The method of claim 1, characterized in that, after taking the set quantity of top-ranked effective words as the signature words of the user-reported mail, the method further comprises:
    sorting the obtained signature words by initial letter to form the signature word vector of the user-reported mail;
    comparing the newly formed signature word vector of the user-reported mail with the signature word vectors of previously obtained user-reported mails recorded in a cache; if it differs, recording the newly formed signature word vector of the user-reported mail in the cache; and
    upgrading the rule words in the anti-spam filtering rule according to the obtained signature words specifically comprises:
    when the number of signature word vectors recorded in the cache reaches a set quantity, or each time an upgrade cycle arrives, upgrading the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then emptying the cache.
  4. The method of any one of claims 1 to 3, characterized in that performing denoising and word segmentation on the mail text of the currently obtained user-reported mail specifically comprises:
    removing special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and segmenting the mail text of the currently obtained user-reported mail into words with a conditional random field (CRF) algorithm.
  5. An anti-spam filtering rule upgrade device, characterized by comprising:
    a word set determining module, configured to perform denoising and word segmentation on the mail text of a currently obtained user-reported mail to obtain the word set of the user-reported mail;
    an effective word determining module, configured to determine the words in the word set that match words in an inverse document frequency (IDF) dictionary as the effective words of the user-reported mail;
    a weight value calculation module, configured to, for each effective word of the user-reported mail, count the term frequency (TF) value of the effective word in the mail text of the user-reported mail, and calculate the weight value of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
    a signature word determining module, configured to sort the weight values of the effective words calculated by the weight value calculation module in descending order, and take the set quantity of top-ranked effective words as the signature words of the user-reported mail;
    a rule upgrade module, configured to upgrade the rule words in the anti-spam filtering rule according to the signature words obtained by the signature word determining module;
    an IDF dictionary determining module, configured to obtain the user-reported mails within a set time period to obtain a user-reported mail set; for each user-reported mail in the user-reported mail set, perform denoising and word segmentation on the mail text of the user-reported mail to obtain the word set of the user-reported mail, and label the part of speech of each word in the word set; after removing the stop words from the word set of the user-reported mail, determine the words in the word set whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the retained words of the user-reported mail; then, for each user-reported mail in the user-reported mail set, count the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set; and, for each retained word whose counted frequency is less than a set threshold, calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
  6. The device of claim 5, characterized in that the IDF dictionary determining module specifically includes:
    a user-reported mail acquiring unit, configured to, each time an update cycle arrives, obtain the user-reported mails within the set time period to obtain the user-reported mail set;
    a retained word determining unit, configured to, for each user-reported mail in the user-reported mail set, perform denoising and word segmentation on the mail text of the user-reported mail to obtain the word set of the user-reported mail, and label the part of speech of each word in the word set; and, after removing the stop words from the word set of the user-reported mail, determine the words in the word set whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the user-reported mail;
    a frequency statistics unit, configured to, for each user-reported mail in the user-reported mail set, count the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set;
    an IDF dictionary determining unit, configured to, for each retained word whose frequency counted by the frequency statistics unit is less than the set threshold, calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
  7. The device of claim 5, characterized in that
    the signature word determining module is further configured to, after taking the set quantity of top-ranked effective words as the signature words of the user-reported mail, sort the obtained signature words by initial letter to form the signature word vector of the user-reported mail; compare the newly formed signature word vector of the user-reported mail with the signature word vectors of previously obtained user-reported mails recorded in a cache; and, if it differs, record the newly formed signature word vector of the user-reported mail in the cache; and
    the rule upgrade module is specifically configured to, when it is determined that the number of signature word vectors recorded in the cache reaches a set quantity or each time an upgrade cycle arrives, upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then empty the cache.
  8. The device of any one of claims 5 to 7, characterized in that
    the word set determining module is specifically configured to remove special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and to segment the mail text of the currently obtained user-reported mail into words with a conditional random field (CRF) algorithm to obtain the word set of the user-reported mail.
CN201410102982.5A 2014-03-19 2014-03-19 Anti-spam filtering rule upgrade method and device Active CN103902673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410102982.5A CN103902673B (en) 2014-03-19 2014-03-19 Anti-spam filtering rule upgrade method and device

Publications (2)

Publication Number Publication Date
CN103902673A CN103902673A (en) 2014-07-02
CN103902673B true CN103902673B (en) 2017-11-24

Family

ID=50993995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410102982.5A Active CN103902673B (en) 2014-03-19 2014-03-19 Anti-spam filtering rule upgrade method and device

Country Status (1)

Country Link
CN (1) CN103902673B (en)


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230417

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.