CN103902673B - Anti-spam filtering rule upgrade method and device - Google Patents
- Publication number
- CN103902673B CN201410102982.5A
- Authority
- CN
- China
- Prior art keywords
- user
- word
- report
- report mail
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an anti-spam filtering rule upgrade method and device. The method includes: performing denoising and word segmentation on the mail text of a currently obtained user-reported email to obtain the word set of that email; determining the words in the word set that match words in an IDF dictionary as the effective words of the email; for each effective word, counting its term frequency (TF) value in the mail text of the email, and calculating the weight of the effective word from the counted TF value and the IDF value of that word in the IDF dictionary; sorting the effective words by weight in descending order, taking a set number of top-ranked effective words as the signature words of the email, and upgrading the rule words in the anti-spam filtering rule according to the signature words obtained. The invention saves manpower and improves both the efficiency of upgrading anti-spam filtering rules and the effectiveness of those rules.
Description
Technical field
The present invention relates to the Internet field, and in particular to an anti-spam filtering rule upgrade method and device.
Background
In the current Internet information age, people increasingly exchange information and communicate by email. Email transmits information across the network step by step in a store-and-forward manner; it spreads quickly, reaches a wide audience, and is cheap. Some merchants or organizations take advantage of this to send emails containing advertising or malicious, false content, and these emails cause considerable disturbance.
At present, a user can use the report function provided by a mailbox service: by clicking the report button in the mailbox, the user reports a received email, and the user report data system records the reported email in the user behavior log. For ease of description, an email reported by a user is referred to as a user-reported email. Operations personnel can extract keywords from the user behavior log recorded by the user report data system using shell commands or scripts, review user-reported emails of the relevant types, and summarize their characteristic data in order to set or upgrade anti-spam filtering rules. Here, the characteristic data of a user-reported email are words that reflect the features of that email. Operations personnel can set these words as rule words in an anti-spam filtering rule, so that emails containing these rule words are identified or intercepted as spam.
However, when determining the rule words of an anti-spam filtering rule in the prior art, operations personnel extract user-reported emails directly from a large volume of user behavior logs, manually analyze their characteristic data, and determine the rule words from it. This approach is time-consuming and laborious, involves a large workload, and makes upgrading anti-spam filtering rules inefficient. Moreover, it is difficult for operations personnel to check every user-reported email in all the user behavior logs, so this approach easily misses data. The anti-spam filtering rules derived this way may therefore be less effective, so that a portion of user-reported emails would still not be filtered out by the rules that were set, which hinders anti-spam work.
Therefore, it is necessary to provide an anti-spam filtering rule upgrade method that both saves manpower and improves the upgrade efficiency and effectiveness of anti-spam filtering rules.
Summary of the invention
In view of the above defects of the prior art, the invention provides an anti-spam filtering rule upgrade method and device, so as to save manpower and improve the upgrade efficiency and effectiveness of anti-spam filtering rules.
According to one aspect of the invention, an anti-spam filtering rule upgrade method is provided, including:
Performing denoising and word segmentation on the mail text of the currently obtained user-reported email to obtain the word set of the user-reported email;
Determining the words in the word set that match words in an inverse document frequency (IDF) dictionary as the effective words of the user-reported email;
For each effective word of the user-reported email, counting the term frequency (TF) value of the effective word in the mail text of the user-reported email, and calculating the weight of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
Sorting the calculated weights of the effective words in descending order, and taking the set number of top-ranked effective words as the signature words of the user-reported email;
Upgrading the rule words in the anti-spam filtering rule according to the signature words obtained.
The IDF dictionary is determined in advance, and the method of determining it includes:
Obtaining the user-reported emails within a set time period to obtain a user-reported email set;
For each user-reported email in the set, performing denoising and word segmentation on its mail text to obtain its word set, and labeling the part-of-speech information of each word in the word set; after removing the stop words from the word set, determining the words whose part-of-speech information matches the part-of-speech information recorded in a part-of-speech information table as the retained words of the user-reported email;
Then, for each user-reported email in the set, counting the frequency with which each retained word of that email occurs across the user-reported emails in the set;
For each retained word whose counted frequency is below a set threshold, calculating the IDF value of the retained word in the user-reported email set, and recording the retained word together with its IDF value in the IDF dictionary.
Preferably, obtaining the user-reported emails within the set time period to obtain the user-reported email set is specifically:
Each time an update cycle is reached, obtaining the user-reported emails within the set time period to obtain the user-reported email set.
Preferably, after the set number of top-ranked effective words are taken as the signature words of the user-reported email, the method further includes:
Sorting the signature words obtained by initial letter to form the signature word vector of the user-reported email;
Comparing the newly formed signature word vector with the signature word vectors of previously obtained user-reported emails recorded in a cache; if it differs from them, recording the newly formed signature word vector in the cache; and
Upgrading the rule words in the anti-spam filtering rule according to the signature words obtained specifically includes:
When the number of signature word vectors recorded in the cache reaches a set number, or each time an upgrade cycle is reached, upgrading the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then emptying the cache.
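The caching behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `SignatureCache`, the parameter `flush_size`, and the use of Python's default string ordering as a stand-in for sorting "by initial letter" are all hypothetical, and the time-based upgrade cycle is omitted.

```python
class SignatureCache:
    """Toy cache of signature word vectors (illustrative names only)."""

    def __init__(self, flush_size):
        self.flush_size = flush_size  # hypothetical "set number" of cached vectors
        self.vectors = []             # vectors recorded since the last upgrade

    def add(self, signature_words):
        """Sort the words into a vector and record it only if it is new."""
        # Default string ordering stands in for "sorted by initial letter".
        vector = tuple(sorted(signature_words))
        if vector not in self.vectors:
            self.vectors.append(vector)
        return vector

    def should_flush(self):
        """True once the number of cached vectors reaches the set number."""
        return len(self.vectors) >= self.flush_size

    def flush_into(self, rule_words):
        """Add every cached word to the rule words, then empty the cache."""
        for vector in self.vectors:
            rule_words.update(vector)
        self.vectors.clear()
        return rule_words


cache = SignatureCache(flush_size=2)
cache.add(["invoice", "discount"])
cache.add(["discount", "invoice"])  # same vector once sorted, so not recorded again
cache.add(["prize", "lottery"])
rule_words = {"invoice"}
if cache.should_flush():
    cache.flush_into(rule_words)
```

Sorting before comparison means two emails with the same signature words in a different order produce the same vector, which is what makes the duplicate check meaningful.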
Preferably, performing denoising and word segmentation on the mail text of the currently obtained user-reported email specifically includes:
Removing special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported email, and segmenting the mail text with the conditional random field (CRF) algorithm.
According to another aspect of the invention, an anti-spam filtering rule upgrade device is also provided, including:
A word set determining module, configured to perform denoising and word segmentation on the mail text of the currently obtained user-reported email to obtain the word set of the user-reported email;
An effective word determining module, configured to determine the words in the word set that match words in the inverse document frequency (IDF) dictionary as the effective words of the user-reported email;
A weight calculation module, configured to count, for each effective word of the user-reported email, the term frequency (TF) value of the effective word in the mail text of the user-reported email, and to calculate the weight of the effective word from the counted TF value and the IDF value of the effective word in the IDF dictionary;
A signature word determining module, configured to sort the weights calculated by the weight calculation module in descending order and to take the set number of top-ranked effective words as the signature words of the user-reported email;
A rule upgrade module, configured to upgrade the rule words in the anti-spam filtering rule according to the signature words obtained by the signature word determining module.
Further, the anti-spam filtering rule upgrade device also includes:
An IDF dictionary determining module, configured to: obtain the user-reported emails within a set time period to obtain a user-reported email set; for each user-reported email in the set, perform denoising and word segmentation on its mail text to obtain its word set, and label the part-of-speech information of each word in the word set; after removing the stop words from the word set, determine the words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the user-reported email; then, for each user-reported email in the set, count the frequency with which each retained word of that email occurs across the user-reported emails in the set; and, for each retained word whose counted frequency is below a set threshold, calculate the IDF value of the retained word in the user-reported email set and record the retained word together with its IDF value in the IDF dictionary.
Preferably, the IDF dictionary determining module specifically includes:
A user-reported email obtaining unit, configured to obtain, each time an update cycle is reached, the user-reported emails within the set time period to obtain the user-reported email set;
A retained word determining unit, configured to: for each user-reported email in the set, perform denoising and word segmentation on its mail text to obtain its word set, and label the part-of-speech information of each word in the word set; and, after removing the stop words from the word set, determine the words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the user-reported email;
A frequency counting unit, configured to count, for each user-reported email in the set, the frequency with which each retained word of that email occurs across the user-reported emails in the set;
An IDF dictionary determining unit, configured to calculate, for each retained word whose frequency counted by the frequency counting unit is below the set threshold, the IDF value of the retained word in the user-reported email set, and to record the retained word together with its IDF value in the IDF dictionary.
Preferably, the signature word determining module is further configured to: after taking the set number of top-ranked effective words as the signature words of the user-reported email, sort the signature words obtained by initial letter to form the signature word vector of the email; compare the newly formed signature word vector with the signature word vectors of previously obtained user-reported emails recorded in a cache; and, if it differs from them, record the newly formed signature word vector in the cache; and
The rule upgrade module is specifically configured to: when it determines that the number of signature word vectors recorded in the cache reaches a set number, or each time an upgrade cycle is reached, upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then empty the cache.
Preferably, the word set determining module is specifically configured to remove special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported email, and to segment the mail text with the conditional random field (CRF) algorithm to obtain the word set of the user-reported email.
In the technical solution of the invention, denoising and word segmentation are performed on the mail text of the currently obtained user-reported email to obtain its word set; after the effective words of the email are determined according to the IDF dictionary, the TF value of each effective word in the mail text is counted, and the weight of each effective word is obtained from the TF value and the IDF value recorded in the IDF dictionary; the signature words of the email are then determined from the effective words according to their weights, for use in selecting rule words for the anti-spam filtering rule. Because the invention automatically determines the signature words of user-reported emails and upgrades the rule words in the anti-spam filtering rule accordingly, operations personnel no longer need to query user-reported emails from a large volume of user behavior logs, nor review the full text of each email returned by the query. This reduces manual workload, saves manpower, and improves the efficiency of upgrading anti-spam filtering rules. Moreover, the automatic upgrade approach can extract signature words from every user-reported email in the user behavior log, so data is less likely to be missed, the effectiveness of the anti-spam filtering rules is improved as far as possible, and anti-spam work benefits.
Brief description of the drawings
Fig. 1a is a flow chart of the method of determining the IDF dictionary according to an embodiment of the invention;
Fig. 1b is a flow chart of the method of obtaining the retained words of a user-reported email according to an embodiment of the invention;
Fig. 2 is a flow chart of the anti-spam filtering rule upgrade method according to an embodiment of the invention;
Fig. 3 is a block diagram of the internal structure of the anti-spam filtering rule upgrade device according to an embodiment of the invention;
Fig. 4 is a block diagram of the internal structure of the IDF dictionary determining module according to an embodiment of the invention.
Detailed description of the embodiments
The technical solution of the invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative work fall within the scope of protection of the invention.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination thereof, software, or software in execution. For example, a module can be, but is not limited to, a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself can be modules. One or more modules can reside within a process and/or thread of execution, and a module can be located on one computer and/or distributed between two or more computers.
The main idea of the invention is as follows: for user-reported emails obtained from the user behavior log, several words that reflect the substantive content of a user-reported email (that is, its characteristic data) are automatically extracted from its mail text based on the TF-IDF (term frequency-inverse document frequency) model and used as the signature words of that email; the anti-spam filtering rule is then upgraded according to the signature words obtained. Because the user-reported emails in the user behavior log are analyzed and the anti-spam filtering rule is upgraded automatically, operations personnel no longer need to query user-reported emails from a large volume of user behavior logs, nor review the full text of each email returned by the query. This reduces manual workload, saves manpower, and improves the efficiency of upgrading anti-spam filtering rules. In addition, this automatic upgrade approach can extract signature words from every user-reported email in the user behavior log, so data is less likely to be missed, the effectiveness of the anti-spam filtering rules is improved as far as possible, and anti-spam work benefits.
Based on this idea, in the technical solution of the invention, for a user-reported email obtained from the user behavior log, denoising and word segmentation are performed on its mail text to obtain its word set. The words in the word set are matched against the words in the previously obtained IDF dictionary to determine the effective words of the email. The TF value of each effective word in the mail text is then counted, and the weight of each effective word is obtained from the TF value and the IDF value of the effective word recorded in the IDF dictionary. The signature words of the email are determined from the effective words according to their weights and are used to upgrade the anti-spam filtering rule. This saves manpower, improves the efficiency of upgrading anti-spam filtering rules, and improves their effectiveness as far as possible.
The technical solution of the invention is described in detail below with reference to the accompanying drawings. In a specific embodiment of the invention, an IDF dictionary can be determined in advance, before the signature words of a user-reported email are determined. Specifically, the user-reported emails within a set time period are obtained from the user behavior log to form a user-reported email set; each email in the set is processed to obtain its retained words; a portion of the retained words is selected according to a set rule; and the selected retained words and their IDF values are recorded in the IDF dictionary. The flow of the specific method of determining the IDF dictionary, as shown in Fig. 1a, includes the following steps:
S101: Obtain the user-reported emails within a set time period to obtain a user-reported email set.
Specifically, the user-reported emails within the set time period can be obtained from the user behavior log and placed into a user-reported email set as its elements. The set time period is chosen by those skilled in the art.
Further, the user-reported emails within the set time period can also be obtained periodically, so that the IDF dictionary is updated periodically. That is, each time an update cycle is reached, the user-reported emails within the set time period are obtained from the user behavior log to form the user-reported email set for the current update cycle. The update cycle is chosen by those skilled in the art and may be the same as or different from the set time period. For example, suppose the update cycle is two weeks and the set time period is one week. If, on the Monday of the current week, the user-reported emails from the Monday to the Sunday of last week (one week) are obtained and the IDF dictionary is updated, then on the Monday of the week after next, the user-reported emails from the Monday to the Sunday of next week (one week) can be obtained and the IDF dictionary updated again, so that the IDF dictionary is updated periodically.
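The two-week cycle and one-week window in the example above can be sketched with plain date arithmetic. The function names `report_window` and `update_days`, and the concrete dates, are hypothetical and chosen only to mirror the example; the source does not prescribe how the schedule is computed.

```python
from datetime import date, timedelta


def report_window(update_day, window_days=7):
    """Return (start, end) of the window that ends the day before update_day."""
    end = update_day - timedelta(days=1)
    start = end - timedelta(days=window_days - 1)
    return start, end


def update_days(first_update, cycle_days, count):
    """Return the first `count` update days, spaced one update cycle apart."""
    return [first_update + timedelta(days=cycle_days * i) for i in range(count)]


# Two-week update cycle, one-week window, starting on a Monday.
days = update_days(date(2014, 3, 17), cycle_days=14, count=2)
windows = [report_window(day) for day in days]
```

With these values, the first update on Monday 17 March 2014 covers 10 to 16 March, and the second update on Monday 31 March covers 24 to 30 March, so the week in between is not sampled, as in the example.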
S102: For each user-reported email in the user-reported email set, determine the retained words of that email.
Specifically, for each user-reported email in the set, denoising and word segmentation are performed on its mail text to obtain its word set, and the part-of-speech information of each word in the word set is labeled; after the stop words are removed from the word set, the words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table are taken as the retained words of that email.
The method of determining the retained words of any user-reported email in the set is described in detail below with reference to Fig. 1b.
S103: For each user-reported email in the user-reported email set, count the frequency with which each retained word of that email occurs across the user-reported emails in the set.
Specifically, the user-reported email set contains a number of emails. A retained word of one email may occur more than once in that email, and may also occur in other emails. The numbers of times the retained word occurs in each email of the set are therefore added up to obtain the frequency with which the retained word occurs across the user-reported emails in the set.
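The accumulation in S103 can be sketched with a counter. The function name `corpus_frequencies` and the input shape (one list of retained words per email, with repeats kept) are assumptions made for illustration.

```python
from collections import Counter


def corpus_frequencies(retained_words_per_email):
    """Sum, over all emails in the set, how many times each retained word occurs."""
    totals = Counter()
    for words in retained_words_per_email:
        totals.update(words)  # repeats inside one email are counted each time
    return totals


emails = [
    ["discount", "discount", "invoice"],
    ["invoice", "lottery"],
]
frequencies = corpus_frequencies(emails)
```

Here "discount" occurs twice in one email and "invoice" once in each of two emails, so both end up with a frequency of 2.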
S104: For each retained word whose counted frequency is below a set threshold, calculate the IDF value of the retained word in the user-reported email set, and record the retained word together with its IDF value in the IDF dictionary.
Specifically, in the anti-spam field, if a retained word occurs with a high frequency across the user-reported emails in the set, it is generally considered unrepresentative, whereas retained words with a lower frequency are very likely to become candidate rule words for anti-spam filtering rules and help operations personnel mine and discover such rules. Therefore, in this step, the retained words whose counted frequency is below the set threshold are recorded in the IDF dictionary; for each such retained word, its IDF value in the user-reported email set is also calculated, and the IDF value is recorded in the IDF dictionary in correspondence with the retained word.
The IDF value of a retained word in the user-reported email set is calculated as follows: the total number of user-reported emails in the set is divided by the number of user-reported emails containing the retained word, and the logarithm of the quotient is the IDF value of the retained word.
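Steps S103 and S104 together can be sketched as a single function that builds the IDF dictionary. This is an illustration under assumptions: the source does not state the logarithm base (the natural logarithm is used here), and the names `build_idf_dictionary` and `max_frequency` are hypothetical.

```python
import math
from collections import Counter


def build_idf_dictionary(retained_words_per_email, max_frequency):
    """Map each sufficiently rare retained word to log(N / emails containing it)."""
    total_emails = len(retained_words_per_email)
    occurrences = Counter()  # total occurrences across the set (S103)
    containing = Counter()   # number of emails that contain the word
    for words in retained_words_per_email:
        occurrences.update(words)
        containing.update(set(words))
    return {
        word: math.log(total_emails / containing[word])  # natural log: an assumption
        for word, frequency in occurrences.items()
        if frequency < max_frequency                     # keep only words below the threshold
    }


emails = [
    ["discount", "discount", "invoice"],
    ["invoice", "lottery"],
    ["invoice"],
    ["prize"],
]
idf_dictionary = build_idf_dictionary(emails, max_frequency=3)
```

In this toy set "invoice" occurs three times and is dropped by the threshold, while "discount", "lottery", and "prize" each appear in one of four emails and receive the IDF value log(4).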
In practical applications, for any user-reported email A in the user-reported email set, the flow of the method of determining the retained words of email A is shown in Fig. 1b and includes the following steps:
S111: Denoise the mail text of user-reported email A.
Specifically, the mail text of a user-reported email often differs from ordinary text or short text, because its sender usually adds noise to it in various ways, for example special characters such as "☆" and "◆". Therefore, in this step, the special characters, spaces, punctuation marks, and the like are removed from the mail text of email A to obtain mail text containing only words, that is, the denoised mail text of email A. The mail text of email A can include its subject text and its body text.
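A minimal sketch of the denoising in S111, assuming a single regular expression is enough to drop symbols, punctuation, and whitespace. The function name `denoise` is hypothetical. Note two consequences of this particular pattern: digits are kept, and removing spaces suits Chinese text but would merge words in a space-delimited language.

```python
import re

# Matches runs of anything that is not a Unicode letter or digit
# (symbols, punctuation, whitespace), plus the underscore.
_NOISE = re.compile(r"[\W_]+")


def denoise(subject, body):
    """Concatenate subject and body text and strip noise characters."""
    return _NOISE.sub("", subject + body)


clean = denoise("☆特价◆", " 全场 折扣!!")
```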
S112: Segment the denoised mail text of user-reported email A to obtain its word set, and label the part-of-speech information of each word in the word set.
Specifically, the denoised mail text of email A is segmented with the CRF (conditional random field) algorithm, an intelligent word segmentation algorithm; that is, the continuous character sequence in the denoised mail text is divided into individual words.
In addition, based on the CRF algorithm, the part-of-speech information of each word in the word set of email A can be labeled. For example, the part-of-speech information of "discount" is labeled as noun.
S113: After removing the stop words from the word set of user-reported email A, take the words whose part-of-speech information matches the part-of-speech information recorded in the part-of-speech information table as the retained words of the email.
Specifically, the stop words in the word set of email A are removed according to the stop words recorded in a stop word list; that is, words in the mail text of email A that carry no substantive meaning are deleted.
In addition, those skilled in the art can, based on experience, record in the part-of-speech information table the part-of-speech information that helps in upgrading anti-spam filtering rules, such as nouns, verbs, adverbial idioms, and distinguishing-word idioms. In this step, for each word remaining in the word set of email A after stop word removal, if the part-of-speech information of the word is recorded in the part-of-speech information table, the word is determined to be a retained word of email A. For example, suppose that after stop word removal the word set of email A contains the word "discount", whose part-of-speech information is noun; if noun is recorded in the part-of-speech information table, "discount" is determined to be a retained word of email A.
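The filtering in S113 can be sketched as one pass over tagged words. The function name `retained_words`, the `(word, part_of_speech)` tuple shape, and the English sample data are assumptions for illustration; in practice the tags would come from the segmenter used in S112.

```python
def retained_words(tagged_words, stop_words, allowed_pos):
    """Drop stop words, then keep only words whose part of speech is in the table."""
    return [
        word
        for word, pos in tagged_words
        if word not in stop_words and pos in allowed_pos
    ]


tagged = [("the", "article"), ("discount", "noun"), ("buy", "verb"), ("very", "adverb")]
kept = retained_words(tagged, stop_words={"the"}, allowed_pos={"noun", "verb"})
```

Here "the" is removed as a stop word and "very" is dropped because its part of speech is not in the table, leaving "discount" and "buy".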
Based on the IDF dictionary determined in advance as described above, the flow of the method provided by the embodiment of the invention for determining the rule words in the anti-spam filtering rule is shown in Fig. 2 and includes the following steps:
S201: Perform denoising and word segmentation on the mail text of the currently obtained user-reported email to obtain its word set.
Specifically, special characters, punctuation marks, spaces, and the like are removed from the mail text of the currently obtained user-reported email, and the mail text is segmented with the conditional random field (CRF) algorithm.
S202: Determine the words in the word set of the currently obtained user-reported email that match words in the IDF dictionary as the effective words of that email.
Specifically, for each word in the word set of the currently obtained user-reported email, the IDF dictionary is searched to check whether the word is recorded there; if so, the word is taken as an effective word of the email; otherwise, it is treated as an invalid word of the email.
S203: For each valid word of the currently obtained user-reported mail, calculate the weight of the valid word.
Specifically, for each valid word of the currently obtained user-reported mail, the TF value of the valid word in the mail text of that mail is counted, and the weight of the valid word is calculated from the counted TF value and the IDF value of the valid word in the IDF dictionary. Typically, the counted TF value is multiplied by the IDF value recorded for the valid word in the IDF dictionary, and the product is taken as the weight of the valid word.
The TF value of a valid word in the mail text of a user-reported mail is the ratio of the number of times the valid word occurs in the mail text to the total number of valid words in the mail text.
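The weight calculation of step S203 can be illustrated as follows. The function name `word_weights` and the sample IDF values are assumptions; the TF definition (occurrences divided by the total number of valid words) and the TF × IDF product follow the text above.

```python
from collections import Counter


def word_weights(valid_words, idf_dict):
    """Return {word: TF * IDF} for the valid words of one mail.

    TF is the number of occurrences of the word divided by the total
    number of valid-word occurrences in the mail text.
    """
    counts = Counter(valid_words)
    total = len(valid_words)
    return {w: (c / total) * idf_dict[w] for w, c in counts.items()}


weights = word_weights(
    ["discount", "invoice", "discount", "discount"],
    {"discount": 2.0, "invoice": 4.0},  # made-up IDF values
)
```

In this example "discount" has TF 3/4 and weight 1.5, while "invoice" has TF 1/4 and weight 1.0. A mail with no valid words yields an empty result, because the comprehension never divides when `counts` is empty.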
S204: Sort the calculated weights of the valid words in descending order, and take the preset number of top-ranked valid words as the signature words of the currently obtained user-reported mail.
For a given valid word, a larger TF value in the mail text of the currently obtained user-reported mail means the word occurs more frequently in that mail, and a larger IDF value in the user-reported mail set means the word occurs less frequently across the set. Therefore, a valid word with a large weight (that is, the product of its TF and IDF values) reflects the characteristics of the currently obtained user-reported mail well and, correspondingly, its substantive content.
The preset number is set by those skilled in the art and may be, for example, 5 or 10.
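Selecting the signature words in step S204 is a sort followed by a slice. In the sketch below, breaking ties alphabetically is an assumption added to make the result deterministic; the text does not say how equal weights are ordered.

```python
def signature_words(weights, n=5):
    """Return the n words with the largest weights, in descending weight order.

    Ties are broken alphabetically (an assumption, not stated in the source).
    """
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


top = signature_words(
    {"discount": 1.5, "invoice": 1.0, "click": 0.2, "free": 1.0}, n=3
)
```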
S205: Upgrade the rule words in the anti-spam filtering rule according to the obtained signature words.
Specifically, those signature words of the currently obtained user-reported mail that differ from the rule words already in the anti-spam filtering rule may be added as new rule words, thereby upgrading the anti-spam filtering rule.
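Step S205 can be sketched as a set update. Representing the rule words as a Python `set` and returning the list of newly added words are illustrative choices, not details taken from the patent.

```python
def upgrade_rule_words(rule_words, signature_words):
    """Add signature words not yet present in rule_words; return the new ones."""
    new_words = [w for w in signature_words if w not in rule_words]
    rule_words.update(new_words)
    return new_words


rules = {"free", "winner"}
added = upgrade_rule_words(rules, ["discount", "free", "invoice"])
```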
Alternatively, operations staff may view the signature words of a user-reported mail directly to learn its substantive content, manually select rule words for the anti-spam filtering rule from those signature words, and then upgrade the rule. Even in this manual mode, the workload of the operations staff is greatly reduced, because they no longer need to read the full text of each user-reported mail, which saves considerable manpower. Moreover, because the present invention derives the signature words of user-reported mails automatically, different operations staff can directly view the data they need, which facilitates data sharing.
Preferably, after obtaining the signature words of the currently obtained user-reported mail, the present invention may also sort the signature words by initial letter to form the signature word vector of that mail, and compare this newly formed signature word vector with the signature word vectors of previously obtained user-reported mails recorded in a cache. If it differs from all of them, the newly formed signature word vector is recorded in the cache; if it is identical to at least one signature word vector recorded in the cache, it is not recorded. That is, before the current time, each time a signature word vector of a user-reported mail was obtained, it was recorded in the cache in the same way.
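The deduplicating cache described above can be sketched with a set of tuples. Two assumptions are made here: a full lexicographic sort is used as a stricter form of sorting by initial letter, and the cache is an in-memory `set`.

```python
def signature_vector(signature_words):
    """Sort the signature words to obtain a canonical, hashable vector."""
    return tuple(sorted(signature_words))


def record_if_new(cache, vector):
    """Record the vector unless an identical one is cached; report whether it was added."""
    if vector in cache:
        return False
    cache.add(vector)
    return True


cache = set()
first = record_if_new(cache, signature_vector(["invoice", "discount"]))
second = record_if_new(cache, signature_vector(["discount", "invoice"]))
```

Because both mails yield the same sorted vector, the second call leaves the cache unchanged.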
Correspondingly, when the present invention determines that the number of signature word vectors recorded in the cache has reached a set number, or when each upgrade period arrives, it may upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache; that is, the rule words for the upgrade are determined from the vector elements of the signature word vectors recorded in the cache. The cache is then emptied. The set number and the upgrade period may be set by those skilled in the art.
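One way to express the two upgrade triggers (cache size and elapsed period) is shown below. The class name `SignatureCache`, its parameters, and the injectable `clock` are all hypothetical; merging every vector element into the rule words is one simple reading of "determined from the vector elements".

```python
import time


class SignatureCache:
    """Collects distinct signature word vectors until an upgrade is due."""

    def __init__(self, max_vectors, cycle_seconds, clock=time.monotonic):
        self.max_vectors = max_vectors      # the "set number" of vectors
        self.cycle_seconds = cycle_seconds  # the upgrade period
        self.clock = clock
        self.vectors = set()
        self.last_upgrade = clock()

    def add(self, vector):
        self.vectors.add(vector)

    def upgrade_due(self):
        full = len(self.vectors) >= self.max_vectors
        expired = self.clock() - self.last_upgrade >= self.cycle_seconds
        return full or expired

    def upgrade(self, rule_words):
        """Merge every cached vector element into rule_words, then empty the cache."""
        for vector in self.vectors:
            rule_words.update(vector)
        self.vectors.clear()
        self.last_upgrade = self.clock()
```

Passing the clock in as a parameter keeps the period check testable without waiting for real time to elapse.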
Further, user-reported mails with identical signature word vectors may be grouped into one class, and the user-reported mails of each class stored in association with the corresponding signature word vector. In this way, when operations staff upgrade the rule words manually, they need not check the signature words of every user-reported mail one by one: for a class of user-reported mails sharing a signature word vector, they only need to examine that one vector when considering which rule words to upgrade, which further improves upgrade efficiency.
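Grouping mails by signature word vector can be sketched with a dictionary keyed by the sorted vector. The mail identifiers and the input format (mail id mapped to its signature words) are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_signature(mail_signatures):
    """Map each signature word vector to the ids of the mails that share it."""
    groups = defaultdict(list)
    for mail_id, words in mail_signatures.items():
        groups[tuple(sorted(words))].append(mail_id)
    return dict(groups)


groups = group_by_signature({
    "mail-1": ["invoice", "discount"],
    "mail-2": ["discount", "invoice"],
    "mail-3": ["prize", "lottery"],
})
```

A reviewer would then inspect two vectors instead of three mails.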
Based on the above method for determining rule words in an anti-spam filtering rule, Fig. 3 shows the internal structure of the device provided by an embodiment of the present invention for determining rule words in an anti-spam filtering rule. The device comprises: a word set determining module 301, a valid word determining module 302, a weight calculation module 303, a signature word determining module 304, and a rule upgrade module 305.
The word set determining module 301 is used to denoise and segment the mail text of the currently obtained user-reported mail to obtain the word set of that mail. Specifically, the word set determining module 301 removes special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and segments the mail text into words using the CRF algorithm.
The valid word determining module 302 is used to determine the words in the word set of the currently obtained user-reported mail that match words in the IDF dictionary to be the valid words of that mail.
The weight calculation module 303 is used, for each valid word of the currently obtained user-reported mail determined by the valid word determining module 302, to count the term frequency (TF) value of the valid word in the mail text of that mail, and to calculate the weight of the valid word from the counted TF value and the IDF value of the valid word in the IDF dictionary.
The signature word determining module 304 is used to sort the weights of the valid words calculated by the weight calculation module 303 in descending order, and to take the preset number of top-ranked valid words as the signature words of the user-reported mail.
The rule upgrade module 305 is used to upgrade the rule words in the anti-spam filtering rule according to the signature words obtained by the signature word determining module 304.
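How modules 301 to 305 hand data to one another can be sketched as a single function. This is an illustrative composition only: the function name, the whitespace split standing in for CRF segmentation, the lowercasing, and the alphabetical tie-break are all assumptions.

```python
import re
from collections import Counter


def upgrade_from_mail(mail_text, idf_dict, rule_words, top_n=5):
    """Run one reported mail through the five modules and update rule_words."""
    # Module 301: denoise and segment (whitespace split stands in for CRF).
    words = re.sub(r"[^\w]+|_+", " ", mail_text).lower().split()
    # Module 302: keep only words recorded in the IDF dictionary.
    valid = [w for w in words if w in idf_dict]
    # Module 303: weight = TF * IDF, with TF relative to the valid-word total.
    counts = Counter(valid)
    weights = {w: c / len(valid) * idf_dict[w] for w, c in counts.items()}
    # Module 304: the top_n words by descending weight are the signature words.
    signature = sorted(weights, key=lambda w: (-weights[w], w))[:top_n]
    # Module 305: signature words not yet in the rule become new rule words.
    added = [w for w in signature if w not in rule_words]
    rule_words.update(added)
    return signature, added


rules = {"discount"}
signature, added = upgrade_from_mail(
    "Discount! Discount on every invoice, claim your prize now",
    {"discount": 1.0, "invoice": 3.0, "prize": 2.0},  # made-up IDF values
    rules,
    top_n=2,
)
```

In this example "invoice" ranks first with weight 0.75, and "discount" and "prize" tie at 0.5, so the alphabetical tie-break places "discount" second.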
Further, the signature word determining module 304 is also used, after taking the preset number of top-ranked valid words as the signature words of the user-reported mail, to sort the obtained signature words by initial letter to form the signature word vector of that mail; to compare the newly formed signature word vector with the signature word vectors of previously obtained user-reported mails recorded in the cache; and to record the newly formed signature word vector in the cache if it differs from all of them, or to leave it unrecorded if it is identical to one of them.
Correspondingly, the rule upgrade module 305 is specifically used, when it determines that the number of signature word vectors recorded in the cache has reached the set number or when each upgrade period arrives, to upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then to empty the cache.
Further, the above device for determining rule words in an anti-spam filtering rule may also include an IDF dictionary determining module 306.
The IDF dictionary determining module 306 is used to obtain the user-reported mails within a set time period to obtain a user-reported mail set; for each user-reported mail in the set, to denoise and segment the mail text to obtain the word set of that mail, and to tag the part of speech of each word in the word set; after removing the stop words from the word set, to take the words whose part of speech matches a part of speech recorded in the part-of-speech information table as the retained words of that mail; then, for each user-reported mail in the set, to count the frequency with which each retained word of that mail occurs in the user-reported mails of the set; and, for each retained word whose counted frequency is below a set threshold, to calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
Specifically, Fig. 4 shows the internal structure of the IDF dictionary determining module 306, which comprises: a user-reported mail acquisition unit 401, a retained word determining unit 402, a frequency statistics unit 403, and an IDF dictionary determining unit 404.
The user-reported mail acquisition unit 401 is used, when each update period arrives, to obtain the user-reported mails within the set time period to obtain the user-reported mail set.
The retained word determining unit 402 is used, for each user-reported mail in the user-reported mail set, to denoise and segment the mail text to obtain the word set of that mail, and to tag the part of speech of each word in the word set; and, after removing the stop words from the word set, to take the words whose part of speech matches a part of speech recorded in the part-of-speech information table as the retained words of that mail.
The frequency statistics unit 403 is used, for each user-reported mail in the user-reported mail set, to count the frequency with which each retained word of that mail occurs in the user-reported mails of the set.
The IDF dictionary determining unit 404 is used, for each retained word whose frequency as counted by the frequency statistics unit 403 is below the set threshold, to calculate the IDF value of the retained word in the user-reported mail set and record the retained word together with its IDF value in the IDF dictionary.
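The work of units 403 and 404 can be sketched as follows. Two assumptions are labeled here because this passage does not spell them out: "frequency" is read as document frequency (the number of mails in the set that contain the word), and the IDF value uses the common definition log(N / df), where N is the number of mails in the set. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def build_idf_dict(retained_words_per_mail, max_doc_freq):
    """Build {word: IDF} from the retained words of each mail in the set.

    Words whose document frequency reaches max_doc_freq are left out,
    mirroring the "frequency below a set threshold" condition.
    IDF = log(N / df) is an assumed, commonly used definition.
    """
    n_mails = len(retained_words_per_mail)
    doc_freq = Counter()
    for words in retained_words_per_mail:
        doc_freq.update(set(words))  # count each word once per mail
    return {
        word: math.log(n_mails / df)
        for word, df in doc_freq.items()
        if df < max_doc_freq
    }


idf = build_idf_dict(
    [["discount", "invoice"], ["discount", "prize"],
     ["discount", "invoice", "invoice"], ["lottery"]],
    max_doc_freq=3,
)
```

In this example "discount" appears in three of the four mails and is excluded by the threshold, while the rarer words receive larger IDF values.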
In the technical solution of the present invention, the mail text of the currently obtained user-reported mail is denoised and segmented to obtain the word set of the mail; after the valid words of the mail are determined according to the IDF dictionary, the TF value of each valid word in the mail text is counted, and the weight of each valid word is obtained using the IDF values recorded in the IDF dictionary; according to these weights, the signature words of the user-reported mail are determined from among the valid words, for use in selecting the rule words of the anti-spam filtering rule. Because the present invention determines the signature words of user-reported mails automatically and upgrades the rule words in the anti-spam filtering rule according to the obtained signature words, operations staff need not query user-reported mails from a large volume of user behavior logs, nor review the full text of each queried mail one by one. This reduces manual workload, saves manpower, and improves the efficiency of upgrading the anti-spam filtering rule. Moreover, the automatic upgrade can extract signature words from every user-reported mail in the user behavior log, so data is unlikely to be missed; this improves the effectiveness of the anti-spam filtering rule as far as possible and further benefits anti-spam work.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
- 1. An anti-spam filtering rule upgrade method, characterized by comprising: denoising and segmenting the mail text of a currently obtained user-reported mail to obtain the word set of the user-reported mail; determining the words in the word set that match words in an inverse document frequency (IDF) dictionary to be valid words of the user-reported mail; for each valid word of the user-reported mail, counting the term frequency (TF) value of the valid word in the mail text of the user-reported mail, and calculating the weight of the valid word according to the counted TF value and the IDF value of the valid word in the IDF dictionary; sorting the calculated weights of the valid words in descending order, and taking a set number of top-ranked valid words as signature words of the user-reported mail; and upgrading the rule words in the anti-spam filtering rule according to the obtained signature words; wherein the IDF dictionary is predetermined, and the method for determining the IDF dictionary comprises: obtaining the user-reported mails within a set time period to obtain a user-reported mail set; for each user-reported mail in the user-reported mail set, denoising and segmenting the mail text of the user-reported mail to obtain the word set of the user-reported mail, and tagging the part of speech of each word in the word set; after removing the stop words from the word set of the user-reported mail, determining the words in the word set whose part of speech matches a part of speech recorded in a part-of-speech information table to be retained words of the user-reported mail; thereafter, for each user-reported mail in the user-reported mail set, counting the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set; and, for each retained word whose counted frequency is below a set threshold, calculating the IDF value of the retained word in the user-reported mail set and recording the retained word and its IDF value correspondingly in the IDF dictionary.
- 2. The method according to claim 1, characterized in that obtaining the user-reported mails within a set time period to obtain a user-reported mail set is specifically: when each update period arrives, obtaining the user-reported mails within the set time period to obtain the user-reported mail set.
- 3. The method according to claim 1, characterized in that, after taking the set number of top-ranked valid words as the signature words of the user-reported mail, the method further comprises: sorting the obtained signature words by initial letter to form the signature word vector of the user-reported mail; comparing the newly formed signature word vector of the user-reported mail with the signature word vectors of previously obtained user-reported mails recorded in a cache; and, if they differ, recording the newly formed signature word vector of the user-reported mail in the cache; and wherein upgrading the rule words in the anti-spam filtering rule according to the obtained signature words specifically comprises: when the number of signature word vectors recorded in the cache reaches a set number or when each upgrade period arrives, upgrading the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then emptying the cache.
- 4. The method according to any one of claims 1-3, characterized in that denoising and segmenting the mail text of the currently obtained user-reported mail specifically comprises: removing special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and segmenting the mail text of the currently obtained user-reported mail using a conditional random field (CRF) algorithm.
- 5. An anti-spam filtering rule upgrade device, characterized by comprising: a word set determining module, used to denoise and segment the mail text of a currently obtained user-reported mail to obtain the word set of the user-reported mail; a valid word determining module, used to determine the words in the word set that match words in an inverse document frequency (IDF) dictionary to be valid words of the user-reported mail; a weight calculation module, used, for each valid word of the user-reported mail, to count the term frequency (TF) value of the valid word in the mail text of the user-reported mail, and to calculate the weight of the valid word according to the counted TF value and the IDF value of the valid word in the IDF dictionary; a signature word determining module, used to sort the weights of the valid words calculated by the weight calculation module in descending order, and to take a set number of top-ranked valid words as signature words of the user-reported mail; a rule upgrade module, used to upgrade the rule words in the anti-spam filtering rule according to the signature words obtained by the signature word determining module; and an IDF dictionary determining module, used to obtain the user-reported mails within a set time period to obtain a user-reported mail set; for each user-reported mail in the user-reported mail set, to denoise and segment the mail text of the user-reported mail to obtain the word set of the user-reported mail, and to tag the part of speech of each word in the word set; after removing the stop words from the word set of the user-reported mail, to determine the words in the word set whose part of speech matches a part of speech recorded in a part-of-speech information table to be retained words of the user-reported mail; thereafter, for each user-reported mail in the user-reported mail set, to count the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set; and, for each retained word whose counted frequency is below a set threshold, to calculate the IDF value of the retained word in the user-reported mail set and record the retained word and its IDF value correspondingly in the IDF dictionary.
- 6. The device according to claim 5, characterized in that the IDF dictionary determining module specifically comprises: a user-reported mail acquisition unit, used, when each update period arrives, to obtain the user-reported mails within the set time period to obtain the user-reported mail set; a retained word determining unit, used, for each user-reported mail in the user-reported mail set, to denoise and segment the mail text of the user-reported mail to obtain the word set of the user-reported mail, and to tag the part of speech of each word in the word set, and, after removing the stop words from the word set of the user-reported mail, to determine the words in the word set whose part of speech matches a part of speech recorded in the part-of-speech information table to be retained words of the user-reported mail; a frequency statistics unit, used, for each user-reported mail in the user-reported mail set, to count the frequency with which each retained word of the user-reported mail occurs in the user-reported mails of the user-reported mail set; and an IDF dictionary determining unit, used, for each retained word whose frequency as counted by the frequency statistics unit is below the set threshold, to calculate the IDF value of the retained word in the user-reported mail set and record the retained word and its IDF value correspondingly in the IDF dictionary.
- 7. The device according to claim 5, characterized in that the signature word determining module is also used, after taking the set number of top-ranked valid words as the signature words of the user-reported mail, to sort the obtained signature words by initial letter to form the signature word vector of the user-reported mail; to compare the newly formed signature word vector of the user-reported mail with the signature word vectors of previously obtained user-reported mails recorded in a cache; and, if they differ, to record the newly formed signature word vector of the user-reported mail in the cache; and the rule upgrade module is specifically used, when it determines that the number of signature word vectors recorded in the cache has reached a set number or when each upgrade period arrives, to upgrade the rule words in the anti-spam filtering rule according to the signature word vectors recorded in the cache, and then to empty the cache.
- 8. The device according to any one of claims 5-7, characterized in that the word set determining module is specifically used to remove special characters, punctuation marks, and spaces from the mail text of the currently obtained user-reported mail, and to segment the mail text of the currently obtained user-reported mail using a conditional random field (CRF) algorithm, to obtain the word set of the user-reported mail.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410102982.5A CN103902673B (en) | 2014-03-19 | 2014-03-19 | Anti-spam filtering rule upgrade method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103902673A CN103902673A (en) | 2014-07-02 |
CN103902673B true CN103902673B (en) | 2017-11-24 |
Family
ID=50993995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410102982.5A Active CN103902673B (en) | 2014-03-19 | 2014-03-19 | Anti-spam filtering rule upgrade method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103902673B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460551B (en) * | 2018-10-29 | 2023-04-18 | 北京知道创宇信息技术股份有限公司 | Signature information extraction method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101106539A (en) * | 2007-08-03 | 2008-01-16 | 浙江大学 | Filtering method for spam based on supporting vector machine |
CN101227435A (en) * | 2008-01-28 | 2008-07-23 | 浙江大学 | Method for filtering Chinese junk mail based on Logistic regression |
CN101315624A (en) * | 2007-05-29 | 2008-12-03 | 阿里巴巴集团控股有限公司 | Text subject recommending method and device |
CN103049501A (en) * | 2012-12-11 | 2013-04-17 | 上海大学 | Chinese domain term recognition method based on mutual information and conditional random field model |
CN103473218A (en) * | 2013-09-04 | 2013-12-25 | 盈世信息科技(北京)有限公司 | Email classification method and email classification device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130006996A1 (en) * | 2011-06-22 | 2013-01-03 | Google Inc. | Clustering E-Mails Using Collaborative Information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right | Effective date of registration: 20230417; Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193; Patentee after: Sina Technology (China) Co.,Ltd.; Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor; Patentee before: Sina.com Technology (China) Co.,Ltd. |