CN1922837A - Method and device for filtrating rubbish E-mail based on similarity measurement - Google Patents

Publication number
CN1922837A
Authority
CN
China
Prior art keywords
data
document
email message
spam messages
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200480017663.9A
Other languages
Chinese (zh)
Inventor
马特·格勒森
大卫·赫格斯塔特
桑迪·詹森
埃里·曼特尔
阿特·麦德拉
肯·施奈德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Brightmail Inc
Original Assignee
Brightmail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Brightmail Inc
Priority claimed from PCT/US2004/015383 (WO2004105332A2)
Publication of CN1922837A

Abstract

A method and system for filtering email spam based on similarity measures are described. In one embodiment, the method includes receiving an incoming email message, generating data characterizing the incoming email message based on the content of the incoming email message, and comparing the generated data with a set of data characterizing spam messages. The method further includes determining whether a resemblance between the data characterizing the incoming email message and any data item from the set of data characterizing spam messages exceeds a threshold.

Description

Method and apparatus for filtering email spam based on similarity measures
Related application
This application claims priority to U.S. Provisional Application Serial No. 60/471,242, filed on May 15, 2003, which is incorporated herein in its entirety.
Technical field
The present invention relates to filtering electronic mail (email); more specifically, the present invention relates to filtering email spam based on similarity measures.
Background
The Internet is growing in popularity, and more and more people conduct business over the Internet, advertising their products and services by generating and sending electronic mass mailings. These electronic messages (emails) are usually unsolicited and are regarded as a nuisance by recipients, because they occupy much of the storage space needed for necessary and important data processing. For example, a mail server may have to reject important and/or desired emails when its storage capacity is filled to the maximum with unwanted emails containing advertisements. Moreover, thin client systems such as set-top boxes, PDAs, network computers, and pagers all have limited storage capacity. Unwanted emails in any one of these systems can tie up the user's limited resources. In addition, a typical user wastes time downloading voluminous but useless advertisements. These unwanted emails are commonly referred to as spam.
Presently, there are products capable of filtering out unwanted messages. For example, one spam-blocking method keeps an index list of all spam agents (i.e., companies that generate mass unsolicited emails) and provides means to block any email sent from a company on the list.
Another "junk mail" filter currently available employs filters based on predefined words and patterns, as mentioned above. An incoming mail is designated as unwanted if its subject contains a known spam pattern.
However, as spam filtering grows in sophistication, so do the techniques spammers use to avoid the filters. Examples of tactics recently adopted by spammers include randomization, origin concealment, and filter evasion using HTML.
Summary of the invention
The present invention describes a method and system for filtering email spam based on similarity measures. According to one aspect, the method includes receiving an incoming email message, generating data characterizing the incoming email message based on the content of the incoming email message, and comparing the generated data with a set of data characterizing spam messages. The method further includes determining whether a resemblance between the data characterizing the incoming email message and any data item from the set of data characterizing spam messages exceeds a threshold.
Other features of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
Brief description of the drawings
The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention. These embodiments, however, should not be taken to limit the invention to the specific embodiments; they are for explanation and understanding only.
Fig. 1 is a block diagram of one embodiment of a system for controlling delivery of spam email.
Fig. 2 is a block diagram of one embodiment of a spam content preparation module.
Fig. 3 is a block diagram of one embodiment of a similarity determination module.
Fig. 4 is a flow chart of one embodiment of a process for handling a spam message.
Fig. 5 is a flow chart of one embodiment of a process for filtering email spam based on similarity measures.
Fig. 6A is a flow chart of one embodiment of a process for creating a signature of an email message.
Fig. 6B is a flow chart of one embodiment of a process for detecting spam using a signature of an email message.
Fig. 7 is a flow chart of one embodiment of a process for a character-based comparison of documents.
Fig. 8 is a flow chart of one embodiment of a process for determining whether two documents are similar.
Fig. 9 is a flow chart of one embodiment of a process for reducing noise in an email message.
Fig. 10 is a flow chart of one embodiment of a process for modifying an email message to reduce noise.
Fig. 11 is a block diagram of an exemplary computer system.
Detailed description
A method and apparatus for filtering email spam based on similarity measures are described. In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
Some portions of the detailed description that follows are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as is apparent from the following discussion, discussions throughout the description that use terms such as "processing", "computing", "calculating", "determining", or "displaying" refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates data represented as physical (electronic) quantities within the computer system's registers and memories and transforms it into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; and electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
Filtering email spam based on similarity measures
Fig. 1 is a block diagram of one embodiment of a system for controlling delivery of spam electronic mail (email). The system includes a control center 102 coupled to a communications network 100 such as a public network (e.g., the Internet, a wireless network, etc.) or a private network (e.g., a LAN, an intranet, etc.). The control center 102 communicates with multiple network servers 104 via the network 100. Each server 104 communicates with user terminals 106 using a private or public network.
The control center 102 is an anti-spam facility that is responsible for analyzing messages identified as spam, developing filtering rules for detecting spam, and distributing the filtering rules to the servers 104. A message may be identified as spam because it was sent by a known spam source (e.g., a source determined using a "spam detector", i.e., an email address specifically selected to make its way into as many spammer mailing lists as possible).
A server 104 may be a mail server that receives and stores messages addressed to users of corresponding user terminals. Alternatively, a server 104 may be a different server coupled to the mail server 104. Servers 104 are responsible for filtering incoming messages based on the filtering rules received from the control center 102.
In one embodiment, the control center 102 includes a spam content preparation module 108 that is responsible for generating data characterizing the content associated with a spam attack and sending this data to the servers 104. Each server 104 includes a similarity determination module 110 that is responsible for storing the spam data received from the control center 102 and using the stored data to identify incoming email messages that resemble the spam content.
In an alternative embodiment, each server 104 hosts both the spam content preparation module 108, which generates data characterizing the content associated with a spam attack, and the similarity determination module 110, which uses the generated data to identify email messages that resemble the spam content.
Fig. 2 is a block diagram of one embodiment of a spam content preparation module 200. The spam content preparation module 200 includes a spam content parser 202, a spam data generator 206, and a spam data transmitter 208.
The spam content parser 202 is responsible for parsing the body of email messages resulting from spam attacks (referred to as spam messages).
The spam data generator 206 is responsible for generating data characterizing a spam message. In one embodiment, the data characterizing the spam message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the spam message. Data characterizing a spam message or any other email message is referred to herein as a message signature. Signatures of spam messages or any other email messages may contain various data identifying the message content and may be created using various algorithms that allow similarity measures to be used when comparing signatures of different email messages.
In one embodiment, the spam content preparation module 200 also includes a noise reduction algorithm 204 that is responsible for detecting data indicative of noise and removing the noise from spam messages before their signatures are generated. Noise represents data that is invisible to the recipient and that was added to a spam message to hide its spam nature.
In one embodiment, the spam content preparation module 200 also includes a message grouping algorithm (not shown) that is responsible for grouping messages originating from a single spam attack. Grouping may be performed based on specified characteristics of the spam messages (e.g., included URLs, message parts, etc.). If grouping is used, the spam data generator 206 may generate a signature for a group of spam messages rather than for each individual message.
The spam data transmitter 208 is responsible for distributing signatures of spam messages to participating servers, such as the servers 104 of Fig. 1. In one embodiment, each server 104 periodically (e.g., every 5 minutes) initiates a connection (e.g., a secure HTTPS connection) with the control center 102. Using this pull-based connection, signatures are transmitted from the control center 102 to the relevant server 104.
Fig. 3 is a block diagram of one embodiment of a similarity determination module 300. The similarity determination module 300 includes an incoming message parser 302, a spam data receiver 306, a message data generator 310, a similarity identifier 312, and a spam database 304.
The incoming message parser 302 is responsible for parsing the body of incoming email messages.
The spam data receiver 306 is responsible for receiving signatures of spam messages and storing them in the spam database 304.
The message data generator 310 is responsible for generating signatures of incoming email messages. In some embodiments, a signature of an incoming email message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the incoming email message. In other embodiments, a signature of an incoming email message includes various other data characterizing the content of the email message (e.g., a subset of the token sets composing the incoming email message). As discussed above, signatures of email messages may be created using various algorithms that allow similarity measures to be used when comparing signatures of different email messages.
In one embodiment, the similarity determination module 300 also includes an incoming message cleaning algorithm 308 that is responsible for detecting data indicative of noise and removing the noise from incoming email messages before their signatures are generated, as will be discussed in more detail below.
The similarity identifier 312 is responsible for comparing the signature of each incoming email message with the signatures of the spam messages stored in the spam database 304 and determining, based on this comparison, whether the incoming email message is similar to any spam message.
In one embodiment, the spam database 304 stores signatures generated for spam messages before they undergo the noise reduction process (i.e., noisy spam messages) and signatures generated for these spam messages after they undergo the noise reduction process (i.e., spam messages with reduced noise). In this embodiment, the message data generator 310 first generates a signature of the incoming email message prior to noise reduction, and the similarity identifier 312 compares this signature with the signatures of the noisy spam messages. If this comparison indicates that the incoming email message is similar to one of these spam messages, the similarity identifier 312 marks the incoming email message as spam. Otherwise, the similarity identifier 312 invokes the incoming message cleaning algorithm 308 to remove noise from the incoming email message. The message data generator 310 then generates a signature for the modified incoming message, and the similarity identifier 312 compares it with the signatures of the spam messages with reduced noise.
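The two-stage comparison in this embodiment can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify` and its callable parameters are hypothetical stand-ins for the message data generator 310, the incoming message cleaning algorithm 308, and the similarity identifier 312, none of which the text defines at code level.

```python
from typing import Callable, Iterable


def classify(message: str,
             make_signature: Callable[[str], object],
             remove_noise: Callable[[str], str],
             is_similar: Callable[[object, object], bool],
             noisy_spam_signatures: Iterable[object],
             clean_spam_signatures: Iterable[object]) -> bool:
    """Return True when the incoming message should be marked as spam."""
    # Stage 1: sign the message as received and compare against
    # signatures of noisy spam messages.
    raw_signature = make_signature(message)
    if any(is_similar(raw_signature, s) for s in noisy_spam_signatures):
        return True
    # Stage 2: remove noise, sign the cleaned message, and compare
    # against signatures of spam messages with reduced noise.
    cleaned_signature = make_signature(remove_noise(message))
    return any(is_similar(cleaned_signature, s) for s in clean_spam_signatures)


# Toy stand-ins, purely for demonstration (assumptions, not the patent's algorithms).
def toy_signature(text: str) -> frozenset:
    return frozenset(text.split())


def toy_cleaner(text: str) -> str:
    return text.replace("<!--x-->", "")


def toy_equal(a: object, b: object) -> bool:
    return a == b


result = classify("win cash<!--x--> now", toy_signature, toy_cleaner, toy_equal,
                  [], [toy_signature("win cash now")])
print(result)  # True
```

The cheaper comparison runs first, so noise removal is only paid for when the unmodified message does not already match.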
Fig. 4 is a flow chart of one embodiment of a process 400 for handling a spam message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general-purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at the control center 102 of Fig. 1.
Referring to Fig. 4, process 400 begins with processing logic receiving a spam message (processing block 402).
At processing block 404, processing logic modifies the spam message to reduce noise. One embodiment of a noise reduction algorithm is discussed in more detail below in conjunction with Figs. 9 and 10.
At processing block 406, processing logic generates a signature of the spam message. In one embodiment, the signature of the spam message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the spam message, as discussed in more detail below in conjunction with Fig. 6A. In other embodiments, the signature includes various other data characterizing the content of the email message.
At processing block 408, processing logic transfers the signature of the spam message to a server (e.g., a server 104 of Fig. 1), which uses the signature of the spam message to find incoming email messages resembling the spam message (block 410).
Fig. 5 is a flow chart of one embodiment of a process 500 for filtering email spam based on similarity measures. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general-purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of Fig. 1.
Referring to Fig. 5, process 500 begins with processing logic receiving an incoming email message (processing block 502).
At processing block 504, processing logic modifies the incoming message to reduce noise. One embodiment of a noise reduction algorithm is discussed in more detail below in conjunction with Figs. 9 and 10.
At processing block 506, processing logic generates a signature of the incoming message based on the content of the incoming message. In one embodiment, the signature of the incoming email message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the incoming email message, as discussed in more detail below in conjunction with Fig. 6A. In other embodiments, the signature includes various other data characterizing the content of the email message.
At processing block 508, processing logic compares the signature of the incoming message with signatures of spam messages.
At processing block 510, processing logic determines whether the resemblance between the signature of the incoming message and the signature of some spam message exceeds a threshold similarity measure. One embodiment of a process for determining the resemblance between two messages is discussed in more detail below in conjunction with Fig. 6B.
If the resemblance exceeds the threshold, processing logic marks the incoming email message as spam (processing block 512).
Fig. 6A is a flow chart of one embodiment of a process 600 for creating a signature of an email message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general-purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of Fig. 1.
Referring to Fig. 6A, process 600 begins with processing logic dividing an email message into sets of tokens (processing block 602). Each set of tokens may include a predefined number of sequential units from the email message. The predefined number may be equal to or greater than 1. A unit may represent a character, a word, or a line in the email message. In one embodiment, each set of tokens is combined with the number of occurrences of this set of tokens in the email message.
At processing block 604, processing logic calculates hash values for the sets of tokens. In one embodiment, a hash value is calculated by applying a hash function to each combination of a set of tokens and the corresponding occurrence number.
At processing block 606, processing logic creates a signature for the email message using the calculated hash values. In one embodiment, the signature is created by selecting a subset of the calculated hash values and adding a parameter characterizing the email message to the selected subset. The parameter may specify, for example, the size of the email message, the number of calculated hash values, a keyword associated with the email message, the name of an attachment, etc.
In one embodiment, the signature of an email message is created using a character-based document comparison mechanism, which is discussed in more detail below in conjunction with Figs. 7 and 8.
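Blocks 602 through 606 can be sketched with words as the units. The text does not name a hash function, a subset rule, or a data layout, so the SHA-256 prefix, the "keep the smallest values" rule, and the `MessageSignature` class below are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass


@dataclass
class MessageSignature:
    size: int          # parameter: message size in characters
    token_count: int   # parameter: number of calculated hash values
    hashes: list       # selected subset of hash values


def create_signature(body: str, n: int = 2, keep: int = 8) -> MessageSignature:
    units = body.lower().split()                      # block 602: words as units
    occurrences = Counter()
    hashes = []
    for i in range(len(units) - n + 1):
        group = " ".join(units[i:i + n])              # n sequential units
        occurrences[group] += 1
        key = f"{group}#{occurrences[group]}"         # token set + occurrence number
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))  # block 604 (stand-in hash)
    subset = sorted(hashes)[:keep]                    # block 606: assumed subset rule
    return MessageSignature(len(body), len(hashes), subset)


body = "buy cheap pills now buy cheap pills today"
signature = create_signature(body)
print(signature.token_count)  # 7
```

Combining each token set with its occurrence number means a repeated phrase contributes a new hash value each time, so repetition is reflected in the signature.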
Fig. 6B is a flow chart of one embodiment of a process 650 for detecting spam using a signature of an email message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general-purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of Fig. 1.
Referring to Fig. 6B, process 650 compares data in a signature of an incoming email message with data in signatures of spam messages. The signature data includes a parameter characterizing the content of an email message and a subset of hash values generated for the tokens contained in the email message. The parameter may specify, for example, the size of the email message, the number of tokens in the email message, a keyword associated with the email message, the name of an attachment, etc.
Processing logic begins by comparing a parameter in the signature of the incoming email message with the corresponding parameter in the signature of each spam message (processing block 652).
At decision block 654, processing logic determines whether any spam message signature contains a parameter similar to the parameter of the incoming message signature. Similarity may be determined, for example, based on an allowed difference between the two parameters or an allowed ratio of the two parameters.
If no spam message signature contains a parameter similar to the parameter of the incoming message signature, processing logic decides that the incoming email message is legitimate (i.e., it is not spam) (processing block 662).
Alternatively, if one or more spam message signatures have a similar parameter, processing logic determines whether the signature of the first such spam message has hash values similar to the hash values in the signature of the incoming email (decision block 656). Based on a similarity threshold, hash values may be considered similar if, for example, a certain number of them match or the ratio of matched to unmatched hash values exceeds a specified threshold.
If the first spam message signature has hash values similar to the hash values of the incoming email signature, processing logic decides that the incoming email message is spam (processing block 670). Otherwise, processing logic determines whether there are more spam message signatures with a similar parameter (decision block 658). If so, processing logic determines whether the next spam message signature has hash values similar to the hash values of the incoming email signature (decision block 656). If so, processing logic decides that the incoming email message is spam (processing block 670). If not, processing logic returns to decision block 658.
If processing logic determines that no more spam message signatures have a similar parameter, it decides that the incoming email message is not spam (processing block 662).
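The control flow of process 650 can be sketched as a parameter pre-filter followed by a hash comparison. The dictionary layout, the use of message size as the parameter, and both threshold values are assumptions chosen for illustration; the text leaves them open.

```python
def parameters_similar(a: int, b: int, min_ratio: float = 0.9) -> bool:
    """Blocks 652/654: compare two parameters by their ratio (assumed rule)."""
    if a == 0 or b == 0:
        return a == b
    return min(a, b) / max(a, b) >= min_ratio


def hashes_similar(incoming: list, spam: list, min_match: float = 0.5) -> bool:
    """Block 656: fraction of shared hash values against an assumed threshold."""
    first, second = set(incoming), set(spam)
    if not first or not second:
        return False
    return len(first & second) / max(len(first), len(second)) >= min_match


def is_spam(incoming: dict, spam_signatures: list) -> bool:
    # Keep only spam signatures whose parameter is close to the incoming one.
    candidates = [s for s in spam_signatures
                  if parameters_similar(incoming["size"], s["size"])]
    for candidate in candidates:                      # blocks 656/658 loop
        if hashes_similar(incoming["hashes"], candidate["hashes"]):
            return True                               # block 670
    return False                                      # block 662


spam_signatures = [
    {"size": 1000, "hashes": [11, 22, 33, 44]},
    {"size": 5000, "hashes": [55, 66, 77, 88]},
]
print(is_spam({"size": 980, "hashes": [11, 22, 33, 99]}, spam_signatures))  # True
print(is_spam({"size": 200, "hashes": [11, 22, 33, 44]}, spam_signatures))  # False
```

The second call returns False even though every hash value matches, because the parameter check rejects the candidate before any hash values are compared.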
Character-based document comparison mechanism
Fig. 7 is a flow chart of one embodiment of a process 700 for a character-based comparison of documents. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general-purpose computer system or a dedicated machine), or a combination of both.
Referring to Fig. 7, process 700 begins with processing logic pre-processing a document (processing block 702). In one embodiment, the document is pre-processed by changing each uppercase alphabetic character in the document to a lowercase alphabetic character. For example, the message "I am Sam, Sam I am." may be pre-processed into the expression "i.am.sam.sam.i.am".
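A minimal sketch of the pre-processing step follows. The lowercasing is stated in the text; replacing each run of non-letters with a single period and dropping leading and trailing periods is an assumption inferred from the example output, and the function name `preprocess` is hypothetical.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase the text and collapse each run of non-letters into one period."""
    lowered = text.lower()
    # Assumption: separator handling is inferred from the example, not stated.
    collapsed = re.sub(r"[^a-z]+", ".", lowered)
    return collapsed.strip(".")


print(preprocess("I am Sam, Sam I am."))  # i.am.sam.sam.i.am
```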
At processing block 704, processing logic divides the document into tokens, with each token including a predefined number of sequential characters from the document. In one embodiment, each token is combined with its occurrence number. This combination is referred to as a labeled shingle. For example, if the predefined number of sequential characters in a token is equal to 3, the expression specified above includes the following set of labeled shingles:
i.a1
.am1
am.1
m.s1
.sa1
sam1
am.2
m.s2
.sa2
sam2
am.3
m.i1
.i.1
i.a2
.am2
In one embodiment, the shingles are represented as a histogram.
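The labeled shingles for the expression "i.am.sam.sam.i.am" with three characters per token can be generated with a short sketch like the one below. The function name `labeled_shingles` is hypothetical; the running `Counter` doubles as the histogram of tokens.

```python
from collections import Counter


def labeled_shingles(text: str, k: int = 3) -> list:
    """Slide a k-character window over the text and label each token
    with the number of times it has occurred so far."""
    seen = Counter()
    result = []
    for i in range(len(text) - k + 1):
        token = text[i:i + k]
        seen[token] += 1
        result.append(f"{token}{seen[token]}")
    return result


shingles = labeled_shingles("i.am.sam.sam.i.am")
print(len(shingles))   # 15
print(shingles[6:10])  # ['am.2', 'm.s2', '.sa2', 'sam2']
```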
At processing block 706, processing logic calculates hash values for the tokens. In one embodiment, the hash values are calculated for the labeled shingles. For example, if a hash function H(x) is applied to each labeled shingle illustrated above, the following results are produced:
H(i.a1)->458348732
H(.am1)->200404023
H(am.1)->692939349
H(m.s1)->220443033
H(.sa1)->554034022
H(sam1)->542929292
H(am.2)->629292229
H(m.s2)->702202232
H(.sa2)->322243349
H(sam2)->993923828
H(am.3)->163393269
H(m.i1)->595437753
H(.i.1)->843438583
H(i.a2)->244485639
H(.am2)->493869359
In one embodiment, processing logic then sorts the hash values as follows:
163393269
200404023
220443033
244485639
322243349
458348732
493869359
542929292
554034022
595437753
629292229
692939349
702202232
843438583
993923828
At processing block 708, processing logic selects a subset of hash values from the calculated hash values. In one embodiment, processing logic selects the X smallest values from the sorted hash values and creates from them a "sketch" of the document. For example, for X=4, the sketch can be expressed as follows:
[163393269 200404023 220443033 244485639].
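Selecting the X smallest hash values can be sketched as follows, using the example hash values listed above. The function name `sketch` and the use of `heapq.nsmallest` are illustrative choices; a full sort followed by a slice would give the same result.

```python
import heapq

# Hash values of the fifteen labeled shingles from the example above.
hash_values = [
    458348732, 200404023, 692939349, 220443033, 554034022,
    542929292, 629292229, 702202232, 322243349, 993923828,
    163393269, 595437753, 843438583, 244485639, 493869359,
]


def sketch(values: list, x: int) -> list:
    """Return the x smallest hash values in ascending order."""
    return heapq.nsmallest(x, values)


print(sketch(hash_values, 4))  # [163393269, 200404023, 220443033, 244485639]
```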
At processing block 710, processing logic creates a signature of the document by adding to the sketch a parameter pertaining to the tokens of the document. In one embodiment, the parameter specifies the number of original tokens in the document. In the example above, the number of original tokens is 15. Hence, the signature of the document can be expressed as follows:
[15 163393269 200404023 220443033 244485639].
Alternatively, the parameter may specify any other characteristic of the content of the document (e.g., the size of the document, a keyword associated with the document, etc.).
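Blocks 704 through 710 can be combined into one end-to-end sketch. The hash function H(x) is not specified in the text, so the SHA-1 prefix used here is a stand-in and will not reproduce the example hash values; only the token count in the first position is expected to match the example. The names `stand_in_hash` and `make_signature` are hypothetical.

```python
import hashlib
import heapq
from collections import Counter


def stand_in_hash(shingle: str) -> int:
    """Hypothetical H(x): first 4 bytes of SHA-1 as an unsigned integer."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_signature(text: str, k: int = 3, x: int = 4) -> list:
    """Return [token_count, h1, ..., hx] for a pre-processed document."""
    seen = Counter()
    hashes = []
    for i in range(len(text) - k + 1):
        token = text[i:i + k]
        seen[token] += 1
        hashes.append(stand_in_hash(f"{token}{seen[token]}"))  # labeled shingle
    # Signature = parameter (number of tokens) followed by the sketch.
    return [len(hashes)] + heapq.nsmallest(x, hashes)


signature = make_signature("i.am.sam.sam.i.am")
print(signature[0])  # 15
```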
Fig. 8 is the flow chart that is used for determining an embodiment of the process 800 that two documents are whether similar.This process can be carried out by processing logic, and described processing logic can comprise hardware (for example special logic, FPGA (Field Programmable Gate Array), microcode or the like), software (for example operating on general-purpose computing system or the special purpose machinery) or its combination.
Referring to Fig. 8, process 800 begins with processing logic comparing the numbers of tokens specified in the signatures of documents 1 and 2, and determining whether the number of tokens in the first signature is within the allowed range relative to the number of tokens in the second signature (decision box 802). For example, the allowed range may be a difference of 1 or less, or a ratio of 90% or higher.
If the number of tokens in the first signature is outside the allowed range relative to the number of tokens in the second signature, processing logic decides that documents 1 and 2 are different (processing block 808). Otherwise, if the number of tokens in the first signature is within the allowed range, processing logic determines whether the similarity between the hash values in signatures 1 and 2 exceeds a threshold (e.g., more than 95% of the hash values are the same) (decision box 804). If so, processing logic decides that the two documents are similar (processing block 806). If not, processing logic decides that documents 1 and 2 are different (processing block 808).
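The two-stage comparison of process 800 might look like the following Python sketch. The function name `signatures_similar`, its parameter names, and the choice to measure hash similarity as the shared fraction of the larger hash set are assumptions made for illustration; the text only gives the 90% ratio and 95% overlap as example thresholds.

```python
def signatures_similar(sig1, sig2, min_count_ratio=0.90, min_hash_overlap=0.95):
    """Compare two signatures of the form [token_count, hash_1, ..., hash_X]."""
    count1, hashes1 = sig1[0], set(sig1[1:])
    count2, hashes2 = sig2[0], set(sig2[1:])

    # Decision box 802: is the token-count ratio within the allowed range?
    smaller, larger = sorted((count1, count2))
    if larger == 0 or smaller / larger < min_count_ratio:
        return False  # processing block 808: the documents differ

    # Decision box 804: do enough of the hash values coincide?
    if not hashes1 or not hashes2:
        return False
    overlap = len(hashes1 & hashes2) / max(len(hashes1), len(hashes2))
    return overlap > min_hash_overlap  # block 806 if True, block 808 otherwise
```

Checking the token counts first is cheap and lets most dissimilar pairs be rejected before any hash values are compared.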
Spam filtering using noise reduction
Fig. 9 is a flow diagram of one embodiment of a process 900 for reducing noise in an email message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, a field-programmable gate array (FPGA), microcode, etc.), software (e.g., software running on a general-purpose computer system or a dedicated machine), or a combination of both.
Referring to Fig. 9, process 900 begins with processing logic detecting, in an email message, data indicative of noise (processing block 902). As discussed above, noise represents data that is invisible to the recipient of the email message and that was added to the email message to avoid spam filtering. Such data may include, for example, formatting data (e.g., HTML tags), numeric character references, character entity references, URL data of predefined categories, etc. A numeric character reference specifies the code position of a character in the document character set. A character entity reference uses a symbolic name so that the author need not remember code positions. For example, the character entity reference &aring; refers to the lowercase "a" character topped with a ring.
At processing block 904, processing logic modifies the content of the email message to reduce the noise. In one embodiment, the content modification includes removing formatting data, translating numeric character references and character entity references into their ASCII equivalents, and modifying URL data.
At processing block 906, processing logic compares the content of the modified email message with the content of a spam message. In one embodiment, the comparison is performed to identify an exact match. Alternatively, the comparison is performed to determine whether the two documents are similar.
Fig. 10 is a flow diagram of one embodiment of a process 1000 for modifying an email message to reduce noise. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, a field-programmable gate array (FPGA), microcode, etc.), software (e.g., software running on a general-purpose computer system or a dedicated machine), or a combination of both.
Referring to Fig. 10, process 1000 begins with processing logic searching the email message for formatting data (e.g., HTML tags) (processing block 1002).
At decision box 1004, processing logic determines whether the formatting data found qualifies as an exception. Typically, HTML formatting adds nothing to the information content of a message. However, there are a few exceptions: tags that contain information useful for further processing of the message (e.g., the tags <BODY>, <A>, <IMG>, and <FONT>). For example, the <FONT> and <BODY> tags are needed for "white on white" text elimination, and the <A> and <IMG> tags typically contain link information that may be used for passing data to other components of the system.
If the formatting data does not qualify as an exception, the formatting data is extracted from the email message (processing block 1006).
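As a rough illustration of processing blocks 1002 through 1006, the Python sketch below removes every HTML tag except the four named as exceptions. The names `KEPT_TAGS`, `TAG_PATTERN`, and `strip_formatting` are hypothetical, and the regular expression is a deliberate simplification: it does not handle HTML comments or attribute values that contain ">".

```python
import re

# Tags named in the text as exceptions; every other tag is treated as noise.
KEPT_TAGS = {"body", "a", "img", "font"}

# Matches an opening or closing tag and captures the tag name (simplified).
TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_formatting(html_text):
    """Remove formatting tags that do not qualify as exceptions."""
    def keep_or_drop(match):
        # Keep the whole tag if its name is an exception, otherwise delete it.
        return match.group(0) if match.group(1).lower() in KEPT_TAGS else ""
    return TAG_PATTERN.sub(keep_or_drop, html_text)
```

For example, `strip_formatting('<p><b>Buy</b> <a href="http://x.example">now</a></p>')` keeps the link tags and drops the paragraph and bold tags.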
Next, processing logic converts each numeric character reference and character entity reference into a corresponding ASCII character (processing block 1008).
In HTML, numeric character references may take two forms:
1. The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D; and
2. The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.
For example, randomized characters in a message body may appear as the following expression:
Th&#101&#32&#83a&#118&#105n&#103&#115R&#101&#103is&#116e&#114&#119&#97&#110&#116&#115&#32yo&#117.
This expression has the meaning of the phrase "The SavingsRegister wants you."
The conversion performed at processing block 1008 may sometimes need to be repeated. For example, the string "&#38;" corresponds to the string "&" in ASCII, the string "&#35;" corresponds to the string "#" in ASCII, the string "&#51;" corresponds to 3 in ASCII, the string "&#56;" corresponds to 8 in ASCII, and the string "&#59;" corresponds to the string ";" in ASCII. Hence, when the combined string "&#38;&#35;&#51;&#56;&#59;" is converted, the result is the string "&#38;", which itself needs to be converted.
Accordingly, after the first conversion operation at processing block 1008, processing logic checks whether the converted data still includes numeric character references or character entity references (decision box 1010). If the check is positive, processing logic repeats the conversion operation at processing block 1008. Otherwise, processing logic proceeds to processing block 1012.
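The repeat-until-stable loop of processing block 1008 and decision box 1010 can be sketched with Python's standard `html.unescape`. The function name `decode_references` and the `max_passes` safety limit are assumptions added for illustration. Note also that `html.unescape` decodes to Unicode characters rather than strictly to ASCII.

```python
import html


def decode_references(text, max_passes=10):
    """Decode numeric and named character references until none remain."""
    for _ in range(max_passes):
        decoded = html.unescape(text)   # processing block 1008: one conversion pass
        if decoded == text:             # decision box 1010: nothing left to convert
            break
        text = decoded
    return text


print(decode_references("&#38;&#35;&#51;&#56;&#59;"))  # prints &
```

The nested example from the text needs two productive passes: the first yields "&#38;" and the second yields "&".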
At processing block 1012, processing logic modifies URL data of predefined categories. These categories may include, for example, numeric character references contained in a URL, which processing logic converts into the corresponding ASCII characters. In addition, the URL "password" syntax may be used to add characters before the "@" in the URL hostname. These characters are ignored by the target web server, but they add a significant amount of noise information to each URL. Processing logic modifies the URL data by removing these additional characters. Finally, processing logic removes the "query" part of the URL, which follows the string "?" at the end of the URL.
An example of a URL is as follows:
http%3a%2f%2flotsofjunk@www.foo.com%2fbar.html?muchmorejunk
Processing logic modifies the above URL data into http://www.foo.com/bar.html.
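The URL clean-up of processing block 1012 could be approximated with Python's standard `urllib.parse` module, as in the sketch below. The function name `clean_url` is hypothetical, and decoding the whole URL before splitting it is a simplification: an encoded "?" or "@" would be treated as a real delimiter.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def clean_url(raw_url):
    """Decode escapes, drop the text before "@", and drop the query part."""
    decoded = unquote(raw_url)                # "%3a%2f%2f" becomes "://"
    parts = urlsplit(decoded)
    host = parts.netloc.rpartition("@")[2]    # remove the "password" noise before "@"
    # Rebuild the URL without the query and fragment components.
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


print(clean_url("http%3a%2f%2flotsofjunk@www.foo.com%2fbar.html?muchmorejunk"))
# prints http://www.foo.com/bar.html
```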
Exemplary computer system
Fig. 11 is a block diagram of an exemplary computer system 1100 that may be used to perform one or more of the operations described herein. In alternative embodiments, the machine may comprise a network router, a network switch, a network bridge, a personal digital assistant (PDA), a cellular telephone, a web appliance, or any machine capable of executing a sequence of instructions that specify actions to be taken by that machine.
Computer system 1100 includes a processor 1102, a main memory 1104, and a static memory 1106, which communicate with each other via a bus 1108. Computer system 1100 may further include a video display unit 1110 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). Computer system 1100 also includes an alphanumeric input device 1112 (e.g., a keyboard), a cursor control device 1114 (e.g., a mouse), a disk drive unit 1116, a signal generation device 1120 (e.g., a speaker), and a network interface device 1122.
Disk drive unit 1116 includes a computer-readable medium 1124 on which is stored a set of instructions (i.e., software) 1126 embodying any one or all of the methods described above. Software 1126 is also shown to reside, completely or at least partially, within main memory 1104 and/or within processor 1102. Software 1126 may further be transmitted or received via network interface device 1122. For the purposes of this specification, the term "computer-readable medium" shall be taken to include any medium capable of storing or encoding a sequence of instructions for execution by the computer that causes the computer to perform any one of the methods of the present invention. The term "computer-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic disks, and carrier wave signals.
Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims, which in themselves recite only those features regarded as essential to the invention.

Claims (79)

1. A method comprising:
receiving an email message;
generating data characterizing the email message based on content of the email message;
comparing the data characterizing the email message with a set of data characterizing a plurality of spam messages; and
determining whether a similarity between the data characterizing the email message and any data item within the set of data characterizing the plurality of spam messages exceeds a threshold.
2. The method of claim 1, further comprising:
marking the email message as spam if the similarity between the data characterizing the email message and any data item within the set of data characterizing the plurality of spam messages exceeds the threshold.
3. The method of claim 1, further comprising:
receiving data characterizing a new spam message; and
storing the received data in a database.
4. The method of claim 1, further comprising:
upon receiving the email message, evaluating the email message for the presence of noise added to avoid spam filtering; and
modifying the content of the email message to reduce the noise.
5. The method of claim 4, wherein evaluating the email message for the presence of noise comprises detecting at least one of formatting data, a numeric character reference, a character entity reference, and predefined URL data.
6. The method of claim 1, wherein generating the data characterizing the email message comprises:
dividing the email message into a plurality of tokens; and
calculating a plurality of hash values for the plurality of tokens.
7. The method of claim 6, wherein comparing the data characterizing the email message with the set of data characterizing the plurality of spam messages comprises:
finding, within the set of data characterizing the plurality of spam messages, one or more data items having additional information similar to additional information included in the data characterizing the plurality of spam messages; and
comparing a subset of hash values within the data characterizing the email message with a subset of hash values within each found data item until a similar subset of hash values is found.
8. The method of claim 3, further comprising:
evaluating the new spam message for the presence of noise;
modifying the content of the new spam message to reduce the noise; and
generating the data characterizing the spam message based on the modified content of the new spam message.
9. A method comprising:
receiving a spam message;
generating data characterizing the spam message based on content of the spam message; and
transferring the data characterizing the spam message to a server, the data characterizing the spam message being subsequently used to find incoming messages similar to the spam message.
10. The method of claim 9, further comprising:
upon receiving the spam message, evaluating the spam message for the presence of noise; and
modifying the content of the spam message to reduce the noise.
11. The method of claim 9, wherein evaluating the spam message for the presence of noise comprises detecting at least one of formatting data, a numeric character reference, a character entity reference, and predefined URL data.
12. The method of claim 9, wherein generating the data characterizing the spam message comprises:
dividing the spam message into a plurality of tokens; and
calculating a plurality of hash values for the plurality of tokens.
13. A system comprising:
an incoming message parser to receive an email message;
a message data generator to generate data characterizing the email message based on content of the email message; and
a similarity identifier to compare the data characterizing the email message with a set of data characterizing a plurality of spam messages, and to determine whether a similarity between the data characterizing the email message and any data item within the set of data characterizing the plurality of spam messages exceeds a threshold.
14. The system of claim 13, further comprising:
a database to store data characterizing a new spam message.
15. The system of claim 13, further comprising:
a message cleaning algorithm to evaluate the email message for the presence of noise added to avoid spam filtering, and to modify the content of the email message to reduce the noise.
16. A system comprising:
a spam content parser to receive a spam message;
a spam data generator to generate data characterizing the spam message based on content of the spam message; and
a spam data transmitter to transfer the data characterizing the spam message to a server, the data characterizing the spam message being subsequently used to find incoming messages similar to the spam message.
17. The system of claim 16, further comprising a noise reduction algorithm to evaluate the spam message for the presence of noise, and to modify the content of the spam message to reduce the noise.
18. The system of claim 17, wherein the noise reduction algorithm is to evaluate the spam message for the presence of noise by detecting at least one of formatting data, a numeric character reference, a character entity reference, and predefined URL data.
19. An apparatus comprising:
means for receiving an email message;
means for generating data characterizing the email message based on content of the email message;
means for comparing the data characterizing the email message with a set of data characterizing a plurality of spam messages; and
means for determining whether a similarity between the data characterizing the email message and any data item within the set of data characterizing the plurality of spam messages exceeds a threshold.
20. The apparatus of claim 19, further comprising:
means for receiving data characterizing a new spam message; and
a database for storing the received data.
21. An apparatus comprising:
means for receiving a spam message;
means for generating data characterizing the spam message based on content of the spam message; and
means for transferring the data characterizing the spam message to a server, the data characterizing the spam message being subsequently used to find incoming messages similar to the spam message.
22. The apparatus of claim 21, further comprising:
means for evaluating the spam message for the presence of noise; and
means for modifying the content of the spam message to reduce the noise.
23. The apparatus of claim 21, wherein the means for evaluating the spam message for the presence of noise comprises means for detecting at least one of formatting data, a numeric character reference, a character entity reference, and predefined URL data.
24. A computer-readable medium comprising executable instructions which, when executed on a processing system, cause the processing system to perform a method comprising:
receiving an email message;
generating data characterizing the email message based on content of the email message;
comparing the data characterizing the email message with a set of data characterizing a plurality of spam messages; and
determining whether a similarity between the data characterizing the email message and any data item within the set of data characterizing the plurality of spam messages exceeds a threshold.
25. The computer-readable medium of claim 24, wherein the method further comprises:
receiving data characterizing a new spam message; and
storing the received data in a database.
26. The computer-readable medium of claim 24, wherein the method further comprises:
upon receiving the email message, evaluating the email message for the presence of noise added to avoid spam filtering; and
modifying the content of the email message to reduce the noise.
27. A computer-readable medium comprising executable instructions which, when executed on a processing system, cause the processing system to perform a method comprising:
receiving a spam message;
generating data characterizing the spam message based on content of the spam message; and
transferring the data characterizing the spam message to a server, the data characterizing the spam message being subsequently used to find incoming messages similar to the spam message.
28. The computer-readable medium of claim 27, wherein the method further comprises:
upon receiving the spam message, evaluating the spam message for the presence of noise; and
modifying the content of the spam message to reduce the noise.
29. The computer-readable medium of claim 27, wherein evaluating the spam message for the presence of noise comprises detecting at least one of formatting data, a numeric character reference, a character entity reference, and predefined URL data.
30. A method comprising:
detecting, in an email message, data indicative of noise added to the email message to avoid spam filtering;
modifying content of the email message to reduce the noise; and
comparing the modified content of the email message with content of a spam message.
31. The method of claim 30, wherein the data indicative of noise added to the email message is selected from a group comprising: formatting data, one or more numeric character references, one or more character entity references, and URL data of predefined categories.
32. The method of claim 31, wherein modifying the content of the email message comprises:
extracting, from the email message, formatting data that does not qualify as an exception.
33. The method of claim 32, wherein formatting data that qualifies as an exception is formatting data required for processing of the email message.
34. The method of claim 31, wherein modifying the content of the email message comprises:
converting each numeric character reference and each character entity reference into a corresponding ASCII character;
determining that at least one of the converted numeric character references and character entity references contains any one of a numeric character reference and a character entity reference; and
converting said any one of the numeric character reference and the character entity reference into a corresponding ASCII character.
35. The method of claim 31, wherein modifying the content of the email message comprises:
converting each numeric character reference and each character entity reference within a URL into a corresponding ASCII character.
36. The method of claim 31, wherein modifying the content of the email message comprises:
removing unique identifier data of a predefined category from a URL.
37. The method of claim 31, wherein modifying the content of the email message comprises:
removing query data of a predefined category from a URL.
38. The method of claim 30, wherein comparing the modified content of the email message with the content of the spam message comprises:
determining whether the modified content of the email message is similar to the content of the spam message.
39. The method of claim 30, further comprising:
prior to comparing the modified content of the email message with the content of the spam message, modifying the content of the spam message to reduce noise.
40. A system comprising:
a message cleaning algorithm to detect, in an email message, data indicative of noise added to the email message to avoid spam filtering, and to modify content of the email message to reduce the noise; and
a similarity identifier to compare the modified content of the email message with content of a spam message.
41. The system of claim 40, wherein the data indicative of noise added to the email message is selected from a group comprising: formatting data, one or more numeric character references, one or more character entity references, and URL data of predefined categories.
42. The system of claim 41, wherein the message cleaning algorithm modifies the content of the email message by extracting, from the email message, formatting data that does not qualify as an exception.
43. The system of claim 42, wherein formatting data that qualifies as an exception is formatting data required for processing of the email message.
44. The system of claim 41, wherein the message cleaning algorithm modifies the content of the email message by: converting each numeric character reference and each character entity reference into a corresponding ASCII character; determining that at least one of the converted numeric character references and character entity references contains any one of a numeric character reference and a character entity reference; and converting said any one of the numeric character reference and the character entity reference into a corresponding ASCII character.
45. The system of claim 41, wherein the message cleaning algorithm modifies the content of the email message by converting each numeric character reference and each character entity reference within a URL into a corresponding ASCII character.
46. The system of claim 41, wherein the message cleaning algorithm modifies the content of the email message by removing unique identifier data of a predefined category from a URL.
47. The system of claim 41, wherein the message cleaning algorithm modifies the content of the email message by removing query data of a predefined category from a URL.
48. The system of claim 40, wherein the similarity identifier compares the modified content of the email message with the content of the spam message by determining whether the modified content of the email message is similar to the content of the spam message.
49. An apparatus comprising:
means for detecting, in an email message, data indicative of noise added to the email message to avoid spam filtering;
means for modifying content of the email message to reduce the noise; and
means for comparing the modified content of the email message with content of a spam message.
50. The apparatus of claim 49, wherein the data indicative of noise added to the email message is selected from a group comprising: formatting data, one or more numeric character references, one or more character entity references, and URL data of predefined categories.
51. A computer-readable medium comprising executable instructions which, when executed on a processing system, cause the processing system to perform a method comprising:
detecting, in an email message, data indicative of noise added to the email message to avoid spam filtering;
modifying content of the email message to reduce the noise; and
comparing the modified content of the email message with content of a spam message.
52. The computer-readable medium of claim 51, wherein the data indicative of noise added to the email message is selected from a group comprising: formatting data, one or more numeric character references, one or more character entity references, and URL data of predefined categories.
53. A method comprising:
dividing a first document into a plurality of tokens, each token including a predefined number of consecutive characters from the first document;
calculating a plurality of hash values for the plurality of tokens; and
creating a signature of the first document, the signature including a subset of hash values from the plurality of hash values and additional information pertaining to the plurality of tokens of the first document, the signature of the first document being subsequently compared with a signature of a second document to determine the similarity between the first document and the second document.
54. The method of claim 53, further comprising:
prior to dividing the first document into the plurality of tokens, changing each upper-case alphabetic character in the first document to a lower-case alphabetic character, and changing each non-alphabetic character in the first document to a single predefined non-alphabetic character.
55. The method of claim 53, wherein the first document is a first email message and the second document is a second email message.
56. The method of claim 53, wherein calculating the plurality of hash values for the plurality of tokens comprises:
creating a shingle for each of the plurality of tokens by combining the token with the number of occurrences of the token in the first document; and
applying a hash function to each created shingle.
57. The method of claim 56, wherein the shingles created for the plurality of tokens are represented as a histogram.
58. The method of claim 53, wherein the predefined number of consecutive characters in each token is equal to 3.
59. The method of claim 53, wherein the additional information pertaining to the plurality of tokens includes the number of tokens included in the first document.
60. The method of claim 53, wherein creating the signature of the first document comprises:
sorting the plurality of hash values; and
selecting a predefined number of the smallest hash values from the sorted plurality of hash values.
61. The method of claim 59, further comprising:
determining whether the number of tokens included in the first document is within an allowed range relative to the number of tokens included in the second document; and
deciding that the first document and the second document are dissimilar if the number of tokens included in the first document is not within the allowed range relative to the number of tokens included in the second document.
62. The method of claim 61, further comprising:
determining that the number of tokens included in the first document is within the allowed range relative to the number of tokens included in the second document;
determining whether the subset of hash values included in the signature of the first document is similar to the subset of hash values included in the signature of the second document; and
deciding that the first document is similar to the second document if the subset of hash values included in the signature of the first document is similar to the subset of hash values included in the signature of the second document.
63. The method of claim 62, wherein the second email message is a spam message.
64. The method of claim 63, further comprising:
marking the first email message as spam upon deciding that the first document is similar to the second document.
65. A system, comprising:
a parser to divide a first document into a plurality of tokens, wherein each token comprises a predetermined number of consecutive characters from said first document; and
a message data generator to calculate a plurality of hash values for said plurality of tokens and to create a signature of said first document, the signature comprising a subset of hash values from said plurality of hash values and additional information relating to the plurality of tokens of said first document, wherein the signature of said first document is subsequently compared with a signature of a second document to determine the similarity between said first document and said second document.
66. The system of claim 65, wherein said message data generator is further to change each uppercase alphabetic character in said first document to a lowercase alphabetic character, and to change each non-alphabetic character in said first document to a single predetermined non-alphabetic character.
67. The system of claim 65, wherein said first document is a first email message and said second document is a second email message.
68. The system of claim 65, wherein said message data generator is to create the plurality of hash values for said plurality of tokens by: creating a mark for each token of said plurality of tokens by combining the token with the occurrence number of that token in said first document; and applying a hash function to each created mark.
69. The system of claim 65, wherein the predetermined number of consecutive characters in each token equals 3.
70. The system of claim 65, wherein said additional information relating to said plurality of tokens comprises the number of tokens included in said first document.
71. The system of claim 65, wherein said message data generator is to create the signature of said first document by: sorting said plurality of hash values, and selecting a predetermined number of the smallest hash values from the sorted plurality of hash values.
72. The system of claim 70, further comprising a similarity identifier to determine whether the number of tokens included in said first document is within a permitted range relative to the number of tokens included in said second document, and, if the number of tokens included in said first document is not within the permitted range relative to the number of tokens included in said second document, to decide that said first document and said second document are dissimilar.
73. The system of claim 72, wherein said similarity identifier is to: determine that the number of tokens included in said first document is within the permitted range relative to the number of tokens included in said second document; determine whether the subset of hash values included in the signature of said first document is similar to the subset of hash values included in the signature of said second document; and, if the subset of hash values included in the signature of said first document is similar to the subset of hash values included in the signature of said second document, decide that said first document is similar to said second document.
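The signature generation recited in claims 65 to 71 can be sketched as below. This is an illustrative reading under stated assumptions, not the applicant's implementation: the overlapping (sliding-window) tokenization, the `_` filler character, the choice of MD5 truncated to 64 bits, the value `KEEP = 10`, and the interpretation of the "occurrence number" as the running ordinal of each repeated token are all hypothetical choices made for the example.

```python
import hashlib
from collections import Counter

TOKEN_LEN = 3      # claim 69: three consecutive characters per token
KEEP = 10          # assumed number of smallest hash values kept in the signature
FILLER = "_"       # assumed single predetermined non-alphabetic character


def normalize(text: str) -> str:
    """Lowercase letters and replace every non-alphabetic character with FILLER."""
    return "".join(c.lower() if c.isalpha() else FILLER for c in text)


def tokenize(text: str) -> list:
    """Split text into overlapping TOKEN_LEN-character tokens (overlap is an assumption)."""
    return [text[i:i + TOKEN_LEN] for i in range(len(text) - TOKEN_LEN + 1)]


def marks(tokens: list) -> list:
    """Combine each token with its occurrence number so repeats hash differently."""
    seen = Counter()
    out = []
    for tok in tokens:
        seen[tok] += 1
        out.append(f"{tok}#{seen[tok]}")
    return out


def hash_mark(mark: str) -> int:
    """Hash a mark to a 64-bit integer (MD5 is an arbitrary stand-in hash)."""
    return int.from_bytes(hashlib.md5(mark.encode("utf-8")).digest()[:8], "big")


def make_signature(document: str) -> tuple:
    """Return (sorted smallest KEEP hash values, total token count)."""
    tokens = tokenize(normalize(document))
    hashes = sorted(hash_mark(m) for m in marks(tokens))
    return tuple(hashes[:KEEP]), len(tokens)
```

Because of the normalization step, two messages that differ only in letter case or punctuation, such as `"BUY cheap pills now???"` and `"buy cheap pills now!!!"`, yield the same signature in this sketch.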
74. An apparatus, comprising:
means for dividing a first document into a plurality of tokens, wherein each token comprises a predetermined number of consecutive characters from said first document;
means for calculating a plurality of hash values for said plurality of tokens; and
means for creating a signature of said first document, the signature comprising a subset of hash values from said plurality of hash values and additional information relating to the plurality of tokens of said first document, wherein the signature of said first document is subsequently compared with a signature of a second document to determine the similarity between said first document and said second document.
75. The apparatus of claim 74, wherein the predetermined number of consecutive characters in each token equals 3.
76. The apparatus of claim 74, wherein said additional information relating to said plurality of tokens comprises the number of tokens included in said first document.
77. A computer-readable medium comprising executable instructions which, when executed on a processing system, cause said processing system to perform a method comprising:
dividing a first document into a plurality of tokens, wherein each token comprises a predetermined number of consecutive characters from said first document;
calculating a plurality of hash values for said plurality of tokens; and
creating a signature of said first document, the signature comprising a subset of hash values from said plurality of hash values and additional information relating to the plurality of tokens of said first document, wherein the signature of said first document is subsequently compared with a signature of a second document to determine the similarity between said first document and said second document.
78. The computer-readable medium of claim 77, wherein the predetermined number of consecutive characters in each token equals 3.
79. The computer-readable medium of claim 77, wherein said additional information relating to said plurality of tokens comprises the number of tokens included in said first document.
CN200480017663.9A 2004-05-14 2004-05-14 Method and device for filtrating rubbish E-mail based on similarity measurement Pending CN1922837A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2004/015383 WO2004105332A2 (en) 2003-05-15 2004-05-14 Method and apparatus for filtering email spam based on similarity measures

Publications (1)

Publication Number Publication Date
CN1922837A true CN1922837A (en) 2007-02-28

Family

ID=37779382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200480017663.9A Pending CN1922837A (en) 2004-05-14 2004-05-14 Method and device for filtrating rubbish E-mail based on similarity measurement

Country Status (1)

Country Link
CN (1) CN1922837A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101600178B (en) * 2009-06-26 2012-04-04 成都市华为赛门铁克科技有限公司 Method for confirming junk information as well as device and terminal therefor
CN102655480A (en) * 2011-03-03 2012-09-05 腾讯科技(深圳)有限公司 Similar mail handling system and method
WO2012116587A1 (en) * 2011-03-03 2012-09-07 腾讯科技(深圳)有限公司 Similar email processing system and method
CN102655480B (en) * 2011-03-03 2015-12-02 腾讯科技(深圳)有限公司 Similar mail treatment system and method
CN104008105A (en) * 2013-02-25 2014-08-27 腾讯科技(北京)有限公司 Method and device for identifying rubbish text
CN104982011A (en) * 2013-03-08 2015-10-14 比特梵德知识产权管理有限公司 Document classification using multiscale text fingerprints
CN104982011B (en) * 2013-03-08 2018-12-14 比特梵德知识产权管理有限公司 Use the document classification of multiple dimensioned text fingerprints
US9755616B2 (en) 2014-06-30 2017-09-05 Huawei Technologies Co., Ltd. Method and apparatus for data filtering, and method and apparatus for constructing data filter
CN105323153A (en) * 2015-11-18 2016-02-10 Tcl集团股份有限公司 Spam mail filtering method and device
CN108337153A (en) * 2018-01-19 2018-07-27 论客科技(广州)有限公司 A kind of monitoring method of mail, system and device
CN109635129A (en) * 2018-11-12 2019-04-16 西安万像电子科技有限公司 Data processing method, apparatus and system

Similar Documents

Publication Publication Date Title
US7831667B2 (en) Method and apparatus for filtering email spam using email noise reduction
US7739337B1 (en) Method and apparatus for grouping spam email messages
US20090070872A1 (en) System and method for filtering spam messages utilizing URL filtering module
US8526580B2 (en) System and method for voicemail organization
CN100352241C (en) Systems for customizing behaviors and interfaces in service invocations
CN1649423A (en) Electronic message forwarding
CN1300995C (en) Method and system for multiple-party, electronic mail receipts
CN1592229A (en) Electronic communications and web pages filtering based on URL
CN1702668A (en) System and method for social interaction
CN1946075A (en) Method and system to determine a user specific relevance score of a message within a messaging system
CN1609873A (en) Method, apparatus, and user interface for managing electronic mail and alert messages
CN1328668A (en) System and method for specifying www site
CN1913522A (en) RSS message interactive processing method based on XML file
CN1875361A (en) Method of predicting input
CN1894684A (en) Accessing different types of electronic messages through a common messaging interface
CN1467670A (en) Spam detector with challenges
CN101069175A (en) Dynamic message filtering
CN1918865A (en) Method, system and computer program product for generating and processing a disposable email address
CN1809821A (en) Feedback loop for spam prevention
CN1926532A (en) Data processing device capable of performing data transmission by a predetermined access method
CN101043348A (en) Method, system and equipment for realizing advertisement service
CN1933512A (en) Tollticket processing equipment and method
CN1922837A (en) Method and device for filtrating rubbish E-mail based on similarity measurement
CN1845616A (en) Short message service interface and channel adapting method for the same
CN101075989A (en) Method and system for verifying field validity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20070228