WO2011153894A1

WO2011153894A1 - Method and system for distinguishing image spam mail

Info

Publication number: WO2011153894A1
Application number: PCT/CN2011/074146
Authority: WO
Inventors: 林延中; 潘庆峰; 陈磊华
Original assignee: 盈世信息科技(北京)有限公司
Priority date: 2010-06-12
Filing date: 2011-05-17
Publication date: 2011-12-15
Also published as: CN101917352B; CN101917352A

Abstract

The present invention discloses a method and system for distinguishing image spam mail. The method includes the steps of: extracting characteristics of an image according to the compression ratio distribution of the image in the mail; calculating, by use of a probability and statistics formula, the probability that the image is spam mail according to the probability that each characteristic appears in a spam image; looking up a preset weight table according to the probability that the image is spam mail, the number of times it has been resent, and the reputation of the sender IP address, and calculating the weight sum of said image; and judging whether the image is spam mail according to said weight sum. By use of the present invention, image spam mail can be distinguished efficiently, and images with distortion or background noise can be distinguished.

Description

Method and system for identifying image spam

The present invention relates to the field of communications technologies, and in particular, to a method and system for identifying picture spam.

Background technique

With the rapid development of networks, it has become very common to communicate by e-mail. Computer files of many kinds, such as pictures, documents, audio, and video, can be transmitted to the recipient by e-mail, which brings great convenience to people's lives. At the same time, spam has also spread, which seriously threatens the stability and security of users' mailboxes.

Currently, there are two main methods for identifying image spam. In the first, an OCR (Optical Character Recognition) system is used to extract text from the image, the extracted text is segmented into words, and the probability that an email containing each word is spam is obtained from a sample library. Finally, these per-word probabilities are substituted into the Bayesian formula to calculate the probability that the email is spam. If that probability is greater than a predetermined threshold, the message is marked as spam.

However, since OCR technology requires the image to be decoded into pixels in advance, it is very inefficient, especially when processing high resolution images. Moreover, OCR technology can only extract text in printed fonts. If the font in the picture is slightly deformed or the background contains noise, the recognition rate drops rapidly, or the text cannot be recognized at all. Therefore, the existing spam filtering method of extracting text from a picture using OCR technology is low in efficiency and cannot handle a picture that is distorted or whose background contains noise information.

Summary of the invention

Embodiments of the present invention provide a method and system for identifying picture spam, which is highly efficient in identifying picture spam, and capable of recognizing a picture that is distorted or contains background noise information.

An embodiment of the present invention provides a method for identifying image spam, including:

Extracting a feature value of the picture according to a compression rate distribution characteristic of the picture in the mail;

Applying a probability statistical formula, based on the probability that each feature value of the picture appears in junk pictures, to calculate the probability that the picture is spam;

Applying a hash algorithm to calculate a hash value of the picture, comparing the hash value with a hash value of the received mail picture, and obtaining the number of times the picture is repeatedly sent;

Querying a reputation value database according to the sending IP of the picture to obtain a reputation value of the sending IP; and querying a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, calculating a weight sum of the picture, and determining whether the picture is spam based on the weight sum.

The reputation value database stores the reputation value of each sending IP, and each reputation value corresponds to one sending IP.

An embodiment of the present invention further provides a mail system, including:

a picture feature extraction module, configured to extract a feature value of the picture according to a compression rate distribution characteristic of the picture in the mail;

a spam probability acquisition module, configured to calculate the probability that the picture is spam by applying a probability statistical formula to the probability that each feature value of the picture appears in junk pictures;

a picture sending times obtaining module, configured to apply a hash algorithm to calculate a hash value of the picture, compare the hash value with a hash value of the received mail picture, and obtain the number of times the picture is repeatedly sent;

a reputation value obtaining module, configured to query a reputation value database according to the sending IP of the mail, to obtain a reputation value of the sending IP;

a spam determination module, configured to query a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum.

The mail system further includes:

a sample database for storing all feature values of the garbage image sample and the normal image sample, and the probability that each feature value appears in the garbage picture;

a reputation value database for storing the reputation value of the outgoing IP; the reputation value is the proportion of normal mail sent by the outgoing IP in all of its sent mails;

a reputation value update module, configured to recalculate the reputation value of the sending IP of the picture after the spam determination module determines that the picture is spam, and update the corresponding reputation value in the reputation value database.

Embodiments of the present invention have the following beneficial effects:

The method and system for identifying image spam according to embodiments of the present invention extract feature values of a picture in a mail based on the compression ratio distribution characteristic of the picture, and calculate the probability that the picture is spam by using a probability statistical formula. A preset weight value list is then queried according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP; the weight sum of the picture is calculated, and whether the picture is spam is determined based on the weight sum. The present invention recognizes picture spam based on the compression ratio distribution of pictures, is highly efficient, and is capable of recognizing pictures that are distorted or whose background contains noise information. In addition, the present invention applies a hash algorithm to determine the similarity of pictures and counts the number of times similar pictures are repeatedly sent. This feature makes it possible to judge whether the sender's behavior resembles the sending behavior of spam, thereby improving the accuracy of identifying image spam.

Description of the drawings

FIG. 1 is a schematic flowchart of a first embodiment of a method for identifying picture spam provided by the present invention;

FIG. 2 is a schematic diagram of a support vector machine algorithm provided by the present invention;

FIG. 3 is a schematic flowchart of a second embodiment of a method for identifying picture spam provided by the present invention;

FIG. 4 is a schematic flowchart of a third embodiment of a method for identifying picture spam provided by the present invention;

FIG. 5 is a schematic structural diagram of the mail system provided by an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a picture feature extraction module according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a spam probability acquisition module according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a picture sending times obtaining module according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a spam determining module according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention will be described in detail with reference to the accompanying drawings. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work are within the scope of the present invention.

The method and system for identifying image spam according to embodiments of the present invention collect normal picture samples and spam picture samples in advance, extract picture features based on the compression rate distribution characteristics of the pictures, and obtain feature sets of normal pictures and spam pictures. A Bayesian classifier then learns these feature sets and calculates, for the most representative features, a result set of the probabilities that a picture is a junk picture or a normal picture. The details are as follows:

First, collect samples of normal pictures and spam pictures:

Image capture software is used to randomly collect JPG or GIF images from the Internet, and these images are added to the normal mail sample library.

A reporting system is deployed in the mail system to collect spam containing images submitted by users. After manual verification confirms that an image is spam, the image is added to the spam sample database.

Second, extract all the features contained in normal images and spam pictures:

In the embodiment of the present invention, the picture feature is extracted based on the compression rate distribution characteristic of the picture. The following is a detailed description of the method for extracting the picture feature by taking the picture in the JPG format, the GIF format, and the PNG format as an example.

(1) Calculating the compression ratio of the JPG format image;

The JPG format compresses a picture by dividing it into sub-blocks of 8*8 pixels, compressing each sub-block independently, and then saving the compressed block information to the file. Therefore, when analyzing the features of a JPG picture, it is only necessary to obtain the size of each sub-block after compression and divide that size by (8*8); after rounding, this gives the compression ratio of the sub-block, with no need to decompress the sub-block.

Scanning the entire JPG file yields a compression ratio sequence C1, C2, C3, C4, ..., where C1 represents the compression ratio of the 8*8 pixel sub-block in the upper left corner of the image, C2 is the compression ratio of the next adjacent sub-block, and C3, C4 follow by analogy.
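The per-block calculation described above can be sketched as follows. This is only a minimal illustration, assuming the compressed size of each 8*8 sub-block has already been read from the file; the function name `jpg_compression_ratios`, the use of bits as the size unit, and round-half-up rounding are assumptions that the text does not specify.

```python
import math

PIXELS_PER_BLOCK = 8 * 8  # each JPG sub-block covers 8*8 pixels


def jpg_compression_ratios(block_sizes):
    """Map compressed sub-block sizes to integer compression ratios C1, C2, ..."""
    # Assumed rounding rule: round half up.
    return [math.floor(size / PIXELS_PER_BLOCK + 0.5) for size in block_sizes]


# Hypothetical sizes of four consecutive compressed sub-blocks.
ratios = jpg_compression_ratios([64, 128, 200, 32])
```

No sub-block is decompressed here: only the stored sizes are used, which is the point of the approach.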

(2) Calculating the compression ratio of the GIF format picture;

GIF format pictures are compressed with the well-known LZW compression algorithm. The main idea of the LZW algorithm is to maintain a code table with 256 elements. If a pixel sequence in the picture has already appeared in the code table, the subscript of the code table entry is used in place of the pixel sequence, thereby achieving compression.

When analyzing the features of a GIF picture, it is only necessary to read the code table subscripts (the length of a subscript is fixed at one byte) and look up the number of pixels that each subscript represents in the code table. The compression ratio of that small part of the picture is then 1 / (the number of pixels corresponding to the code table entry).

Scanning the entire GIF file yields a compression ratio sequence C1, C2, C3, C4, ..., where C1 represents the compression ratio of a fixed-length run of pixels in the upper left corner of the picture, and C2, C3, C4 follow by analogy.

(3) calculating the compression ratio of the PNG format picture;

PNG format pictures use the LZ77 compression algorithm, which is similar to the LZW compression algorithm of GIF pictures. The only difference is that the LZ77 algorithm does not have a fixed code table, but instead represents a pixel sequence by the relative position and length of a sequence that has been encountered before. For example, when compressing the pixel sequence abcdeabcde, no earlier occurrence of a, b, c, d, or e exists while the first abcde is being scanned, so the first abcde is not compressed; that is, the input sequence abcde is equal to the compressed sequence. However, when the scan reaches the second a, sequence a has appeared before, and continued comparison shows that the whole of abcde has appeared before, so the second occurrence of abcde can be represented by just an offset and a length. In other words, the LZ77 algorithm used by PNG pictures has no fixed code table; its code table is implicit in the sequence that has appeared before the current position. It should be noted that the LZ77 compression algorithm is well known in the art, and the above description covers only the basic principle. In practice, the offset and length information of a PNG picture is stored in units of bits so as to save space.

Therefore, the compression ratio of a PNG picture can be derived from the compressed PNG data stream. For a data sequence that has not been compressed, the compression ratio is 1. A compressed data sequence is represented by an (offset, length) pair, and the information it stands for can be found at the corresponding position of the output sequence that has already been decompressed. Assuming that storing the (offset, length) pair requires N bytes and that the value of its "length" attribute is M, the compression ratio is N/M (that is, N bytes store M bytes of information).

By analyzing the compressed PNG data stream, a compression ratio sequence C1, C2, C3, C4, ... can be obtained, where C1 represents the compression ratio of a fixed-length pixel sequence in the upper left corner of the picture, and C2, C3, C4 follow by analogy.
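The N/M rule can be illustrated with a short sketch. It assumes the compressed stream has already been parsed into a list of tokens; the token representation, the function name `png_compression_ratios`, and the value N = 2 are hypothetical choices made for illustration, not details given in the text.

```python
from fractions import Fraction


def png_compression_ratios(tokens, pair_bytes):
    """Return one compression ratio per token of an LZ77-style stream.

    tokens: list of ("literal", data) or ("match", offset, length) tuples
            (a hypothetical, already-parsed representation).
    pair_bytes: N, the number of bytes assumed to store one (offset, length) pair.
    """
    ratios = []
    for token in tokens:
        if token[0] == "literal":
            ratios.append(Fraction(1))  # uncompressed data: ratio 1
        else:
            _, _offset, length = token
            ratios.append(Fraction(pair_bytes, length))  # N bytes store M bytes
    return ratios


# "abcdeabcde": the first abcde is stored as is, the second as (offset=5, length=5).
example = png_compression_ratios(
    [("literal", b"abcde"), ("match", 5, 5)], pair_bytes=2
)
```

With these assumed values the second abcde has ratio 2/5, since 2 bytes stand in for 5 bytes of pixel data.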

The embodiment of the invention does not need to decompress the picture, and saves a large amount of computing resources and memory resources.

(4) calculating the feature value of the picture;

After the compression rate sequence of a picture in JPG, GIF, or PNG format is obtained as described in (1), (2), and (3) above, every 4 consecutive compression ratios are merged into a new compression rate change element D (4 is an empirical value obtained from experiments, and the present invention is not limited to 4). D represents the change in compression ratio across four adjacent sub-blocks of the picture. For example, the compression rate sequence C1, C2, C3, C4, C5, C6, C7, C8 is converted into the sequence D1, D2, where D1 = C1C2C3C4 and D2 = C5C6C7C8.
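The grouping step can be sketched as below. Representing each element D as a tuple and dropping an incomplete trailing group are assumptions; the text does not say how a sequence whose length is not a multiple of 4 is handled.

```python
def compression_change_elements(ratios, group_size=4):
    """Merge every `group_size` consecutive ratios into one change element D."""
    # Assumption: ratios left over at the end (fewer than group_size) are dropped.
    usable = len(ratios) - len(ratios) % group_size
    return [tuple(ratios[i:i + group_size]) for i in range(0, usable, group_size)]


# C1..C8 become D1 = (C1, C2, C3, C4) and D2 = (C5, C6, C7, C8).
elements = compression_change_elements([1, 2, 3, 4, 5, 6, 7, 8])
```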

After obtaining the compression ratio change element sequence of the picture, each compression rate change element is added with the relative position information of the element to form a feature value.

For example, the picture is divided into six areas, each of which corresponds to a fixed position code, as follows:

Upper left corner area: position code is 1;

Upper area: position code is 2;

Upper right corner area: position code is 3;

Lower left corner area: position code is 4;

Lower area: position code is 5;

Lower right corner area: position code is 6.

If the pixel block is located in the upper left corner of the picture and the compression rate change element is D1, the feature value F1 containing the position information is 1D1; if the pixel block is located in the upper right corner of the picture and the compression rate change element is D2, the feature value F2 containing the position information is 3D2. And so on: combining each compression rate change element with the position code of the corresponding pixel block on the picture (position code + compression rate change element D) yields the feature sequence of the picture: F1, F2, F3, F4, ...

It should be noted that the foregoing takes pictures in the JPG, GIF, and PNG formats only as examples to illustrate the method for extracting picture features based on the compression rate characteristics of pictures. The embodiment of the present invention can also be applied to pictures in other formats that have similar compression rate characteristics.
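A feature value built from a position code and a change element might look like the sketch below. The text names the six areas but not their boundaries, so the even grid of 3 columns by 2 rows used here is an assumption, as are the function names and the tuple representation.

```python
def position_code(x, y, width, height):
    """Return the position code 1..6 for pixel (x, y) in a width x height picture.

    Assumed layout: 3 columns by 2 rows of equal size, numbered left to right,
    top row first (1, 2, 3), then bottom row (4, 5, 6).
    """
    column = min(x * 3 // width, 2)
    row = min(y * 2 // height, 1)
    return row * 3 + column + 1


def feature_value(code, element):
    """Combine a position code with a compression rate change element D."""
    return (code, element)


f1 = feature_value(position_code(0, 0, 300, 200), (1, 2, 3, 4))    # upper left
f2 = feature_value(position_code(299, 0, 300, 200), (5, 6, 7, 8))  # upper right
```

Keeping the position code as the first component means the same element D in two different areas produces two distinct feature values, matching the 1D1 versus 3D2 example.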

Third, the establishment of a sample database:

(1) establishing a feature set of normal pictures and spam pictures;

After all the feature values contained in the normal pictures and the junk pictures are calculated by the method of the second step above, all the feature values of the normal pictures are saved in the normal picture feature set HAM, and all the feature values of the junk pictures are saved in the junk picture feature set SPAM.

In addition, the normal picture feature set HAM also records the number of times each feature value appears in all normal picture samples. For example, the number of occurrences of the feature value F1 in all normal picture samples is 10000, and the number of occurrences of the feature value F2 in all normal picture samples is 20000, and so on.

Similarly, the garbage image feature set SPAM also records the number of times each feature value appears in all junk image samples. For example, the number of occurrences of the feature value F1 in all junk picture samples is 30000, the number of times the feature value F2 appears in all junk picture samples is 40000, and so on.

A particular feature value Fn may appear in both the spam picture samples and the normal mail picture samples, and the numbers of occurrences are generally not equal.

(2) Calculating the probability that each feature value appears in junk pictures, and forming the sample database;

For each feature value F, the number of times F appears in the normal picture samples and in the spam picture samples is read from the normal picture feature set HAM and the junk picture feature set SPAM respectively, and a Bayesian classifier is used to calculate the probability Q that the feature value F appears in spam pictures. For example, the probability that the feature value F1 appears in spam pictures is Q1, the probability that the feature value F2 appears in spam pictures is Q2, and the probability that the feature value F3 appears in spam pictures is Q3. The correspondence between F and Q is saved as F1:Q1, F2:Q2, F3:Q3, ..., which builds the sample database. The sample database established by the embodiment of the present invention stores all the feature values of the junk picture samples and the normal picture samples, together with the probability that each feature value appears in junk pictures.

Optionally, in the embodiment of the present invention, the sequence "F1:Q1, F2:Q2, F3:Q3..." may be sorted by Q value, and only the entries F:Q whose Q value is greater than 80% (indicating that these feature values are highly likely to appear in spam samples) and the entries F:Q whose Q value is less than 20% (indicating that these feature values are highly likely to appear in normal mail samples) are extracted and saved to the sample database as the final criteria for Bayesian evaluation. Experience has shown that for an entry F:Q with a Q value in (20%, 80%), the feature value F appears a similar number of times in normal pictures and in spam pictures, so F is of little help in judging whether a picture is spam. Such neutral F:Q entries account for about 80% of all F:Q entries, so eliminating them helps speed up the evaluation of pictures.
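One way to turn the occurrence counts into Q is sketched below. The text states only that a Bayesian classifier is used; the estimator shown (relative frequency in spam divided by the sum of the relative frequencies in spam and in normal pictures) is a common choice in Bayesian spam filtering and is an assumption here, as are the equal sample totals and the neutral value 0.5.

```python
def spam_probability(spam_count, ham_count, spam_total, ham_total):
    """Estimate Q, the probability that a feature value appears in spam pictures.

    Assumed estimator: relative frequency in spam divided by the sum of the
    relative frequencies in spam and in normal pictures.
    """
    spam_freq = spam_count / spam_total
    ham_freq = ham_count / ham_total
    if spam_freq + ham_freq == 0:
        return 0.5  # assumed neutral value for a feature never seen
    return spam_freq / (spam_freq + ham_freq)


# Occurrence counts from the F1/F2 examples; equal totals are an assumption.
HAM = {"F1": 10000, "F2": 20000}
SPAM = {"F1": 30000, "F2": 40000}
TOTAL = 100000
sample_db = {f: spam_probability(SPAM[f], HAM[f], TOTAL, TOTAL) for f in SPAM}
```

Under these assumptions F1 gets Q of about 0.75 and F2 about 0.67.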
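The optional pruning of neutral entries amounts to a simple filter. In this sketch the sample database is assumed to be a dictionary mapping feature values to Q, and the function name `prune_neutral` is hypothetical.

```python
def prune_neutral(sample_db, low=0.2, high=0.8):
    """Keep only F:Q entries with Q below `low` or above `high`."""
    return {f: q for f, q in sample_db.items() if q < low or q > high}


# F2 and F4 fall in the neutral band and are dropped.
pruned = prune_neutral({"F1": 0.95, "F2": 0.50, "F3": 0.05, "F4": 0.80})
```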

The method and system for identifying picture spam provided by the embodiment of the present invention are described in detail below with reference to the accompanying drawings. The probability and statistics formula of the embodiment of the present invention includes a Bayesian formula and/or a support vector machine (SVM) formula. The probability that the picture is spam obtained by applying the Bayesian formula is called the "first probability"; the probability that the picture is spam obtained by applying the support vector machine formula is called the "second probability".

Referring to Fig. 1, a flow chart of a first embodiment of a method for identifying picture spam provided by the present invention is shown.

In the first embodiment, the Bayesian formula is applied to calculate the probability that the picture is spam. The method includes the following steps:

S101. Extract a feature value of the picture according to a compression rate distribution characteristic of the picture in the mail. In a specific implementation, after the email is received, this step includes: scanning a picture included in the email to obtain the compression ratio of each sub-block of the picture; combining the compression ratios of every N consecutive sub-blocks into one new compression rate change element; and combining each compression rate change element with the position code of its location in the picture to obtain the feature values of the picture. N is a natural number greater than 1. Preferably, the value of N is 4.

It should be noted that the embodiment of the present invention can process pictures in JPG, GIF, PNG, or other formats. The method for extracting the features of pictures in JPG, GIF, or PNG format based on the compression ratio distribution characteristic of the picture is the same as in the above embodiment and is not repeated here.

S102. Apply a probability statistical formula, according to the probability that each feature value of the picture appears in junk pictures, to calculate the probability that the picture is spam.

The probabilistic statistical formula is a Bayesian formula. The classification principle of the Bayesian classifier is to calculate the posterior probability by using the Bayesian formula, that is, the probability that the object belongs to a certain class. Select the class with the largest a posteriori probability as the class to which the object belongs.

The mathematical basis of the Bayes classifier is the Bayesian formula, as follows:

If B1, B2, ... is a series of mutually exclusive events, P(Bi) denotes the probability that event Bi occurs, and

$$\bigcup_{i} B_i = \Omega, \qquad P(B_i) > 0, \quad i = 1, 2, \ldots$$

then for any event A with P(A) > 0,

$$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{k} P(B_k)\,P(A \mid B_k)}, \qquad i = 1, 2, \ldots$$

After all the feature values of the picture are obtained in step S101, step S102 queries the sample database with each feature value of the picture to obtain the probability that each feature value appears in junk pictures. These probabilities are then substituted into the Bayesian formula above to calculate the first probability, which is the probability that the picture is spam.

For example, after a picture mail is received and it is not yet known whether it is spam, the method of step S101 is applied to obtain all the feature values of the picture: F1, F2, F3, .... The sample database is then queried to obtain the probability that each feature value appears in junk pictures: F1:Q1, F2:Q2, F3:Q3, .... Finally, the Bayesian formula is applied to the feature value sequence "F1, F2, F3..." and the probability statistics "F1:Q1, F2:Q2, F3:Q3..." to calculate the probability that the unknown picture mail is spam.
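A possible way to combine the per-feature probabilities is sketched below. It applies the Bayesian formula to two classes (spam and normal) under the additional assumptions of equal prior probabilities and independent feature values, which the text does not state; skipping feature values absent from the sample database and returning 0.5 when nothing is known are further assumptions.

```python
import math


def first_probability(features, sample_db):
    """Combine per-feature probabilities Q into one spam probability."""
    # Assumption: feature values missing from the sample database are skipped.
    qs = [sample_db[f] for f in features if f in sample_db]
    if not qs:
        return 0.5  # assumed neutral result when no feature is known
    spam = math.prod(qs)
    ham = math.prod(1.0 - q for q in qs)
    if spam + ham == 0:
        return 0.5  # assumed neutral result for contradictory evidence
    return spam / (spam + ham)


# F9 is not in the (hypothetical) sample database and is ignored.
p = first_probability(["F1", "F2", "F9"], {"F1": 0.9, "F2": 0.8})
```

For long feature sequences the products can underflow, so a practical implementation would likely sum logarithms instead of multiplying probabilities.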

S103: Apply a hash algorithm to calculate a hash value of the picture, compare the hash value with a hash value of the received mail picture, and obtain the number of times the picture is repeatedly sent.

The Nilsimsa algorithm is a well-known hash algorithm with the following characteristic: if the input changes only slightly, the output hash value changes only slightly or not at all. Since the length of the output sequence is fixed regardless of the length of the input sequence, the Nilsimsa algorithm can be applied to the input sequences, and their similarity can be judged by comparing the similarity of the output sequences, which greatly speeds up the clustering of similar information.

Specifically, step S103 includes: applying the Nilsimsa algorithm to the feature values of the picture to obtain a hash value of the picture; comparing the hash value of the picture with the hash values of the received mail pictures to obtain the similarity between the picture and the received mail pictures; and determining, according to that similarity, the number of times the picture has been repeatedly sent. An example is as follows:

Assume that all the feature values F1, F2, F3, ... of the picture are obtained in step S101. In step S103, the Nilsimsa algorithm processes these feature values: the input sequence is "F1, F2, F3...", and the output sequence is a fixed-length binary sequence "O1, O2, O3...". The length of the output sequence is generally 64 bytes, and each element O is 0 or 1. The binary sequence "O1, O2, O3..." is the hash value of the picture. The hash value of the picture is then compared with the hash values of previously received mail pictures, and the number of times a similar picture has been repeatedly sent is determined according to the similarity between the pictures.

The Nilsimsa algorithm has the following advantage: if the input sequence "F1, F2, F3..." is only slightly modified (for example, by inserting several small sequences into it or modifying the contents of a small sequence), the output binary sequence is highly stable, with little or no change. Therefore, comparing the similarity of two output sequences reveals the similarity of the two input sequences, from which the number of times similar pictures have been repeatedly sent can be determined.
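The comparison of output sequences can be sketched as a count of matching bits. The sketch does not implement the Nilsimsa algorithm itself; it assumes the digests have already been computed, uses short 2-byte digests purely for illustration, and treats the threshold `min_matching_bits` as a hypothetical tuning parameter.

```python
def matching_bits(hash_a, hash_b):
    """Count bit positions at which two equal-length digests agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("digests must have the same length")
    differing = sum(bin(a ^ b).count("1") for a, b in zip(hash_a, hash_b))
    return len(hash_a) * 8 - differing


def repeat_count(new_hash, received_hashes, min_matching_bits):
    """Count previously received pictures whose digest is similar enough."""
    return sum(
        1 for old in received_hashes
        if matching_bits(new_hash, old) >= min_matching_bits
    )


# Two stored digests: the first differs from the new one in a single bit.
seen = [bytes([0b11110000, 0x00]), bytes([0b00001111, 0xFF])]
count = repeat_count(bytes([0b11110001, 0x00]), seen, min_matching_bits=14)
```

Only the first stored digest clears the threshold, so the new picture is counted as a repeat of one earlier picture.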

S104. Query a reputation value database according to the sending IP of the picture, and obtain a reputation value of the sending IP.

The embodiment of the invention configures a reputation value database for storing the reputation values of sending IPs. A reputation value is computed as follows: the sending behavior of the sending IP over a past period of time is recorded, and the proportion of normal mail among all mail sent by that IP is used as its reputation value. For example, if a sending IP sent 100 emails in the past and 10 of them were judged to be spam, the calculation (100 - 10) / 100 = 90% gives the sending IP a reputation value of 90.

Therefore, in step S104, the reputation value of the originating IP of the picture mail is obtained by querying the reputation value database according to the sending IP of the picture mail.
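The reputation calculation from the example is a one-line ratio. In this sketch the handling of an IP with no sending history (returning `None`) is an assumption, since the text does not cover that case.

```python
def reputation_value(total_sent, spam_sent):
    """Reputation = percentage of normal mail among all mail sent by an IP."""
    if total_sent == 0:
        return None  # assumption: no history means no reputation value yet
    return (total_sent - spam_sent) * 100 / total_sent


# 100 mails sent, 10 judged to be spam.
rep = reputation_value(100, 10)
```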

S105. Query a preset weight value list according to the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum.

In the embodiment of the present invention, three weight value lists are pre-configured, and the probability that the picture is spam, the number of times of repeated transmission, and the weight value corresponding to the reputation value of the sending IP are respectively recorded.

(1) In the embodiment of the present invention, the "probability that the picture is spam" is divided into 10 segments according to the range in which the probability falls, and a weight value is configured for each segment. The weight list for "probability that the picture is spam" is as follows:

(2) According to the embodiment of the present invention, the "number of repeated transmissions of pictures" is defined as 6 segments according to the range in which the number of repeated transmissions of picture mails is located, and the weight value of each segment is configured. The weights for "Number of image resends" are as follows:

(3) In the embodiment of the present invention, the "sending IP reputation value" is divided into 10 segments according to the range in which the reputation value of the sending IP falls, and a weight value is configured for each segment. The weight list for "sending IP reputation value" is as follows:

Sending IP reputation value    Reputation value range    Weight (real number)
REPUTATION_0_10                [0, 10]                   REPUTATION_0_10_W
REPUTATION_10_20               [10, 20]                  REPUTATION_10_20_W
REPUTATION_20_30               [20, 30]                  REPUTATION_20_30_W
REPUTATION_30_40               [30, 40]                  REPUTATION_30_40_W
REPUTATION_40_50               [40, 50]                  REPUTATION_40_50_W
REPUTATION_50_60               [50, 60]                  REPUTATION_50_60_W
REPUTATION_60_70               [60, 70]                  REPUTATION_60_70_W
REPUTATION_70_80               [70, 80]                  REPUTATION_70_80_W
REPUTATION_80_90               [80, 90]                  REPUTATION_80_90_W
REPUTATION_90_100              [90, 100]                 REPUTATION_90_100_W

Preferably, the weight values of the above three lists are obtained by learning from known samples using a genetic algorithm.

It should be noted that, in the embodiment of the present invention, the probability that the picture is spam, the number of times it has been repeatedly sent, and the reputation value of the sending IP are segmented in order to reduce the amount of calculation in subsequent processing. The numbers of segments (10 segments for the "probability that the picture is spam", 6 segments for the "number of repeated transmissions of the picture", and 10 segments for the "sending IP reputation value") are only empirical figures, and the present invention is not limited thereto.

Specifically, after steps S102, S103, and S104 have produced the probability that the picture is spam, the number of times the picture has been repeatedly sent, and the reputation value of the sending IP, step S105 proceeds as follows: the preset weight value lists are queried with these three values to obtain their respective weight values; the three weight values are added to obtain the weight sum of the picture; and it is determined whether the weight sum is greater than a predetermined threshold. If so, the picture is determined to be spam; if not, the picture is determined to be normal mail.

For example, assume that for an email containing a picture, steps S101 to S104 find that the probability that the picture in the mail is spam is 95%, the number of times it has been repeatedly sent is 2, and the reputation value of the sending IP is 78. Querying the weight lists yields BAYES_90 (assuming a weight value of 0.5), REPUTATION_0_10 (assuming a weight value of 0.1), and REPUTATION_70_80 (assuming a weight value of 0.3), so the weight sum of the mail picture is calculated as 0.5 + 0.1 + 0.3 = 0.9. Since the weight sum is less than the threshold of 1.0, the message is judged to be normal mail.

Further, the method for identifying image spam provided by the embodiment of the present invention further includes: after the picture in the mail is determined to be spam, recalculating the reputation value of the sending IP of the picture, and updating the corresponding reputation value in the reputation value database.
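The lookup-and-sum procedure of step S105 can be sketched as follows. Only the three weights used in the worked example (0.5, 0.1, 0.3) and the threshold 1.0 come from the text; every other weight, the segment boundaries for the repeat count, the half-open treatment of segment edges, and all names in the code are placeholders chosen for illustration.

```python
import bisect

THRESHOLD = 1.0

# Hypothetical weight lists, one weight per segment.
BAYES_W = [0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5]       # 10 segments
REPEAT_BOUNDS = [10, 50, 100, 500, 1000]                            # assumed edges
REPEAT_W = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]                           # 6 segments
REPUTATION_W = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]  # 10 segments


def weight_sum(probability, repeats, reputation):
    """Look up the weight of each of the three values and add them."""
    bayes_w = BAYES_W[min(int(probability * 10), 9)]
    repeat_w = REPEAT_W[bisect.bisect_right(REPEAT_BOUNDS, repeats)]
    reputation_w = REPUTATION_W[min(int(reputation // 10), 9)]
    return bayes_w + repeat_w + reputation_w


def is_spam(probability, repeats, reputation):
    """A picture is spam when its weight sum exceeds the threshold."""
    return weight_sum(probability, repeats, reputation) > THRESHOLD


# The worked example: probability 95%, sent 2 times, reputation 78.
total = weight_sum(0.95, 2, 78)
```

With these placeholder tables the example yields a weight sum of about 0.9, which does not exceed the threshold, so the picture is treated as normal mail.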

In addition, the embodiment of the present invention can also use the SVM (Support Vector Machine) algorithm to calculate the probability that the picture is a junk picture. The SVM algorithm can be explained intuitively through Figure 2, as follows:

Define a function f(x, y) = a1*x + a2*y + b, where x is an intrinsic feature of the mail, y is another intrinsic feature of the mail that is unrelated to x, and a1, a2, and b are constants. The constants a1 and a2 control the slope of the plane that separates the two types of points in Figure 2. If the crosses in Figure 2 represent spam and the dots represent normal mail, then whether a mail is spam depends only on x and y: as long as f(x, y) is greater than a certain value, the mail is considered to be spam.

In practice, classifying a sample well usually requires extracting hundreds to thousands of features. Such a multi-dimensional model cannot be drawn as a three-dimensional figure. However, it can be inferred that the final SVM formula is a polynomial of the form f(x, y, z, ...) = a1*x + a2*y + a3*z + ... + b. The values of the features x, y, z, and so on of an unknown sample are substituted into the SVM formula, and the sample is judged to be spam or not based on whether the result is greater than zero.
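Evaluating a formula of this form for one sample can be sketched as follows. The coefficient values in the usage lines are made up purely to exercise the code, and the function names `svm_decision` and `svm_is_spam` are hypothetical.

```python
def svm_decision(features, coefficients, bias):
    """Evaluate f = a1*x + a2*y + a3*z + ... + b for one sample."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return sum(a * x for a, x in zip(coefficients, features)) + bias


def svm_is_spam(features, coefficients, bias):
    """Judge the sample to be spam when the formula yields a value above zero."""
    return svm_decision(features, coefficients, bias) > 0


# Made-up parameters a1 = 2, a2 = -1, b = -4.
print(svm_decision([3, 1], [2, -1], -4))  # 1
print(svm_is_spam([3, 1], [2, -1], -4))   # True
print(svm_is_spam([1, 1], [2, -1], -4))   # False
```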

One of the keys of the SVM model is to learn the parameters a1, a2, a3, ..., b of the above formula from known samples. In a specific implementation, as long as enough samples are provided (about one thousand each of normal mail and spam), the above parameters can be obtained through specific mathematical methods, thereby obtaining the SVM formula. It should be noted that the prior art contains many mature mathematical methods for obtaining the above parameters. For example, a method for finding edge key points can be used, and details are not described herein.

Another key to the SVM model is whether the extracted "features" describe the problem well, that is, whether the "feature values" represented by x, y, z, and so on can distinguish the two types of samples well. The solution of the embodiment of the present invention is to use the probability that each picture feature item appears in spam as an input feature of the SVM. In the learning process, after the probability of occurrence of each feature value in spam has been counted, a feature value probability sequence is constructed according to the order in which the feature values appear, and the above SVM formula is obtained through the learning program (that is, the above parameters a1, a2, a3, ..., b are obtained). For example, suppose a picture yields, in the order in which they are decomposed from the picture file, four feature values T1, T2, T3, and T4 (there may be many more), and statistics show that their rates of occurrence in spam are G1, G2, G3, and G4 respectively. G1, G2, G3, and G4 are then input to the SVM learning program as a vector. By learning a batch of normal mail and spam, the SVM formula that fits the learning samples is obtained. When evaluating whether an unknown sample is spam, the probability values G1, G2, G3, and G4 of the feature values T1, T2, T3, and T4 are likewise arranged in the order decomposed from the picture file, and G1, G2, G3, and G4 are substituted into the SVM formula to calculate the probability that the sequence is spam.
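Constructing the probability sequence can be sketched as a table lookup that preserves the order of the feature values. The probability table below is invented for illustration, the function name `build_feature_vector` is hypothetical, and the fallback of 0.5 for a feature value that never occurred in the learning samples is an assumption, since this passage does not say how such values are handled.

```python
def build_feature_vector(feature_values, spam_probability, default=0.5):
    """Return the probabilities G1, G2, ... in the order of T1, T2, ...

    feature_values   -- feature values in the order decomposed from the picture
    spam_probability -- table mapping a feature value to its rate of
                        occurrence in spam (learned beforehand)
    default          -- assumed value for feature values missing from the table
    """
    return [spam_probability.get(t, default) for t in feature_values]


# Invented rates of occurrence, for illustration only.
table = {"T1": 0.9, "T2": 0.2, "T3": 0.7, "T4": 0.6}
print(build_feature_vector(["T1", "T2", "T3", "T4"], table))
# [0.9, 0.2, 0.7, 0.6]
```

Because the order is preserved, the same picture always produces the same vector in learning and in evaluation, which is what allows the learned parameters to be applied position by position.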

The Bayes algorithm and the SVM algorithm compare as follows. In short, when learning from normal and spam samples, the Bayes method generates the probability that each feature belongs to spam, whereas the SVM method generates both the probability that each feature belongs to spam and the parameters of the SVM formula. When judging an unknown sample, the Bayes method takes the feature items of the unknown sample as input, obtains the probability that each feature item belongs to spam by table lookup, and then calculates the probability that the mail is spam with the Bayes formula. The SVM method likewise takes the feature items of the unknown sample as input and obtains the probability that each feature item belongs to spam by table lookup, but then calculates the probability that the mail is spam with the SVM formula generated by the learning process.
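For the Bayes branch of this comparison, the step from per-feature probabilities to a mail-level probability can be sketched as below. The combining rule shown, P = (p1*p2*...*pn) / (p1*p2*...*pn + (1-p1)*(1-p2)*...*(1-pn)), is one rule widely used in Bayesian spam filters; it is assumed here for illustration, and the Bayesian formula actually used by the embodiment may differ. The function name `combine_probabilities` and the clamping bound `eps` are hypothetical.

```python
import math


def combine_probabilities(probs, eps=0.01):
    """Combine per-feature spam probabilities into one spam probability.

    Each probability is clamped to [eps, 1 - eps] so that a single 0 or 1
    cannot force the result or cause a division by zero.
    """
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    spam = math.prod(clamped)                  # evidence for spam
    ham = math.prod(1.0 - p for p in clamped)  # evidence for normal mail
    return spam / (spam + ham)


print(combine_probabilities([0.5, 0.5]))  # 0.5
```

The SVM branch starts from the same looked-up probabilities but feeds them, in order, into the learned linear formula instead of a combining rule.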

Referring to Figure 3, there is shown a flow diagram of a second embodiment of a method of identifying picture spam provided by the present invention. In the second embodiment, the support vector machine (SVM) formula is applied to calculate the probability that the picture is spam. The method includes the following steps:

S201: Extract feature values of the picture according to a compression rate distribution characteristic of the picture in the mail. Step S201 is the same as step S101 of the first embodiment, and details are not described herein again.

S202: Apply a support vector machine formula to calculate the probability that the picture is spam, according to the probability that each feature value of the picture appears in the garbage picture;

Step S202 specifically includes: querying a sample database according to the feature values of the picture, to obtain the probability that each feature value of the picture appears in the garbage picture; and constructing a feature vector from the probabilities that the feature values of the picture appear in the garbage picture and substituting it into the support vector machine formula to obtain a second probability; the second probability is the probability that the picture is spam.

The sample database stores all feature values of the garbage image sample and the normal image sample, and the probability that each feature value appears in the garbage picture.

S203: Apply a hash algorithm to calculate a hash value of the picture, compare the hash value with a hash value of the received mail picture, and obtain the number of times the picture is repeatedly sent.

S204: Query a reputation value database according to the sending IP of the picture, and obtain a reputation value of the sending IP.

S205: Query a preset weight value list according to the probability that the picture is spam, the number of times of being repeatedly sent, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum. Steps S203 to S205 are identical to steps S103 to S105 of the first embodiment, and are not described herein again.

Referring to Figure 4, there is shown a flow diagram of a third embodiment of a method of identifying picture spam provided by the present invention. In the third embodiment, the Bayes formula and the SVM formula are simultaneously applied to calculate the probability that the picture is spam. The method includes the following steps:

S301: Extract feature values of the picture according to a compression rate distribution characteristic of the picture in the mail. Step S301 is the same as step S101 of the first embodiment, and details are not described herein again.

S302: Query a sample database according to the feature values of the picture, and obtain the probability that each feature value of the picture appears in the garbage picture;

S303: Substitute the probability that each feature value of the picture appears in the garbage picture into a Bayesian formula to obtain a first probability;

Step S303 is the same as step S102 of the first embodiment, and details are not described herein again.

S304: Construct a feature vector from the probabilities that the feature values of the picture appear in the garbage picture, and substitute it into a support vector machine formula for calculation, to obtain a second probability;

The probability that the picture is spam includes the first probability and the second probability.

S305: Apply a hash algorithm to calculate a hash value of the picture, compare the hash value with the hash values of received mail pictures, and obtain the number of times the picture is repeatedly sent.

Step S305 is the same as step S103 of the first embodiment, and details are not described herein again.

S306: Query a reputation value database according to the sending IP of the picture, and obtain a reputation value of the sending IP.

Step S306 is identical to step S104 of the first embodiment described above, and details are not described herein again.

S307: Query a preset weight value list according to the probability that the picture is spam, the number of times of being repeatedly sent, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum.

Step S307 is substantially the same as step S105 of the first embodiment described above, except that the probability that the picture is spam includes a first probability and a second probability, each of which has its own entries in the weight value list. Therefore, when the preset weight value list is queried, four weight values are obtained: the weight value corresponding to the "first probability", the weight value corresponding to the "second probability", the weight value corresponding to the "number of times of repeated transmission", and the weight value corresponding to the "reputation value of the sending IP". The four weight values are added to obtain the weight sum of the picture, and whether the picture is spam is determined according to the weight sum.

The method for identifying image spam provided by the embodiment of the present invention extracts the feature values of the picture in the mail based on the compression ratio distribution characteristic of the picture, and calculates the probability that the picture is spam by applying a probability statistical formula; it then calculates the weight sum of the picture from the weight values corresponding to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sending IP, and determines whether the picture is spam based on the weight sum. The present invention recognizes picture spam based on the compression ratio distribution of the picture, is highly efficient, and is capable of recognizing a picture that is distorted or whose background contains noise information. In addition, the present invention applies a hash algorithm to determine the similarity of pictures and counts the number of times similar pictures are repeatedly sent; this feature indicates well whether the behavior of the sender resembles the sending behavior of spam, thereby improving the accuracy of identifying picture spam.

Correspondingly, the embodiment of the present invention further provides a mail system, which can implement all the steps of the method for identifying picture spam in the above embodiment.

FIG. 5 is a schematic structural diagram of a mail system according to an embodiment of the present invention. The mail system includes:

The picture feature extraction module 1 is configured to extract feature values of the picture according to a compression rate distribution characteristic of a picture in the mail;

The spam probability acquisition module 2 is configured to calculate the probability that the picture is spam by applying a probability statistical formula, according to the probability that each feature value of the picture appears in the garbage picture;

The picture sending times obtaining module 3 is configured to apply a hash algorithm to calculate a hash value of the picture, compare the hash value with the hash values of received mail pictures, and obtain the number of times the picture is repeatedly sent;

The reputation value obtaining module 4 is configured to query a reputation value database according to the sending IP of the mail, and obtain a reputation value of the sending IP;

The spam determination module 5 is configured to query a preset weight value list according to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum.

As shown in FIG. 6, the picture feature extraction module 1 specifically includes:

The image scanning unit 11 is configured to scan a picture in the mail to obtain a compression ratio of each sub-block of the picture;

The picture feature generating unit 12 is configured to combine the compression ratios of every N consecutive sub-blocks into a new compression rate change element, and to combine each compression rate change element with its position code in the picture in which it is located, to obtain the feature values of the picture; wherein N is a natural number greater than 1.
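The work of units 11 and 12 can be sketched as follows. This is only an illustration under several assumptions that this passage does not fix: each sub-block's compression ratio is taken to be already quantized to a small integer, the windows of N consecutive sub-blocks are taken to overlap, the position code is taken to be the index of the window's first sub-block, and a feature value is written as a "position:element" string. The function name `extract_features` is hypothetical.

```python
def extract_features(ratios, n=2):
    """Build feature values from per-sub-block compression ratios.

    ratios -- quantized compression ratio of each sub-block, in scan order
    n      -- number of consecutive sub-blocks per change element (N > 1)
    """
    if n < 2:
        raise ValueError("N must be a natural number greater than 1")
    features = []
    for position in range(len(ratios) - n + 1):
        # Compression rate change element: N consecutive ratios joined together.
        element = "-".join(str(r) for r in ratios[position:position + n])
        # Feature value: the element combined with its position code.
        features.append(f"{position}:{element}")
    return features


print(extract_features([3, 3, 7, 2], n=2))
# ['0:3-3', '1:3-7', '2:7-2']
```

Because each feature value records how the compression ratio changes from block to block rather than any pixel content, features of this kind do not depend on recognizing the text inside the picture.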

As shown in FIG. 7, the spam probability acquisition module 2 specifically includes:

The probability query unit 21 is configured to query the sample database according to the feature value of the picture, and obtain a probability that each feature value of the picture appears in the garbage picture;

The Bayesian calculation unit 22 is configured to substitute the probability that each feature value of the picture appears in the garbage picture into a Bayesian formula to obtain a first probability;

The support vector machine calculation unit 23 is configured to construct a feature vector from the probabilities that the feature values of the picture appear in the garbage picture, and to substitute it into the support vector machine formula for calculation to obtain a second probability; the probability that the picture is spam is the first probability and/or the second probability.

As shown in FIG. 8, the picture sending times obtaining module 3 specifically includes:

The hash value calculation unit 31 is configured to process the feature value of the picture by using a hash algorithm to obtain a hash value of the picture;

The similarity determining unit 32 is configured to compare the hash value of the picture with the hash value of the received mail picture to obtain a similarity between the picture and the received mail picture;

The repeated transmission number determining unit 33 is configured to obtain the number of times the picture is repeatedly transmitted according to the similarity between the picture and the received mail picture.
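The cooperation of these three units can be sketched as below. The sketch makes simplifying assumptions: SHA-256 from the Python standard library stands in for the unspecified hash algorithm, two pictures count as similar only when their hash values are identical (the embodiment may use a looser similarity measure), and an in-memory `Counter` stands in for the store of hash values of received mail pictures. The names `picture_hash` and `record_and_count` are hypothetical.

```python
import hashlib
from collections import Counter


def picture_hash(feature_values):
    """Hash the ordered feature values of a picture into one hex digest."""
    joined = "\n".join(feature_values).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def record_and_count(feature_values, seen):
    """Return how many times an identical picture was received before,
    then record the current picture in `seen`."""
    digest = picture_hash(feature_values)
    earlier = seen[digest]  # Counter returns 0 for an unseen hash
    seen[digest] += 1
    return earlier


seen = Counter()
print(record_and_count(["0:3-3", "1:3-7"], seen))  # 0
print(record_and_count(["0:3-3", "1:3-7"], seen))  # 1
```

Hashing the feature values rather than the raw file bytes means that two files with different bytes but the same extracted features are still counted as the same picture.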

As shown in FIG. 9, the spam determination module 5 specifically includes:

The weight query unit 51 is configured to query the preset weight value list according to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sending IP, and obtain the weight values of the three;

The mail identifying unit 52 is configured to add the weight values of the three to obtain the weight sum of the picture, and to determine whether the weight sum of the picture is greater than a predetermined threshold; if yes, the picture is determined to be spam; if not, the picture is determined to be normal mail.

Further, as shown in FIG. 5, the mail system further includes:

The sample database 6 is used to save all the feature values of the garbage picture sample and the normal picture sample, and the probability that each feature value appears in the garbage picture;

The reputation value database 7 is used to store the reputation value of the outgoing IP; the reputation value is the proportion of the normal mail sent by the outgoing IP in all of its sent mails;

The reputation value update module 8 is configured to: after the spam determination module determines that the picture is spam, recalculate the reputation value of the sending IP of the picture, and update the corresponding reputation value in the reputation value database. It should be noted that, in the mail system provided by the embodiment of the present invention, the process of identifying image spam is the same as that in the foregoing method embodiments, and details are not described herein again.

The mail system provided by the embodiment of the present invention extracts the feature values of the picture in the mail based on the compression ratio distribution characteristic of the picture, and calculates the probability that the picture is spam by applying a probability statistical formula; it then calculates the weight sum of the picture from the weight values corresponding to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sending IP, and determines whether the picture is spam based on the weight sum. The present invention recognizes picture spam based on the compression ratio distribution of pictures, is highly efficient, and is capable of recognizing a picture that is distorted or whose background contains noise information. In addition, the present invention applies a hash algorithm to determine the similarity of pictures and counts the number of times similar pictures are repeatedly sent; this feature indicates whether the behavior of the sender resembles the sending behavior of spam, thereby improving the accuracy of identifying picture spam.

A person skilled in the art can understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing related hardware, and that the program can be stored in a computer readable storage medium. When executed, the program may include the flows of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

The above is a preferred embodiment of the present invention. It should be noted that those skilled in the art can also make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements are also considered to fall within the scope of protection of the present invention.

Claims

1. A method for identifying image spam, comprising:

calculating the probability that the picture is spam by applying a probability statistical formula, according to the probability that each feature value of the picture appears in the garbage picture;

2. The method for identifying picture spam according to claim 1, wherein the extracting the feature values of the picture according to a compression rate distribution characteristic of the picture in the mail comprises:

Scanning the picture in the mail to obtain the compression ratio of each sub-block of the picture; combining the compression ratios of every N consecutive sub-blocks into a new compression rate change element, where N is a natural number greater than 1;

Each of the compression rate change elements is combined with the position code in the picture in which it is located to obtain the feature values of the picture.

3. The method for identifying picture spam according to claim 2, wherein the probability statistical formula is a Bayesian formula;

Then, the calculating the probability that the picture is spam by applying the probability statistical formula, according to the probability that each feature value of the picture appears in the garbage picture, specifically includes:

Querying the sample database according to the feature values of the picture, to obtain the probability that each feature value of the picture appears in the garbage picture; wherein the sample database stores all feature values of the garbage picture samples and the normal picture samples, and the probability that each feature value appears in the garbage picture;

Substituting a probability that each feature value of the picture appears in the garbage picture into a Bayesian formula to obtain a first probability;

The probability that the picture is a spam message is the first probability.

4. The method for identifying picture spam according to claim 2, wherein the probability statistical formula is a support vector machine formula;

Querying the sample database according to the feature values of the picture, to obtain the probability that each feature value of the picture appears in the garbage picture; wherein the sample database stores all feature values of the garbage picture samples and the normal picture samples, and the probability that each feature value appears in the garbage picture;

Constructing a feature vector from the probabilities that the feature values of the picture appear in the garbage picture, and substituting it into the support vector machine formula for calculation, to obtain a second probability;

The probability that the picture is spam is the second probability.

5. The method for identifying picture spam according to claim 2, wherein the probability statistical formula comprises a Bayesian formula and a support vector machine formula;

Substituting the probability of occurrence of each feature value of the picture in the junk picture into a Bayesian formula to obtain a first probability;

6. The method for identifying picture spam according to any one of claims 3 to 5, wherein the applying a hash algorithm to calculate a hash value of the picture, comparing the hash value with the hash values of received mail pictures, and obtaining the number of times the picture is repeatedly sent specifically includes:

Applying a hash algorithm to process the feature values of the picture to obtain a hash value of the picture; comparing the hash value of the picture with the hash value of the received mail picture to obtain the similarity between the picture and the received mail picture;

Based on the similarity between the picture and the received mail picture, the number of times the picture is repeatedly transmitted is obtained.

7. The method for identifying picture spam according to claim 6, wherein the querying the preset weight value list according to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sending IP, calculating a weight sum of the picture, and determining whether the picture is spam according to the weight sum specifically includes:

According to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sent IP, the preset weight value list is searched, and the weight values of the three are respectively obtained;

Adding the weight values of the three to obtain the weight of the picture;

Determining whether the weight of the picture is greater than a predetermined threshold, and if so, determining that the picture is spam; if not, determining that the picture is a normal mail.

8. The method for identifying picture spam according to claim 7, wherein the reputation value database stores the reputation value of the sending IP, and the reputation value is the proportion of normal mail among all mail sent by the sending IP;

After determining that the picture is spam, the method further includes:

Recalculate the reputation value of the outgoing IP of the picture and update the corresponding reputation value in the reputation value database.

9. A mail system, comprising:

a picture sending times obtaining module, configured to apply a hash algorithm to calculate a hash value of the picture, compare the hash value with the hash values of received mail pictures, and obtain the number of times the picture is repeatedly sent;

a reputation value obtaining module, configured to query a reputation value database according to the sending IP of the mail, to obtain a reputation value of the sending IP;

a spam determination module, configured to query a preset weight value list according to the probability that the picture is spam, the number of times of being repeatedly sent, and the reputation value of the sending IP, calculate a weight sum of the picture, and determine whether the picture is spam according to the weight sum.

10. The mail system according to claim 9, wherein the picture feature extraction module specifically includes:

a picture scanning unit, configured to scan a picture in the mail to obtain a compression ratio of each sub-block of the picture;

a picture feature generating unit, configured to combine the compression ratios of every N consecutive sub-blocks into a new compression rate change element, and to combine each compression rate change element with its position code in the picture in which it is located, to obtain the feature values of the picture; where N is a natural number greater than 1.

11. The mail system according to claim 10, wherein the spam probability acquisition module comprises:

a probability query unit, configured to query a sample database according to the feature values of the picture, and obtain the probability that each feature value of the picture appears in the garbage picture;

a Bayesian calculation unit, configured to substitute the probability that each feature value of the picture appears in the garbage picture into a Bayesian formula to obtain a first probability;

a support vector machine calculation unit, configured to construct a feature vector from the probabilities that the feature values of the picture appear in the garbage picture, and to substitute it into the support vector machine formula for calculation to obtain a second probability; the probability that the picture is spam is the first probability and/or the second probability.

12. The mail system according to claim 11, wherein the picture sending times obtaining module specifically includes:

a hash value calculation unit, configured to apply a hash algorithm to process the feature values of the picture to obtain a hash value of the picture;

a similarity determining unit, configured to compare a hash value of the picture with a hash value of the received mail picture to obtain a similarity between the picture and the received mail picture;

The repeated transmission number determining unit is configured to obtain, according to the similarity between the picture and the received mail picture, the number of times the picture is repeatedly sent.

13. The mail system according to claim 12, wherein the spam determination module comprises:

The weight query unit is configured to query the preset weight value list according to the probability that the picture is spam, the number of times of repeated transmission, and the reputation value of the sent IP, and obtain the weight values of the three;

a mail identifying unit, configured to add the weight values of the three to obtain a weight sum of the picture; determine whether the weight of the picture is greater than a predetermined threshold, and if yes, determine that the picture is spam; If not, it is determined that the picture is a normal mail.

14. The mail system according to claim 13, wherein the mail system further comprises: a sample database, configured to save all feature values of the garbage picture samples and the normal picture samples, and the probability that each feature value appears in the garbage picture;