CN101819637A

CN101819637A - Method for detecting image-based spam by utilizing image local invariant feature

Info

Publication number: CN101819637A
Application number: CN 201010139946
Authority: CN
Inventors: 张卫丰; 杨波; 周国强; 张迎周; 陆柳敏; 许碧娣; 王慕妮; 王宗辉; 韩蕊; 陆柳青
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2010-04-02
Filing date: 2010-04-02
Publication date: 2010-09-01
Anticipated expiration: 2030-04-02
Also published as: CN101819637B

Abstract

The invention relates to a method for detecting image-based spam using local invariant features of images. The method extracts the invariant region features of the spam information in an image with the Speeded-Up Robust Features (SURF) algorithm, generates a feature vector for the image, estimates the parameters of a Gaussian mixture model by maximum likelihood estimation, and thereby trains a classifier based on the Gaussian mixture model. Experiments show that the method can improve the recall rate of spam detection and save computation time and space. The implementation comprises three modules: image feature extraction, Gaussian mixture model parameter estimation, and image-based spam detection.

Description

Method for detecting image spam email using local invariant features of images

Technical field

The invention is an implementation scheme that uses the local invariant features of spam images to train a Gaussian mixture model and detect image spam email. It mainly addresses the low detection efficiency and low recall rate of current techniques on image-type spam, and belongs to the fields of data mining and machine learning.

Background technology

Email has become an important channel for communication over the Internet, but the large commercial, economic and political interests involved have caused the volume of spam to grow sharply. The image spam that first became prevalent embedded junk information such as advertisements in images in the form of text. Aradhye et al. used text and color features mined from the images to classify such mail [1]. In 2006, Fumera et al. proposed using OCR (optical character recognition) to detect the text in image spam, which achieved better detection than other filtering systems [2]. Meanwhile, spammers kept improving their ability to evade spam detection systems: they obfuscated the images carrying advertisements and other junk information, which left OCR-based techniques ineffective. Dredze et al. proposed classifying images by their high-level features, such as file format, size and color distribution [3]. These features are faster to compute than low-level features such as edges, extend well, and can be combined effectively with image filters based on low-level features.

In 2007, Fumera et al. proposed computing the perimetric complexity of an image to determine whether it has been obfuscated [4]. The degree of obfuscation of an image can be measured by its perimetric complexity, computed as the ratio of the squared perimeter of the text region to the area of the text region. The perimetric complexity of the text makes it possible to identify broken characters or noise objects. Because an obfuscated image is not necessarily one that carries junk information, this technique can only serve as a preprocessing module in a spam filtering system. Wang et al. proposed an image spam filtering method based on comparing the similarity between images [5]. It combines three kinds of image spam filters (a color histogram filter, a Haar wavelet filter and an orientation histogram filter). Their experiments show that, when each filter is run on its own, the wavelet filter achieves the best detection rate, with a false detection rate (spam images identified as normal images) below 0.0009%, and that the three filters combined reach an accuracy of 96%. The method works by combining existing filtering systems, so it can be seen as a summary of one stage of spam filtering technology, and using it improves the performance of image spam filtering systems. In 2008, Mehta et al. targeted spam generated in large quantities from templates, exploiting its near-duplicate similarity; classification with an SVM reached an accuracy of 98%. They also proposed an algorithm that clusters images with a GMM [6]: each image is scaled down to 100 × 100 pixels, the texture, shape and color features of each pixel are extracted, a GMM is trained for each image, the images are clustered by their similarity, and spam images are identified using a computed threshold. Although this method is accurate, its computational cost is too large and its time complexity is high, which hinders practical application. Subsequently, Zuo et al. proposed using an SVM classifier with the pyramid match kernel (PMK) to classify the local invariant features of email images [7]. This method mainly targets spam that changes the overall layout of the image, while leaving certain marks in the image unchanged, in order to evade filters based on image template similarity. It therefore makes up, to some extent, for the weakness of similarity detection.

[1] Hrishikesh Aradhye, Gregory Myers, and James Herson. Image analysis for efficient categorization of image-based spam e-mail. In Proceedings of Eighth International Conference on Document Analysis and Recognition, ICDAR 2005, volume 2, pages 914-918. IEEE Computer Society, 2005.

[2] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research, (7): 2699-2720, 2006.

[3] Mark Dredze, Reuven Gevaryahu, and Ari Elias-Bachrach. Learning fast classifiers for image spam. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.

[4] Giorgio Fumera, Ignazio Pillai, Fabio Roli, and Battista Biggio. Image spam filtering using textual and visual information. MIT Spam Conference 2007, Cambridge, USA, March 2007.

[5] Zhe Wang, William Josephson, Qin Lv, Moses Charikar, and Kai Li. Filtering image spam with near-duplicate detection. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.

[6] Mehta, B., Nangia, S., Gupta, M., and Nejdl, W. Detecting image spam using visual features and near duplicate detection. In Proceeding of the 17th International Conference on World Wide Web (Beijing, China, April 21-25, 2008). WWW '08. ACM, New York, NY, 497-506.

[7] Haiqiang Zuo, Weiming Hu, Ou Wu, Yunfei Chen, Guan Luo. Detecting Image Spam Using Local Invariant Features and Pyramid Match Kernel. Proceedings of the 18th International Conference on World Wide Web, 2009, pages 1187-1188.

Summary of the invention

Technical problem: the purpose of the invention is to provide a method for detecting image spam email using local invariant features of images. The method uses the local invariant regions present in spam images to train a classifier based on a Gaussian mixture model, and then classifies the images under test in order to detect image-type spam.

Technical solution: the method proposed by the invention uses the SURF (Speeded-Up Robust Features) algorithm to extract the invariant region features of the spam information in an image and thereby generates the feature vector of the image. The parameters of the Gaussian mixture model are estimated with the maximum likelihood estimation algorithm, which yields a classifier based on the Gaussian mixture model. The overall method for detecting image spam email comprises three modules: image feature extraction, Gaussian mixture model parameter estimation, and image-type email detection. The module composition of the system is shown in Fig. 1.

The implementation of the Gaussian mixture model classifier based on local invariant regions of images comprises two stages, a training stage and a test stage, with the following steps:

The steps of the training stage are:

One, first train on the sample set:

Step 1) Label the image data set to be used for training, dividing it into spam images and normal images;

Step 2) Use the SURF (Speeded-Up Robust Features) algorithm to extract the local invariant feature descriptors of each spam image and each normal image; each local invariant feature descriptor is a vector;

Step 3) Use the k-means clustering algorithm to cluster the local invariant feature descriptors of all spam images and normal images in the training set, obtaining a number of cluster centers; taking these cluster centers as reference points, project the local invariant feature descriptors of each image onto them, so that each image is normalized to a vector of fixed dimension;

Step 4) Taking the vectors corresponding to the normal images and the spam images in the training set as training samples for the Gaussian mixture models, use maximum likelihood estimation to estimate the parameters of the Gaussian mixture model of the spam image set and of the normal image set respectively;

Step 5) From the parameters of the Gaussian mixture models of the spam image set and the normal image set obtained by maximum likelihood estimation, determine the distribution functions of the multivariate Gaussian mixture models of the normal image set and the spam image set.
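Step 3) of the training stage can be sketched as follows. This is a minimal pure-Python sketch, assuming the clustering is standard k-means and that "projecting" the descriptors onto the reference points means assigning each descriptor to its nearest cluster center and counting the assignments (a bag-of-visual-words histogram); the text does not spell this out. The function names `kmeans` and `image_vector` are illustrative only and do not come from the patent.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, d):
    """Index of the cluster center closest to descriptor d."""
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k], d))


def kmeans(descriptors, k, iters=20, seed=0):
    """Plain k-means over all training descriptors; returns k centers."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for j, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def image_vector(descriptors, centers):
    """Normalize one image to a fixed-length vector (assumed histogram)."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1.0
    total = sum(hist) or 1.0  # guard against an image with no descriptors
    return [h / total for h in hist]
```

Whatever number of descriptors an image yields, `image_vector` returns a vector whose length equals the number of cluster centers, which is what lets images with different descriptor counts share one Gaussian mixture model.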

Two, then carry out the detection process:

Step 21) For an image to be detected, use the SURF algorithm to extract its local invariant feature descriptors;

Step 22) Taking the cluster centers from step 3) as reference points, normalize the local invariant feature descriptors from step 21) to obtain the vector of the image to be detected;

Step 23) Substitute the vector of the image under test into the distribution functions, computing the value of the distribution function of the normal-image Gaussian mixture model and that of the spam-image Gaussian mixture model;

Step 24) Classify according to the distribution function values obtained in step 23): the image is assigned to the class whose value is larger.
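The distribution function evaluated in step 23) is not written out at this point. As a hedged sketch, assuming each class model is a mixture of K multivariate normal components with weights \pi_k, means \mu_k and covariances \Sigma_k over D-dimensional image vectors p (symbols introduced here for illustration), the value computed for each class would be:

```latex
F(p) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(p \mid \mu_k, \Sigma_k), \qquad
\mathcal{N}(p \mid \mu_k, \Sigma_k) =
  \frac{1}{(2\pi)^{D/2} \lvert \Sigma_k \rvert^{1/2}}
  \exp\left( -\tfrac{1}{2} (p - \mu_k)^{T} \Sigma_k^{-1} (p - \mu_k) \right)
```

Here the \pi inside (2\pi)^{D/2} is the mathematical constant, not a mixture weight. Step 24) then compares the value of F(p) under the spam model with its value under the normal-image model.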

Beneficial effects: the method uses the SURF algorithm to extract the invariant region features of the spam information in images and trains Gaussian mixture models to detect spam. Using the method of the invention can improve the precision and recall rate of spam detection and save program computation time and space.

Description of drawings

Fig. 1 shows the prototype of the classifier based on the Gaussian mixture model;

Fig. 2 shows the flow chart of the classifier based on the Gaussian mixture model.

Embodiment

Image spam detection based on local invariant features of images is implemented with VC++ 6.0 as the development tool, using the open-source library OpenCV 1.0 for image feature processing. The detailed steps are as follows:

One, the training stage: obtain spam images and normal images to form the training set.

Step 1) Label the images in the training data set, denoting the spam images (image spam) by I_i and the normal images (image ham) by J_i, where i ∈ {1, 2, ..., N};

Step 2) Use the SURF (Speeded-Up Robust Features) algorithm to extract the local invariant feature descriptors of every image in I_i and J_i, where each descriptor of an image is an L-dimensional vector (L = 64);

Step 3) Use the k-means clustering algorithm to cluster the 64-dimensional local invariant feature descriptors of all spam images and normal images in the training set, obtaining 200 cluster centers. Taking these 200 cluster centers as reference points, project the local invariant feature descriptors of each image onto them, so that each image is normalized to a 200-dimensional vector;

Step 4) Step 3) yields the spam image feature vector library F_{Spam} = {F_{Spam(1)}, F_{Spam(2)}, ..., F_{Spam(N)}} and the normal image feature vector library F_{Ham} = {F_{Ham(1)}, F_{Ham(2)}, ..., F_{Ham(N)}};

Step 5) Taking the feature library F_{Spam} as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image spam, θ_{Spam} = (π_1, π_2, ..., π_L; μ_1, μ_2, ..., μ_L; Σ_1, Σ_2, ..., Σ_L);

The EM algorithm, E step:

w_{ij} = \frac{\pi_j f(p_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{L} \pi_k f(p_i \mid \mu_k, \Sigma_k)}

M step:

\hat{\pi}_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} w_{ij}

\hat{\mu}_j \leftarrow \frac{\sum_{i=1}^{n} w_{ij} p_i}{\sum_{i=1}^{n} w_{ij}}

\hat{\Sigma}_j \leftarrow \frac{\sum_{i=1}^{n} w_{ij} (p_i - \hat{\mu}_j)(p_i - \hat{\mu}_j)^{T}}{\sum_{i=1}^{n} w_{ij}}

where p_i is a training sample, n is the number of training samples, π is the weight of the corresponding component in the Gaussian mixture model, μ is the mean of the distribution, Σ is the covariance, and L is the number of mixture components;

Step 6) Obtain the Gaussian mixture distribution function of the local invariant features of image spam:

Classifier(\theta_{spam}) = \sum_{k=1}^{L} \pi_k f(p \mid \mu_k, \Sigma_k);

Step 7) Taking the feature library F_{Ham} as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image ham, in the same way as step 5);

Step 8) Obtain the distribution function of the local invariant features of image ham, where K is the number of mixture components of the ham model:

Classifier(\theta_{ham}) = \sum_{k=1}^{K} \pi_k f(p \mid \mu_k, \Sigma_k).
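The E step and M step used in steps 5) and 7) can be sketched in a few lines. This is a minimal illustration under a stated simplification: the samples are one-dimensional, so the covariance Σ_j reduces to a scalar variance and the outer product becomes a square. The embodiment itself works on 200-dimensional vectors. The names `gauss`, `em_step` and `fit_gmm`, the variance floor and the fixed iteration count are assumptions made for this sketch, not part of the patent.

```python
import math


def gauss(x, mu, var):
    """One-dimensional normal density f(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(samples, pis, mus, variances):
    """One EM iteration for a one-dimensional Gaussian mixture."""
    n, L = len(samples), len(pis)
    # E step: w[i][j] is the responsibility of component j for sample i.
    w = []
    for p in samples:
        num = [pis[j] * gauss(p, mus[j], variances[j]) for j in range(L)]
        total = sum(num)
        w.append([v / total for v in num])
    # M step: re-estimate weight, mean and variance of each component.
    new_pis, new_mus, new_vars = [], [], []
    for j in range(L):
        wj = sum(w[i][j] for i in range(n))
        mu = sum(w[i][j] * samples[i] for i in range(n)) / wj
        var = sum(w[i][j] * (samples[i] - mu) ** 2 for i in range(n)) / wj
        new_pis.append(wj / n)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # floor (assumed) avoids a zero variance
    return new_pis, new_mus, new_vars


def fit_gmm(samples, pis, mus, variances, iters=50):
    """Run a fixed number of EM iterations from the given starting values."""
    for _ in range(iters):
        pis, mus, variances = em_step(samples, pis, mus, variances)
    return pis, mus, variances
```

In this reading, the same routine would be run twice, once on the spam feature library and once on the ham feature library, giving two independent parameter sets.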

Two, the steps of the detection stage are:

Step 1) Let the set of images to be detected be T_j, j ∈ {1, 2, ..., M}, where M is the number of images to be detected;

Step 2) Use the SURF algorithm to extract the local invariant feature descriptors of every image in T_j, in the same way as step 2) of the training stage;

Step 3) Using the K-means algorithm, cluster the extracted local invariant feature descriptors into a feature vector of length SIZE, in the same way as step 3) of the training stage;

Step 4) The clustered feature vectors form the feature database F_{Test} = {F_{Test(1)}, F_{Test(2)}, ..., F_{Test(N)}};

Step 5) Taking the feature vector library F_{Test} as sample values, compute the distances d_j^{Ham} and d_j^{Spam} to the Gaussian mixture distribution functions Classifier(θ_{Ham}) and Classifier(θ_{Spam}) respectively, where d_j^{Ham} denotes the distance between image j and the ham Gaussian mixture distribution function, and d_j^{Spam} denotes the distance between image j and the spam Gaussian mixture distribution function;

Step 6) If d_j^{Ham} is greater than d_j^{Spam}, the j-th image in T_j is classified as a spam image; otherwise it is classified as a ham image;

Step 7) Repeat steps 2) to 6) until every image in the set to be detected has been examined.
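The decision in steps 5) and 6) can be sketched as below. The text does not define the distance d, so this sketch assumes it is the negative logarithm of the mixture density, and it further assumes diagonal covariances to stay within the standard library. Under this reading, a larger distance to the ham model corresponds to a smaller ham density value, which agrees with the rule of assigning the image to the class whose distribution function value is larger. All function names and the model layout (a list of `(weight, mean, variances)` tuples) are hypothetical.

```python
import math


def log_gauss_diag(p, mu, var):
    """Log density of a normal distribution with diagonal covariance."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        for x, m, v in zip(p, mu, var)
    )


def mixture_log_density(p, model):
    """Log of sum_k pi_k f(p | mu_k, Sigma_k), computed via log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(p, mu, var) for w, mu, var in model]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def classify(p, spam_model, ham_model):
    """Assumed distance: negative log mixture density of the image vector."""
    d_spam = -mixture_log_density(p, spam_model)
    d_ham = -mixture_log_density(p, ham_model)
    return "spam" if d_ham > d_spam else "ham"
```

Working with log densities is a design choice of the sketch: with 200-dimensional vectors the raw density values can underflow, while their logarithms remain comparable.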

Claims

1. A method for detecting image spam email using local invariant features of images, characterized in that the method mainly comprises the following steps:

One, first train on the sample set:

Step 2) Use the SURF (Speeded-Up Robust Features) algorithm to extract the local invariant feature descriptors of each spam image and each normal image; each local invariant feature descriptor is a vector;

Two, then carry out the detection process: