CN101819637A - Method for detecting image-based spam by utilizing image local invariant feature - Google Patents

Info

Publication number
CN101819637A
Authority
CN
China
Prior art keywords
picture
image
spam
invariant feature
local invariant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010139946
Other languages
Chinese (zh)
Other versions
CN101819637B (en)
Inventor
张卫丰
杨波
周国强
张迎周
陆柳敏
许碧娣
王慕妮
王宗辉
韩蕊
陆柳青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN2010101399468A
Publication of CN101819637A
Application granted
Publication of CN101819637B
Legal status: Expired - Fee Related

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a method for detecting image-based spam by utilizing image local invariant features. The method extracts the invariant region features of the spam information in an image with the Speeded-Up Robust Features (SURF) algorithm, generates a feature vector for the image, and estimates the parameters of a Gaussian mixture model by maximum likelihood to obtain a classifier based on the Gaussian mixture model. Experiments show that the method can improve the recall rate of spam detection and save computation time and space. The implementation comprises three modules: extraction of image features, estimation of the Gaussian mixture model parameters, and detection of image-based spam.

Description

Method for detecting image-based spam by utilizing image local invariant features
Technical field
The present invention is a scheme that trains Gaussian mixture models on the local invariant features of spam images in order to detect image-based spam. It mainly addresses the low detection efficiency and low recall rate of current techniques on image-type spam, and belongs to the field of data mining and machine learning.
Background art
Email has become an important channel for people to communicate over the Internet, but huge commercial, economic and political interests have caused the volume of spam to swell sharply. The image spam that first became prevalent embedded junk information such as advertisements in images in the form of text; Hrishikesh et al. classified mail using text and color features mined from the images [1]. In 2006 Fumera et al. proposed using OCR (optical character recognition) to detect the text in image-based spam, which achieved better detection results than other filtering systems [2]. At the same time, spammers kept strengthening the ability of their spam to evade detection systems: they obfuscated the images that carry advertisements and other junk information, which left OCR techniques powerless. Dredze et al. proposed classifying images by their high-level features, such as file format, size and color distribution [3]. The advantage is that these features are faster to compute than low-level features such as edges, scale well, and can be combined effectively with image filters based on low-level features.
In 2007 Fumera proposed a method that computes the perimetric complexity of an image to decide whether the image has been obfuscated [4]. The degree of obfuscation of an image can be measured by its perimetric complexity, which is computed as the ratio of the square of the text region's perimeter to the text region's area. Perimetric complexity can reveal broken characters or noise objects. Because an obfuscated image cannot be confirmed to be an image that carries junk information, this technique can serve only as a preprocessing module within a spam filtering system. Wang et al. proposed filtering image-based spam by comparing the similarity between images [5], combining three image spam filters (a color histogram filter, a Haar wavelet filter and an orientation histogram filter). Their experiments showed that when each filter ran alone, the wavelet filter achieved the best detection rate, with a false detection rate (a spam image identified as a normal image) below 0.0009%, and that the three filters combined reached an accuracy of 96%. The method is realized by combining existing filtering systems; it can be regarded as a summary of the spam filtering technology of one stage, and using it improves the performance of image spam filtering systems. In 2008 Mehta et al. targeted spam generated in bulk from templates, exploiting the near-duplicate similarity of such spam; classification with an SVM reached an accuracy of 98%. They also proposed an algorithm that clusters images with a GMM [6]: each image is scaled down to 100 × 100 pixels, the texture, shape and color features of each pixel are extracted, a GMM is trained for each image, images are clustered by computing how close they are to one another, and spam images are identified with a threshold. Although this method is accurate, its computational cost and time complexity are too high for practical use. Zuo et al. subsequently proposed classifying the local invariant features of email images with an SVM classifier whose kernel function is the PMK (pyramid match kernel) [7]. This method mainly targets spam that changes the overall layout of an image in order to evade filters based on image template similarity, but leaves certain marks in the image unchanged. It therefore closes, to some extent, the loophole in similarity-based detection.
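The perimetric complexity measure described above (squared perimeter divided by area) can be illustrated with a small sketch. The function name `perimetric_complexity`, the binary-mask input and the edge-counting approximation of the perimeter are assumptions made for illustration; they are not taken from the cited work.

```python
def perimetric_complexity(mask):
    """Return perimeter**2 / area for a binary mask (list of rows of 0/1).

    Assumption of this sketch: the perimeter is approximated by counting
    the foreground-cell edges that touch background or the image border.
    """
    rows, cols = len(mask), len(mask[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            area += 1
            # Check the four neighbours of this foreground cell.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                outside = nr < 0 or nr >= rows or nc < 0 or nc >= cols
                if outside or not mask[nr][nc]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / area
```

Under this approximation a fragmented region scores higher than a solid region of the same area, which matches the idea that broken characters and noise raise the complexity.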
[1] Hrishikesh Aradhye, Gregory Myers, and James Herson. Image analysis for efficient categorization of image-based spam e-mail. In Proceedings of Eighth International Conference on Document Analysis and Recognition, ICDAR 2005, volume 2, pages 914-918. IEEE Computer Society, 2005.
[2] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research, (7):2699-2720, 2006.
[3] Mark Dredze, Reuven Gevaryahu, and Ari Elias-Bachrach. Learning fast classifiers for image spam. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.
[4] Giorgio Fumera, Ignazio Pillai, Fabio Roli, and Battista Biggio. Image spam filtering using textual and visual information. MIT Spam Conference 2007, Cambridge, USA, March 2007.
[5] Zhe Wang, William Josephson, Qin Lv, Moses Charikar, and Kai Li. Filtering image spam with near-duplicate detection. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.
[6] Mehta, B., Nangia, S., Gupta, M., and Nejdl, W. Detecting image spam using visual features and near duplicate detection. In Proceeding of the 17th International Conference on World Wide Web (Beijing, China, April 21-25, 2008). WWW '08. ACM, New York, NY, 497-506.
[7] Haiqiang Zuo, Weiming Hu, Ou Wu, Yunfei Chen, Guan Luo. Detecting Image Spam Using Local Invariant Features and Pyramid Match Kernel. Proceedings of the 18th International Conference on World Wide Web, 2009, pages 1187-1188.
Summary of the invention
Technical problem: The purpose of the present invention is to provide a method for detecting image-based spam by utilizing image local invariant features. The method uses the local invariant regions that exist in spam images to train a classifier based on Gaussian mixture models, and then classifies the images under test in order to detect image-type spam.
Technical solution: The method proposed by the present invention uses the Speeded-Up Robust Features (SURF) algorithm to extract the invariant region features of the spam information in an image, thereby generating a feature vector for the image, and estimates the parameters of a Gaussian mixture model with the maximum likelihood estimation algorithm to obtain a classifier based on the Gaussian mixture model. The implementation of image-based spam detection comprises three modules: extraction of image features, estimation of the Gaussian mixture model parameters, and detection of image-type mail. The module composition of the system is shown in Fig. 1.
The implementation of the Gaussian mixture model classifier based on the local invariant regions of images comprises two stages, a training stage and a testing stage. The steps are as follows.
The steps of the training stage are:
One. First, train on the sample set:
Step 1) Label the image data set to be used for training, dividing it into spam images and normal images;
Step 2) Use the Speeded-Up Robust Features (SURF) algorithm to extract the local invariant feature descriptors of each spam image and each normal image; each local invariant feature descriptor is a vector;
Step 3) Use the k-means clustering algorithm to cluster the local invariant feature descriptors of all spam images and normal images in the training set, obtaining a number of cluster centres. Taking these cluster centres as reference points, project the local invariant feature descriptors of each image onto the reference points, so that each image is normalized into a vector of fixed dimension;
Step 4) Take the vectors of the normal images and of the spam images in the training set as training samples for Gaussian mixture models, and use maximum likelihood estimation to estimate the parameters of the Gaussian mixture model of the spam image set and of the normal image set respectively;
Step 5) The parameters of the Gaussian mixture models of the spam image set and the normal image set, obtained by maximum likelihood estimation, determine the distribution functions of the multivariate Gaussian mixture models of the normal image set and the spam image set.
Two. Then carry out the detection process:
Step 21) For an image to be detected, use the SURF algorithm to extract its local invariant feature descriptors;
Step 22) Taking the cluster centres from step 3) as reference points, normalize the local invariant feature descriptors from step 21) to obtain the vector of the image to be detected;
Step 23) Substitute the vector of the image under test into the distribution functions, computing the value of the normal-image Gaussian mixture model distribution function and the value of the spam-image Gaussian mixture model distribution function;
Step 24) Classify according to the distribution function values obtained in step 23): the image is assigned to the class whose value is larger.
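Step 24) can be written as a single decision rule. In the sketch below, $v$ denotes the vector from step 22) and $p_{spam}$, $p_{ham}$ denote the two mixture distribution functions from step 5); these symbols are notation assumed here for illustration and do not appear in the patent text.

```latex
\hat{c}(v) = \arg\max_{c \in \{\mathrm{spam},\,\mathrm{ham}\}} p_c(v),
\qquad
p_c(v) = \sum_{k} \pi_{c,k}\, f\!\left(v \mid \mu_{c,k}, \Sigma_{c,k}\right)
```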
Beneficial effects: The method of the invention uses the SURF algorithm to extract the invariant region features of the spam information in images and trains Gaussian mixture models to detect spam. Using the method can improve the precision and recall rate of spam detection and save program running time and space.
Description of drawings
Fig. 1 is the prototype of the classifier based on Gaussian mixture models,
Fig. 2 is the flow chart of the classifier based on Gaussian mixture models.
Embodiment
Image-based spam detection based on the local invariant features of images is implemented with VC++ 6.0 as the development tool, and image features are processed with the open-source library OpenCV 1.0. The detailed steps are as follows:
One. Training stage: obtain spam images and normal images to form the training set.
Step 1) Label the images of the data set to be used for training, denoting the spam images (image spam) as $I_i$ and the normal images (image ham) as $J_i$, where $i \in \{1, 2, \ldots, N\}$;
Step 2) Use the SURF (Speeded-Up Robust Features) algorithm to extract the local invariant feature descriptors of every image in $I_i$ and $J_i$; each descriptor of an image is an L-dimensional vector (L = 64);
Step 3) Use the k-means clustering algorithm to cluster the 64-dimensional local invariant feature descriptors of all spam images and normal images in the training set, obtaining 200 cluster centres. Taking these 200 cluster centres as reference points, project the local invariant feature descriptors of each image onto the reference points, so that each image is normalized into a 200-dimensional vector;
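One common way to read the projection in step 3) is as a bag-of-features histogram: each descriptor is assigned to its nearest cluster centre and the assignments are counted. The sketch below follows that reading, which is an assumption; the text does not spell out the projection. The names `nearest_center` and `picture_vector` are hypothetical, and the cluster centres are taken as given (for example, the output of k-means).

```python
def nearest_center(descriptor, centers):
    """Index of the cluster centre closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(centers):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def picture_vector(descriptors, centers):
    """Normalized histogram of nearest-centre assignments, one bin per centre.

    Assumption: "projecting onto the reference points" means counting how
    many descriptors fall nearest to each centre.
    """
    counts = [0] * len(centers)
    for descriptor in descriptors:
        counts[nearest_center(descriptor, centers)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(centers)
    return [count / total for count in counts]
```

With 200 centres and 64-dimensional descriptors this yields one 200-dimensional vector per image regardless of how many descriptors the image produced.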
Step 4) Step 3) yields the spam image feature vector library $F_{Spam} = \{F_{Spam}^{(1)}, F_{Spam}^{(2)}, \ldots, F_{Spam}^{(N)}\}$ and the normal image feature vector library $F_{Ham} = \{F_{Ham}^{(1)}, F_{Ham}^{(2)}, \ldots, F_{Ham}^{(N)}\}$;
Step 5) Taking the feature library $F_{Spam}$ as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image spam, $\theta_{Spam} = (\pi_1, \pi_2, \ldots, \pi_L;\ \mu_1, \mu_2, \ldots, \mu_L;\ \Sigma_1, \Sigma_2, \ldots, \Sigma_L)$;
The EM algorithm. E step:
$$w_{ij} = \frac{\pi_j f(p_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{L} \pi_k f(p_i \mid \mu_k, \Sigma_k)}$$
M step:
$$\hat{\pi}_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} w_{ij}$$
$$\hat{\mu}_j \leftarrow \frac{\sum_{i=1}^{n} w_{ij}\, p_i}{\sum_{i=1}^{n} w_{ij}}$$
$$\hat{\Sigma}_j \leftarrow \frac{\sum_{i=1}^{n} w_{ij}\, (p_i - \hat{\mu}_j)(p_i - \hat{\mu}_j)^{T}}{\sum_{i=1}^{n} w_{ij}}$$
where $p_i$ is a training sample, $n$ is the number of training samples, $\pi$ is the weight of the corresponding component in the Gaussian mixture model, $\mu$ is the mean of the distribution, $\Sigma$ is the variance (covariance matrix), and $L$ is the number of mixture components;
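The E and M steps above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes diagonal covariance matrices (each $\Sigma_j$ is stored as a list of per-dimension variances), works in plain probabilities rather than log space, and uses the hypothetical names `gaussian_pdf` and `em_step`. The variance floor and the empty-component guard are also additions of this sketch.

```python
import math


def gaussian_pdf(p, mean, var):
    """Gaussian density f(p | mean, var); var is the covariance diagonal (assumption)."""
    log_density = 0.0
    for x, m, v in zip(p, mean, var):
        log_density -= 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_density)


def em_step(samples, weights, means, variances, floor=1e-6):
    """One EM iteration; returns the updated (weights, means, variances)."""
    n, dims, comps = len(samples), len(samples[0]), len(weights)
    # E step: w[i][j] = pi_j f(p_i | mu_j, Sigma_j) / sum_k pi_k f(p_i | mu_k, Sigma_k)
    resp = []
    for p in samples:
        joint = [weights[j] * gaussian_pdf(p, means[j], variances[j])
                 for j in range(comps)]
        total = sum(joint)
        if total == 0.0:
            # Densities underflowed; fall back to equal responsibilities.
            resp.append([1.0 / comps] * comps)
        else:
            resp.append([value / total for value in joint])
    # M step: re-estimate the weight, mean and variance of every component.
    new_weights, new_means, new_vars = [], [], []
    for j in range(comps):
        mass = max(sum(resp[i][j] for i in range(n)), 1e-12)  # guard: empty component
        mean = [sum(resp[i][j] * samples[i][d] for i in range(n)) / mass
                for d in range(dims)]
        var = [max(floor,
                   sum(resp[i][j] * (samples[i][d] - mean[d]) ** 2
                       for i in range(n)) / mass)
               for d in range(dims)]
        new_weights.append(mass / n)
        new_means.append(mean)
        new_vars.append(var)
    return new_weights, new_means, new_vars
```

Calling `em_step` repeatedly until the parameters stop changing gives the estimate of $\theta$. For 200-dimensional vectors a practical version would likely compute the responsibilities in log space to avoid underflow.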
Step 6) Obtain the Gaussian mixture distribution function of the local invariant features of image spam:
$$\mathrm{Classifer}(\theta_{spam}) = \sum_{k=1}^{L} \pi_k f(p \mid \mu_k, \Sigma_k);$$
Step 7) Taking the feature library $F_{Ham}$ as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image ham, in the same way as step 5);
Step 8) Obtain the distribution function of the local invariant features of image ham:
$$\mathrm{Classifer}(\theta_{ham}) = \sum_{k=1}^{K} \pi_k f(p \mid \mu_k, \Sigma_k).$$
Two. The steps of the detection stage are:
Step 1) The image data set to be detected is $T_j$, $j \in \{1, 2, \ldots, M\}$, where M is the number of images to be detected;
Step 2) Use the SURF algorithm to extract the local invariant feature descriptors of every image in $T_j$, in the same way as step 2) of the training stage;
Step 3) Use the k-means algorithm to convert the extracted local invariant feature descriptors into a feature vector of length SIZE, in the same way as step 3) of the training stage;
Step 4) The feature vectors obtained after clustering form the feature database $F_{Test} = \{F_{Test}^{(1)}, F_{Test}^{(2)}, \ldots, F_{Test}^{(N)}\}$;
Step 5) Taking the feature vector library $F_{Test}$ as sample values, compute the distances $d_j^{Ham}$ and $d_j^{Spam}$ to the Gaussian mixture distribution functions $\mathrm{Classifer}(\theta_{Ham})$ and $\mathrm{Classifer}(\theta_{Spam})$ respectively, where $d_j^{Ham}$ is the distance between image j and the ham Gaussian mixture distribution function and $d_j^{Spam}$ is the distance between image j and the spam Gaussian mixture distribution function;
Step 6) If $d_j^{Ham}$ is greater than $d_j^{Spam}$, the j-th image in $T_j$ is a spam image; otherwise it is a ham image;
Step 7) Repeat steps 2) to 6) until every image in the set to be detected has been examined.
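The decision in steps 5) and 6) can be sketched as follows. The text does not define the "distance" to a mixture distribution function; this sketch assumes it is the negative log of the mixture density, so that a larger distance corresponds to a smaller distribution function value. It also assumes diagonal covariances and represents a model as a list of `(weight, mean, variances)` tuples. The names `log_gaussian`, `mixture_distance` and `classify` are hypothetical.

```python
import math


def log_gaussian(p, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(p, mean, var))


def mixture_distance(p, model):
    """Negative log of sum_k pi_k f(p | mu_k, Sigma_k), computed with log-sum-exp.

    Assumption: this is one possible reading of the "distance" in step 5).
    """
    logs = [math.log(w) + log_gaussian(p, m, v) for w, m, v in model]
    peak = max(logs)
    return -(peak + math.log(sum(math.exp(value - peak) for value in logs)))


def classify(p, spam_model, ham_model):
    """Step 6): spam if the ham distance exceeds the spam distance, else ham."""
    d_spam = mixture_distance(p, spam_model)
    d_ham = mixture_distance(p, ham_model)
    return "spam" if d_ham > d_spam else "ham"
```

A usage example with made-up two-dimensional models: `classify([0.85, 0.15], [(1.0, [0.9, 0.1], [0.01, 0.01])], [(1.0, [0.2, 0.8], [0.01, 0.01])])` compares the two distances and labels the vector with the closer model.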

Claims (1)

1. A method for detecting image-based spam by utilizing image local invariant features, characterized in that the method mainly comprises the following steps:
One. First, train on the sample set:
Step 1) Label the image data set to be used for training, dividing it into spam images and normal images;
Step 2) Use the Speeded-Up Robust Features (SURF) algorithm to extract the local invariant feature descriptors of each spam image and each normal image; each local invariant feature descriptor is a vector;
Step 3) Use the k-means clustering algorithm to cluster the local invariant feature descriptors of all spam images and normal images in the training set, obtaining a number of cluster centres. Taking these cluster centres as reference points, project the local invariant feature descriptors of each image onto the reference points, so that each image is normalized into a vector of fixed dimension;
Step 4) Take the vectors of the normal images and of the spam images in the training set as training samples for Gaussian mixture models, and use maximum likelihood estimation to estimate the parameters of the Gaussian mixture model of the spam image set and of the normal image set respectively;
Step 5) The parameters of the Gaussian mixture models of the spam image set and the normal image set, obtained by maximum likelihood estimation, determine the distribution functions of the multivariate Gaussian mixture models of the normal image set and the spam image set.
Two. Then carry out the detection process:
Step 21) For an image to be detected, use the SURF algorithm to extract its local invariant feature descriptors;
Step 22) Taking the cluster centres from step 3) as reference points, normalize the local invariant feature descriptors from step 21) to obtain the vector of the image to be detected;
Step 23) Substitute the vector of the image under test into the distribution functions, computing the value of the normal-image Gaussian mixture model distribution function and the value of the spam-image Gaussian mixture model distribution function;
Step 24) Classify according to the distribution function values obtained in step 23): the image is assigned to the class whose value is larger.
CN2010101399468A 2010-04-02 2010-04-02 Method for detecting image-based spam by utilizing image local invariant feature Expired - Fee Related CN101819637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101399468A CN101819637B (en) 2010-04-02 2010-04-02 Method for detecting image-based spam by utilizing image local invariant feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101399468A CN101819637B (en) 2010-04-02 2010-04-02 Method for detecting image-based spam by utilizing image local invariant feature

Publications (2)

Publication Number Publication Date
CN101819637A true CN101819637A (en) 2010-09-01
CN101819637B CN101819637B (en) 2012-02-22

Family

ID=42654733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101399468A Expired - Fee Related CN101819637B (en) 2010-04-02 2010-04-02 Method for detecting image-based spam by utilizing image local invariant feature

Country Status (1)

Country Link
CN (1) CN101819637B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663435A (en) * 2012-04-28 2012-09-12 南京邮电大学 Junk image filtering method based on semi-supervision
CN102682007A (en) * 2011-03-11 2012-09-19 阿里巴巴集团控股有限公司 Method and device for creating image database
CN104036285A (en) * 2014-05-12 2014-09-10 新浪网技术(中国)有限公司 Spam image recognition method and system
CN106384124A (en) * 2016-09-05 2017-02-08 华东师范大学 Plastic package mail image address block location method
CN107832925A (en) * 2017-10-20 2018-03-23 阿里巴巴集团控股有限公司 Internet content risk evaluating method, device and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020129038A1 (en) * 2000-12-18 2002-09-12 Cunningham Scott Woodroofe Gaussian mixture models in a data mining system
CN1787076A (en) * 2005-12-13 2006-06-14 浙江大学 Method for distinguishing speek person based on hybrid supporting vector machine
CN101140624A (en) * 2007-10-18 2008-03-12 清华大学 Image matching method

Also Published As

Publication number Publication date
CN101819637B (en) 2012-02-22

Similar Documents

Publication Publication Date Title
CN102129568B (en) Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier
CN101887523B (en) Method for detecting image spam email by picture character and local invariant feature
Lu et al. Standard detectors aren't (currently) fooled by physical adversarial stop signs
CN105095856B (en) Face identification method is blocked based on mask
CN107506702A (en) Human face recognition model training and test system and method based on multi-angle
CN102184419B (en) Pornographic image recognizing method based on sensitive parts detection
CN102938054B (en) Method for recognizing compressed-domain sensitive images based on visual attention models
US20070058836A1 (en) Object classification in video data
CN104408475B (en) A kind of licence plate recognition method and car license recognition equipment
CN103778409A (en) Human face identification method based on human face characteristic data mining and device
CN105809178A (en) Population analyzing method based on human face attribute and device
CN110070090A (en) A kind of logistic label information detecting method and system based on handwriting identification
CN101819637B (en) Method for detecting image-based spam by utilizing image local invariant feature
CN102915453B (en) Real-time feedback and update vehicle detection method
CN102968637A (en) Complicated background image and character division method
CN101661559A (en) Digital image training and detecting methods
CN104268528A (en) Method and device for detecting crowd gathered region
CN110263712A (en) A kind of coarse-fine pedestrian detection method based on region candidate
CN110334602B (en) People flow statistical method based on convolutional neural network
Li et al. Fast and effective text detection
CN112364803A (en) Living body recognition auxiliary network and training method, terminal, equipment and storage medium
CN108960175A (en) A kind of licence plate recognition method based on deep learning
CN102103700A (en) Land mobile distance-based image spam similarity-detection method
Liu et al. Efficient modeling of spam images
Xu et al. Occlusion problem-oriented adversarial faster-RCNN scheme

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120222

Termination date: 20150402

EXPY Termination of patent right or utility model