CN101819637A - Method for detecting image-based spam by utilizing image local invariant feature - Google Patents
Method for detecting image-based spam by utilizing image local invariant feature
- Publication number
- CN101819637A (application CN 201010139946)
- Authority
- CN
- China
- Prior art keywords
- picture
- image
- spam
- invariant feature
- local invariant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention relates to a method for detecting image-based spam by utilizing image local invariant features. The method extracts the invariant-region features of the spam information in an image with the Speeded-Up Robust Features (SURF) algorithm, generates a feature vector for the image from them, and estimates the parameters of a Gaussian mixture model (GMM) with a maximum likelihood algorithm, which yields a classifier based on the Gaussian mixture model. Experiments show that the method can improve the recall rate of spam detection and save computation time and space. The implementation comprises three modules: extraction of the image features, estimation of the Gaussian mixture model parameters, and detection of image-based spam.
Description
Technical field
The present invention is an implementation scheme for detecting image spam e-mail that uses the local invariant features of spam pictures to train Gaussian mixture models. It mainly addresses the low detection efficiency and low recall rate of current techniques on image-type spam, and belongs to the field of data mining and machine learning.
Background technology
E-mail has become an important channel for communication over the Internet, but huge commercial, economic and political interests have caused the volume of spam to swell sharply. The spam that first became prevalent embedded junk information such as advertisements into images in written form; Hrishikesh et al. used text and color features mined from such images to classify mail [1]. In 2006, Fumera et al. proposed using OCR (optical character recognition) to detect the text information in image spam, which gave better detection results than other filtering systems [2]. Meanwhile, spammers kept strengthening their ability to evade detection systems: they obfuscated the images carrying junk information such as advertisements, which left OCR techniques unable to cope. Dredze et al. proposed classifying pictures by their high-level features, such as file format, size and color distribution [3]. The advantage is that these features are faster to compute than low-level features such as picture edge features, and the approach extends well, so it can be combined effectively with image filters based on low-level features.
In 2007, Fumera proposed computing the perimetric complexity of an image to decide whether a picture has been processed with obfuscation techniques [4]. The degree of obfuscation of a picture can be measured by its perimetric complexity, which is computed as the ratio of the square of the text-region perimeter to the text-region area. Broken characters or noise objects can be identified from the perimetric complexity of the text. Because an obfuscated image is not necessarily an image carrying junk information, this technique can only serve as a preprocessing module in a spam filtering system. The image spam filtering method proposed by Wang et al. compares the similarity between images [5] and combines three kinds of image spam filters (a color histogram filter, a Haar wavelet filter and an orientation histogram filter). The experimental results show that, when each filter runs on its own, the wavelet filter achieves the best detection rate, with a false-detection rate (a spam picture identified as a normal picture) below 0.0009%, and that the three filters combined reach an accuracy of 96%. Because the method is built by combining existing filtering systems, it can be regarded as a summary of the spam filtering techniques of one period; using it improves the performance of an image spam filtering system.

In 2008, Mehta et al. targeted spam generated in large volumes from templates by exploiting near-duplicate similarity, and reached an accuracy of 98% with SVM classification. They also proposed an algorithm that clusters pictures with GMMs [6]: every picture is scaled down to 100 × 100 pixels, the texture, shape and color features of each pixel are extracted, a GMM is trained for every picture, the pictures are clustered according to the similarity computed between them, and spam pictures are identified by a computed threshold. Although this method is accurate, its computational cost is too large and its time complexity is high, which hinders practical application. Subsequently, Zuo et al. proposed classifying the local invariant features of e-mail images with an SVM classifier whose kernel function is the PMK (pyramid match kernel) [7]. This method is aimed mainly at spam that changes the overall layout of the image in order to evade filters based on image-template similarity, but leaves certain marks in the picture unchanged. To some extent it therefore closes the loophole left by similarity detection.
[1] Hrishikesh Aradhye, Gregory Myers, and James Herson. Image analysis for efficient categorization of image-based spam e-mail. In Proceedings of Eighth International Conference on Document Analysis and Recognition, ICDAR 2005, volume 2, pages 914-918. IEEE Computer Society, 2005.
[2] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research, (7):2699-2720, 2006.
[3] Mark Dredze, Reuven Gevaryahu, and Ari Elias-Bachrach. Learning fast classifiers for image spam. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.
[4] Giorgio Fumera, Ignazio Pillai, Fabio Roli, and Battista Biggio. Image spam filtering using textual and visual information. MIT Spam Conference 2007, Cambridge, USA, March 2007.
[5] Zhe Wang, William Josephson, Qin Lv, Moses Charikar, and Kai Li. Filtering image spam with near-duplicate detection. In Proceedings of the Fourth Conference on Email and Anti-Spam, CEAS'2007, 2007.
[6] Mehta, B., Nangia, S., Gupta, M., and Nejdl, W. Detecting image spam using visual features and near duplicate detection. In Proceeding of the 17th International Conference on World Wide Web (Beijing, China, April 21-25, 2008). WWW '08. ACM, New York, NY, 497-506.
[7] Haiqiang Zuo, Weiming Hu, Ou Wu, Yunfei Chen, Guan Luo. Detecting Image Spam Using Local Invariant Features and Pyramid Match Kernel. Proceedings of the 18th International Conference on World Wide Web, 2009, pages 1187-1188.
Summary of the invention
Technical problem: the purpose of this invention is to provide a method for detecting image spam e-mail by using the local invariant features of pictures. The method uses the local invariant regions that exist in spam pictures to train a classifier based on Gaussian mixture models, and then classifies the pictures to be tested in order to detect image-type spam.
Technical solution: the method proposed by the present invention uses the Speeded-Up Robust Features (SURF) algorithm to extract the invariant-region features of the junk information in a picture, generates the feature vector of the picture from them, and estimates the parameters of the Gaussian mixture models with a maximum likelihood estimation algorithm, which yields a classifier based on Gaussian mixture models. The overall method for detecting image spam e-mail comprises three modules: extraction of the picture features, estimation of the Gaussian mixture model parameters, and detection of image-type mail. The composition of the system modules is shown in Fig. 1.
The implementation of the Gaussian mixture model classifier based on picture local invariant regions comprises two stages, a training stage and a test stage. The steps are as follows.
The steps of the training stage are:
One, first train on the sample set:
Step 1) Label the picture data set to be trained, dividing it into spam pictures and normal pictures;
Step 2) Use the SURF algorithm to extract the local invariant feature descriptors of every spam picture and every normal picture; each local invariant feature descriptor is a vector;
Step 3) Use the k-means clustering algorithm to cluster the local invariant feature descriptors of all spam pictures and normal pictures in the training set, finally obtaining a number of cluster centers. Taking these cluster centers as reference points, project the local invariant feature descriptors of each picture onto the reference points, so that each picture is standardized as a vector of fixed dimension;
Step 4) Take the vectors corresponding to the normal pictures and the spam pictures in the training set as the training samples of the Gaussian mixture models, and use the maximum likelihood estimation algorithm to estimate the parameters of the Gaussian mixture model of the spam picture set and of the normal picture set respectively;
Step 5) The parameters obtained by maximum likelihood estimation for the spam picture set and the normal picture set determine the distribution functions of the multivariate Gaussian mixture models of the normal picture set and the spam picture set.
Two, then carry out the detection process:
Step 21) For a picture to be detected, use the SURF algorithm to extract its local invariant feature descriptors;
Step 22) Taking the cluster centers from step 3) as reference points, standardize the local invariant feature descriptors from step 21) to obtain the vector of the picture to be detected;
Step 23) Substitute the vector of the picture to be detected into the distribution functions, computing the value of the distribution function of the normal-picture Gaussian mixture model and the value of the distribution function of the spam-picture Gaussian mixture model;
Step 24) Classify according to the distribution function values obtained in step 23): the picture belongs to the class whose value is larger.
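Steps 3) and 22) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that "projecting onto the reference points" means assigning each descriptor to its nearest cluster center and counting the assignments (a bag-of-visual-words histogram), it uses a plain NumPy k-means rather than any particular library, and it replaces real SURF descriptors with random 64-dimensional vectors. The names `kmeans` and `encode` are hypothetical, and the demo uses 8 centers instead of a realistic codebook size to keep it small.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of cluster centers."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct training descriptors chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every descriptor to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster becomes empty
                centers[j] = members.mean(axis=0)
    return centers

def encode(descriptors, centers):
    """Turn one picture's descriptors into a fixed-length, normalized histogram."""
    descriptors = np.asarray(descriptors, dtype=float)
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(dist.argmin(axis=1), minlength=len(centers)).astype(float)
    return counts / max(counts.sum(), 1.0)

# Stand-in data (assumption): three "pictures" with 30, 45 and 25 descriptors each.
rng = np.random.default_rng(42)
pictures = [rng.normal(size=(n, 64)) for n in (30, 45, 25)]
centers = kmeans(np.vstack(pictures), k=8)
vectors = np.array([encode(p, centers) for p in pictures])
print(vectors.shape)
```

Whatever the number of descriptors a picture yields, `encode` returns a vector whose length equals the number of cluster centers, which is what allows pictures to be compared by a single model.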
Beneficial effects: the method of the invention uses the SURF algorithm to extract the invariant-region features of the junk information in a picture and trains Gaussian mixture models to detect spam. Using the method can improve the precision and recall rate of spam detection and save program running time and space.
Description of drawings
Fig. 1 is the prototype of the classifier based on Gaussian mixture models;
Fig. 2 is the flow chart of the classifier based on Gaussian mixture models.
Embodiment
Image spam e-mail detection based on the local invariant features of pictures is implemented with VC++ 6.0 as the development tool, and the open-source library OpenCV 1.0 is used for processing the image features. The detailed steps are as follows:
One, the training stage: obtain spam pictures and normal pictures to form the training set.
Step 1) Label the pictures of the data set to be trained, denoting the spam pictures (image spam) as $I_i$ and the normal pictures (image ham) as $J_i$, where $i \in \{1, 2, \ldots, N\}$;
Step 2) Use the SURF (Speeded-Up Robust Features) algorithm to extract the local invariant feature descriptors of every picture in $I_i$ and $J_i$, where each descriptor of a picture is an $L$-dimensional vector ($L = 64$);
Step 3) Use the k-means clustering algorithm to cluster the 64-dimensional local invariant feature descriptors of every spam picture and normal picture in the training set, finally obtaining 200 cluster centers. Taking these 200 cluster centers as reference points, project the local invariant feature descriptors of each picture onto the reference points, so that each picture is standardized as a 200-dimensional vector;
Step 4) Step 3) yields the spam picture feature vector library $F_{Spam} = \{F_{Spam}^{(1)}, F_{Spam}^{(2)}, \ldots, F_{Spam}^{(N)}\}$ and the normal picture feature vector library $F_{Ham} = \{F_{Ham}^{(1)}, F_{Ham}^{(2)}, \ldots, F_{Ham}^{(N)}\}$;
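Step 5) Taking the feature library $F_{Spam}$ as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image spam, $\theta_{Spam} = (\pi_1, \pi_2, \ldots, \pi_L;\ \mu_1, \mu_2, \ldots, \mu_L;\ \Sigma_1, \Sigma_2, \ldots, \Sigma_L)$;
The EM algorithm consists of an E step and an M step (the formula images for the two steps are missing from this text),
where $p_i$ is a training sample, $\pi$ is the corresponding weight in the Gaussian mixture model, $\mu$ is the mean of the distribution, $\Sigma$ is the variance, and $L$ is the number of components of the mixture model (not the descriptor dimension $L = 64$ of step 2));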
Step 6) This yields the Gaussian mixture distribution function of the local invariant features of image spam;
Step 7) Taking the feature library $F_{Ham}$ as samples, use the EM algorithm to estimate the parameters of the Gaussian mixture model of image ham, on the same principle as step 5);
Step 8) This yields the distribution function of the local invariant features of image ham.
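The E step and M step of the EM algorithm used in steps 5) and 7) can be written out in the standard textbook form for a Gaussian mixture. The block below is that standard form, expressed with the symbols defined in step 5) ($p_i$, $\pi$, $\mu$, $\Sigma$, $L$ components, $N$ training samples). The responsibility $\gamma_{ik}$ and the effective count $N_k$ are notation introduced here, and it is an assumption that the patent's own formulas take this form.

```latex
% Mixture density with L components
p(x \mid \theta) = \sum_{k=1}^{L} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{L} \pi_k = 1

% E step: responsibility of component k for training sample p_i
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(p_i \mid \mu_k, \Sigma_k)}
                   {\sum_{l=1}^{L} \pi_l \, \mathcal{N}(p_i \mid \mu_l, \Sigma_l)}

% M step: re-estimate the parameters from the responsibilities
N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{N}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, p_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (p_i - \mu_k)(p_i - \mu_k)^{\top}
```

The two steps are repeated until the likelihood of the training samples stops improving; each iteration does not decrease that likelihood, which is why EM serves as the maximum likelihood estimation procedure here.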
Two, the steps of the detection stage are:
Step 1) Let the picture data set to be detected be $T_j$, $j \in \{1, 2, \ldots, M\}$, where $M$ is the number of pictures to be detected;
Step 2) Use the SURF algorithm to extract the local invariant feature descriptors of every picture in $T_j$, on the same principle as step 2) of the training stage;
Step 3) Use the k-means algorithm to turn the extracted local invariant feature descriptors into a feature vector of length SIZE, on the same principle as step 3) of the training stage;
Step 4) The feature vectors obtained after clustering form the feature database $F_{Test} = \{F_{Test}^{(1)}, F_{Test}^{(2)}, \ldots, F_{Test}^{(M)}\}$;
Step 5) Taking the feature vector library $F_{Test}$ as sample values, compute their distances $d_j^{Ham}$ and $d_j^{Spam}$ to the Gaussian mixture distribution functions $\mathrm{Classifier}(\theta_{Ham})$ and $\mathrm{Classifier}(\theta_{Spam})$ respectively, where $d_j^{Ham}$ is the distance between picture $j$ and the ham Gaussian mixture distribution function, and $d_j^{Spam}$ is the distance between picture $j$ and the spam Gaussian mixture distribution function;
Step 6) If $d_j^{Ham}$ is greater than $d_j^{Spam}$, picture $j$ in $T_j$ is a spam picture; otherwise it is a ham picture;
Step 7) Repeat steps 2) to 6) until every picture in the set to be detected has been examined.
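The training and detection stages can be sketched end to end as follows, under several stated assumptions. The sketch fits one Gaussian mixture per class with EM and assigns a picture to the class whose mixture gives the larger log-likelihood; this follows the "larger distribution function value" rule, and a distance $d$ as in step 5) would be ordered the other way round. It assumes diagonal covariance matrices to keep the example numerically stable (the source does not state the covariance structure), and it takes picture vectors as plain NumPy arrays. All function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """Log density of each row of x under each diagonal Gaussian; shape (n, k)."""
    diff = x[:, None, :] - mu[None, :, :]
    return -0.5 * (np.log(2 * np.pi * var).sum(axis=1)[None, :]
                   + (diff ** 2 / var[None, :, :]).sum(axis=2))

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def fit_gmm(x, k, iters=50, seed=0, eps=1e-6):
    """EM for a k-component GMM with diagonal covariances (an assumption)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    pi = np.full(k, 1.0 / k)
    mu = x[rng.choice(n, size=k, replace=False)].copy()
    var = np.tile(x.var(axis=0) + eps, (k, 1))
    for _ in range(iters):
        # E step: responsibilities of each component for each sample.
        log_r = np.log(pi)[None, :] + log_gauss_diag(x, mu, var)
        r = np.exp(log_r - logsumexp(log_r, axis=1)[:, None])
        # M step: re-estimate weights, means and variances.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r.T @ x) / nk[:, None]
        var = np.maximum((r.T @ (x ** 2)) / nk[:, None] - mu ** 2, eps)
    return pi, mu, var

def log_likelihood(x, model):
    """Log of the mixture density for every row of x."""
    pi, mu, var = model
    x = np.asarray(x, dtype=float)
    return logsumexp(np.log(pi)[None, :] + log_gauss_diag(x, mu, var), axis=1)

def classify(x, spam_model, ham_model):
    """Label each row 'spam' or 'ham' by the larger mixture log-likelihood."""
    spam_ll = log_likelihood(x, spam_model)
    ham_ll = log_likelihood(x, ham_model)
    return np.where(spam_ll > ham_ll, "spam", "ham")
```

In use, `fit_gmm` would be called once on the spam picture vectors and once on the normal picture vectors, and `classify` would then be applied to the vectors of the pictures to be detected. The number of components `k` is a free parameter that the text does not fix.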
Claims (1)
1. A method for detecting image spam e-mail by using the local invariant features of pictures, characterized in that the method mainly comprises the following steps:
One, first train on the sample set:
Step 1) Label the picture data set to be trained, dividing it into spam pictures and normal pictures;
Step 2) Use the Speeded-Up Robust Features (SURF) algorithm to extract the local invariant feature descriptors of every spam picture and every normal picture, each local invariant feature descriptor being a vector;
Step 3) Use the k-means clustering algorithm to cluster the local invariant feature descriptors of all spam pictures and normal pictures in the training set, finally obtaining a number of cluster centers; taking these cluster centers as reference points, project the local invariant feature descriptors of each picture onto the reference points, so that each picture is standardized as a vector of fixed dimension;
Step 4) Take the vectors corresponding to the normal pictures and the spam pictures in the training set as the training samples of the Gaussian mixture models, and use the maximum likelihood estimation algorithm to estimate the parameters of the Gaussian mixture model of the spam picture set and of the normal picture set respectively;
Step 5) The parameters obtained by maximum likelihood estimation for the spam picture set and the normal picture set determine the distribution functions of the multivariate Gaussian mixture models of the normal picture set and the spam picture set.
Two, then carry out the detection process:
Step 21) For a picture to be detected, use the SURF algorithm to extract its local invariant feature descriptors;
Step 22) Taking the cluster centers from step 3) as reference points, standardize the local invariant feature descriptors from step 21) to obtain the vector of the picture to be detected;
Step 23) Substitute the vector of the picture to be detected into the distribution functions, computing the value of the distribution function of the normal-picture Gaussian mixture model and the value of the distribution function of the spam-picture Gaussian mixture model;
Step 24) Classify according to the distribution function values obtained in step 23): the picture belongs to the class whose value is larger.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010101399468A CN101819637B (en) | 2010-04-02 | 2010-04-02 | Method for detecting image-based spam by utilizing image local invariant feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010101399468A CN101819637B (en) | 2010-04-02 | 2010-04-02 | Method for detecting image-based spam by utilizing image local invariant feature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101819637A (en) | 2010-09-01 |
CN101819637B (en) | 2012-02-22 |
Family ID: 42654733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010101399468A Expired - Fee Related CN101819637B (en) | 2010-04-02 | 2010-04-02 | Method for detecting image-based spam by utilizing image local invariant feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101819637B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663435A (en) * | 2012-04-28 | 2012-09-12 | 南京邮电大学 | Junk image filtering method based on semi-supervision |
CN102682007A (en) * | 2011-03-11 | 2012-09-19 | 阿里巴巴集团控股有限公司 | Method and device for creating image database |
CN104036285A (en) * | 2014-05-12 | 2014-09-10 | 新浪网技术(中国)有限公司 | Spam image recognition method and system |
CN106384124A (en) * | 2016-09-05 | 2017-02-08 | 华东师范大学 | Plastic package mail image address block location method |
CN107832925A (en) * | 2017-10-20 | 2018-03-23 | 阿里巴巴集团控股有限公司 | Internet content risk evaluating method, device and server |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020129038A1 (en) * | 2000-12-18 | 2002-09-12 | Cunningham Scott Woodroofe | Gaussian mixture models in a data mining system |
CN1787076A (en) * | 2005-12-13 | 2006-06-14 | 浙江大学 | Method for distinguishing speek person based on hybrid supporting vector machine |
CN101140624A (en) * | 2007-10-18 | 2008-03-12 | 清华大学 | Image matching method |
Also Published As
Publication number | Publication date |
---|---|
CN101819637B (en) | 2012-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102129568B (en) | Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier | |
CN101887523B (en) | Method for detecting image spam email by picture character and local invariant feature | |
Lu et al. | Standard detectors aren't (currently) fooled by physical adversarial stop signs | |
CN105095856B (en) | Face identification method is blocked based on mask | |
CN107506702A (en) | Human face recognition model training and test system and method based on multi-angle | |
CN102184419B (en) | Pornographic image recognizing method based on sensitive parts detection | |
CN102938054B (en) | Method for recognizing compressed-domain sensitive images based on visual attention models | |
US20070058836A1 (en) | Object classification in video data | |
CN104408475B (en) | A kind of licence plate recognition method and car license recognition equipment | |
CN103778409A (en) | Human face identification method based on human face characteristic data mining and device | |
CN105809178A (en) | Population analyzing method based on human face attribute and device | |
CN110070090A (en) | A kind of logistic label information detecting method and system based on handwriting identification | |
CN101819637B (en) | Method for detecting image-based spam by utilizing image local invariant feature | |
CN102915453B (en) | Real-time feedback and update vehicle detection method | |
CN102968637A (en) | Complicated background image and character division method | |
CN101661559A (en) | Digital image training and detecting methods | |
CN104268528A (en) | Method and device for detecting crowd gathered region | |
CN110263712A (en) | A kind of coarse-fine pedestrian detection method based on region candidate | |
CN110334602B (en) | People flow statistical method based on convolutional neural network | |
Li et al. | Fast and effective text detection | |
CN112364803A (en) | Living body recognition auxiliary network and training method, terminal, equipment and storage medium | |
CN108960175A (en) | A kind of licence plate recognition method based on deep learning | |
CN102103700A (en) | Land mobile distance-based image spam similarity-detection method | |
Liu et al. | Efficient modeling of spam images | |
Xu et al. | Occlusion problem-oriented adversarial faster-RCNN scheme |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120222; Termination date: 20150402 |
EXPY | Termination of patent right or utility model | |