CN109347719A - Image spam filtering method based on machine learning - Google Patents

Image spam filtering method based on machine learning

Info

Publication number
CN109347719A
Authority
CN
China
Prior art keywords
image
mail
spam
classifier
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811053556.1A
Other languages
Chinese (zh)
Other versions
CN109347719B (en)
Inventor
赵俊生
候圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology
Priority to CN201811053556.1A
Publication of CN109347719A
Application granted
Publication of CN109347719B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/285: Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to an image spam filtering method based on machine learning and belongs to the fields of computer science and artificial intelligence. Based on the characteristics of image spam, the HSV color histogram feature and the texture feature, which discriminate spam images well, are chosen as the basic data for image classification. These two features are applied to machine learning algorithms based on the K-NN algorithm, the naive Bayes algorithm, the discriminant analysis algorithm, the SVM algorithm and the random forest algorithm, and an ensemble learning algorithm is proposed that combines the strengths of the individual algorithms. Experiments determine which algorithm suits which image feature, and experimental analysis of the method's parameters shows that the best classification is obtained with a 16-dimensional HSV color histogram and K = 5 for the K-NN algorithm. The method raises the precision, recall and F value of image spam filtering to 97% simultaneously and reduces the misjudgment rate to below 3%.

Description

Image spam filtering method based on machine learning
Technical field
The invention relates to an image spam filtering method based on machine learning and belongs to the field of artificial intelligence within the discipline of computer science and technology.
Background art
To evade text-based spam filtering, spammers now present spam content as images and send the images by email, so filtering image spam has become a new problem in urgent need of a solution. In terms of resource use, an image spam message occupies tens of times the space of a plain-text message; transmitting it wastes a large amount of network bandwidth, and storing it consumes a large amount of personal storage. In terms of social impact, effectively filtering spam that contains undesirable images can, to some extent, curb the negative effects of harmful information such as advertising and fraud. In terms of research, collecting spam images contributes to a database for spam filtering methods in China and also provides a new approach to image-based spam filtering.
Existing image spam filtering techniques include blacklists that restrict IP addresses and filters that combine extracted text features or simple image features with machine learning algorithms. However, most of the feature data and machine learning algorithms used are relatively simple, and most experiments use standard foreign spam image samples as the data source, so they are poorly targeted at image spam in China. In addition, the false alarm rate of existing image spam filtering methods remains high. It is therefore necessary to collect the images found in mailboxes, analyze and compare them, build an image library suitable for spam filtering, and label the images in the library. On this basis, the image features used for spam filtering are analyzed more comprehensively, including color features (the HSV (Hue, Saturation, Value) color histogram and color moments), texture features and shape features, and the features suited to spam filtering are identified among these basic image features. The extracted image feature data are applied to machine learning algorithms such as the K-NN (K-Nearest Neighbor) algorithm, the naive Bayes algorithm, the discriminant analysis algorithm, the SVM (Support Vector Machine) algorithm and the random forest algorithm, and the strengths of the individual algorithms are combined into an ensemble learning algorithm. Experiments determine which algorithm suits which image feature, and the best parameter configuration of the method is analyzed experimentally and then fixed.
Therefore, individuals, enterprises, government agencies and public institutions all urgently need an effective spam filtering method to improve the existing email environment.
Summary of the invention
The purpose of the invention is to address the problem that spam, and image spam in particular, seriously harms network and personal privacy and security and greatly disrupts work and life. The invention proposes an image spam filtering method based on machine learning: a new combined filtering method based on voting over result labels. Applied to a varied collection of spam images from China, the method achieves high precision, recall and overall F value, and gives mail service providers a technical means of filtering image spam effectively.
The method mainly addresses the following defects of existing image spam filtering methods. First, no image database of image spam from China has been established, which makes it difficult to guarantee the accuracy of the basic data for subsequent image features. Second, the image features and machine learning algorithms used are too limited, so it is difficult to improve the precision and recall of image spam filtering at the same time, and the false alarm rate of the filtering methods remains high.
The core idea of the invention is as follows. Based on the characteristics of image spam, the HSV color histogram feature and the texture feature, which discriminate spam images well, are chosen as the basic data for image classification. These two features are applied to machine learning algorithms based on the K-NN algorithm, the naive Bayes algorithm, the discriminant analysis algorithm, the SVM algorithm and the random forest algorithm, and an ensemble learning algorithm is proposed that combines the strengths of the individual algorithms. Experiments determine which algorithm suits which image feature, and experimental analysis of the method's parameters shows that the best classification is obtained with a 16-dimensional HSV color histogram and K = 5 for the K-NN algorithm.
The definitions relevant to the invention are as follows:
Definition 1. Image spam email: an image-containing email, in any of various forms, that the recipient has not personally requested or agreed to receive and cannot refuse, and that carries in image form biased information with improper political motives, false or concealed fraudulent information, information about pornography, gambling or drugs, or advertising; such an email is called image spam email;
Definition 2. Image-type regular email: an image-containing email that the recipient is willing to receive, that has practical significance and value to the recipient, and that contains no harmful information; such an email is called image-type regular email;
Image-type regular email and image spam email are collectively referred to as image-type email;
An image spam filtering method based on machine learning, comprising the following steps:
Step 1: collect a large number of spam images and regular email images through channels based on the Internet and email recipients, obtain a comprehensive spam image database and a comprehensive regular email image database, and generate a training set and a test set from these two databases;
Wherein X% of the spam image database and X% of the regular email image database form the training set, and Y% of the spam image database and Y% of the regular email image database form the test set; X% and Y% sum to 1;
Step 1 specifically comprises the following sub-steps:
Step 1.1: register personal mailboxes on official websites;
Wherein the official websites mainly include NetEase, Sohu, Sina, Google and QQ;
Step 1.2: collect all spam images and regular email images from the inboxes of the personal mailboxes registered in step 1.1 and build an email image database;
Step 1.3: according to definition 1 and definition 2, that is, the definitions of image spam email and image-type regular email, separate the contents of the email image database built in step 1.2 into image spam email and image-type regular email and label them, forming two data sets: spam images and regular email images;
Spam images and regular email images are collectively referred to as email images;
Wherein X% of the spam images and X% of the regular email images form the training set, and the remaining Y% of the spam images and Y% of the regular email images form the test set, with X% + Y% = 1;
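The per-class split described in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_dataset`, the file names, the shuffling and the fixed seed are all assumptions, and the default fraction of 0.8 is the 80%/20% split used in the embodiment.

```python
import random

def split_dataset(spam_images, regular_images, train_fraction=0.8, seed=0):
    """Split each labelled list separately so both classes keep the same X%/Y% ratio."""
    rng = random.Random(seed)  # fixed seed (assumption) so the split is repeatable
    train, test = [], []
    for label, images in (("spam", spam_images), ("regular", regular_images)):
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(img, label) for img in shuffled[:cut]]
        test += [(img, label) for img in shuffled[cut:]]
    return train, test

# Hypothetical file names, used only to show the call.
train_set, test_set = split_dataset(
    [f"spam_{i}.png" for i in range(10)],
    [f"regular_{i}.png" for i in range(20)],
)
```

Splitting the two databases separately, rather than pooling them first, keeps the spam-to-regular ratio identical in the training and test sets.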
Step 2: analyze the image features of the images in the training set output by step 1, extract the color, texture and shape features of each image, and select through comparative experiments the image features and classifiers suited to image classification for classifying spam and regular email; specifically comprising the following sub-steps:
Step 2.1: analyze experimentally the color features (HSV color histogram and color moments), the texture features and the shape features of the images, and extract the corresponding feature values;
Wherein the HSV color histogram comprises the color histograms of the H channel, the S channel and the V channel;
Step 2.1 comprises the following sub-steps:
Step 2.1.1: divide the color space into several subintervals, each of which is one bin of the histogram; the value in each bin is a statistic computed from the image's color data; build the histograms and convert them to a one-dimensional color histogram, producing a one-dimensional vector;
Wherein dividing the color space means quantizing the values of the color space and counting the number of pixels whose color falls in each bin, which yields the color histogram; the values of the V, H and S channels are quantized by dividing each channel's value range into equal parts;
Wherein, when building the histogram, the lightness information of the image, that is, the value of the V channel, is not used; only the H channel and the S channel are counted, specifically comprising the following sub-steps:
Step 2.1.1A: divide the values of the H channel and the S channel into levels, which is equivalent to building a histogram over a given interval range for each of the two channels;
Wherein the data of the H and S channels are fairly widely distributed: H channel values lie between 0 and 360 and S channel values lie between 0 and 1;
Step 2.1.1B: merge the H channel and S channel histograms obtained in step 2.1.1A into a one-dimensional color histogram representation;
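Steps 2.1.1A and 2.1.1B can be sketched in plain Python as follows. The text does not say how the merged histogram is laid out, so this sketch assumes that 8 equal-width H bins and 8 equal-width S bins are concatenated into a 16-dimensional vector; the function name `hs_histogram`, the bin counts and the normalization by pixel count are illustrative assumptions.

```python
import colorsys

def hs_histogram(rgb_pixels, bins_per_channel=8):
    """Concatenated H and S histograms (V is ignored), normalized by pixel count."""
    h_hist = [0] * bins_per_channel
    s_hist = [0] * bins_per_channel
    for r, g, b in rgb_pixels:
        # colorsys returns h, s and v in [0, 1]; h * 360 would give degrees.
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # min() keeps a value of exactly 1.0 inside the last bin.
        h_hist[min(int(h * bins_per_channel), bins_per_channel - 1)] += 1
        s_hist[min(int(s * bins_per_channel), bins_per_channel - 1)] += 1
    n = len(rgb_pixels)
    return [count / n for count in h_hist + s_hist]

# Four sample pixels: pure red, green, blue and a mid gray.
pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]
feature = hs_histogram(pixels)
```

The first 8 entries describe hue and the last 8 describe saturation; each half sums to 1, so images of different sizes produce comparable vectors.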
Wherein color moments are a lightweight, quickly computed feature that represents color distribution. Representing an image with color moments requires only 9 components. Color moments can be used in both the HSV and RGB color spaces, because each has 3 color components and only 3 low-order moments need to be computed per component: the first moment is the mean of the image pixels, the second moment is the variance of the image pixels, and the third moment is the skewness of the image pixels, which together represent the color distribution of the image fairly completely;
Wherein extracting color moments mainly involves the following three steps:
Step 2.1.1C: convert the spam images and regular email images from RGB to HSV, and compute the mean, variance and skewness of the HSV image data;
Step 2.1.1D: normalize the mean, variance and skewness obtained in step 2.1.1C to obtain the normalized data;
Step 2.1.1E: finally, convert the normalized data to vector form and concatenate them into a one-dimensional vector;
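A minimal sketch of the 9-component color moment vector described above, under stated assumptions: the function name `color_moments` is hypothetical, the third moment is taken as the signed cube root of the third central moment (one common convention, not confirmed by the text), and the only normalization shown is the [0, 1] scaling that `colorsys` already applies, since the text does not specify the normalization of step 2.1.1D.

```python
import colorsys
import math

def color_moments(rgb_pixels):
    """Return 9 values: mean, variance and skewness for each of H, S and V."""
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in rgb_pixels]
    n = len(hsv)
    feature = []
    for channel in range(3):
        values = [p[channel] for p in hsv]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root of the third central moment (assumed convention).
        skewness = math.copysign(abs(third) ** (1.0 / 3.0), third)
        feature += [mean, variance, skewness]
    return feature

# One black and one white pixel: only the V channel varies.
moments = color_moments([(0, 0, 0), (255, 255, 255)])
```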
Step 2.1.2: extract the texture features of the image: first convert the true-color image to a grayscale image, then compress the gray levels, compute gray-level co-occurrence matrices, and compute the mean and standard deviation of four quantities derived from them (energy, entropy, moment of inertia and correlation), so that the texture of an image is represented by 8-dimensional data;
Wherein the true-color image is the email image;
This mainly comprises the following sub-steps:
Step 2.1.2A: convert the true-color image to a grayscale image and use the statistical method to extract gray-level co-occurrence matrices as the texture features of the email image. Specifically, gray-level co-occurrence matrices are built along the horizontal, vertical, diagonal and anti-diagonal directions of the image, that is, at angles of 0°, 45°, 90° and 135°. For a pixel (x, y) of the email image and another pixel (x+a, y+b) offset from it, let the gray values of the pair be (i, j). Moving the point (x, y) over the email image yields different (i, j) values; with L gray levels (L = 256), there are L² combinations of i and j. The number of occurrences of each (i, j) value is counted and the counts are normalized to probabilities P_ij; the resulting L×L square matrix [P_ij] is the gray-level co-occurrence matrix;
Step 2.1.2B: compress the grayscale image produced in step 2.1.2A: the gray values lie in [0, 255], and this range is divided into 16 levels, giving a compressed grayscale image;
Step 2.1.2C: compute four co-occurrence matrices P from the compressed grayscale image output by step 2.1.2B;
Wherein the distance is 1 and the angles are 0°, 45°, 90° and 135°;
Step 2.1.2D: normalize each of the four gray-level co-occurrence matrices generated in step 2.1.2C, compute the energy, entropy, moment of inertia and correlation of each normalized matrix, and then compute the mean and standard deviation of each of these four quantities, giving 8-dimensional data in total that represent the texture of the image;
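Steps 2.1.2B to 2.1.2D can be sketched in plain Python as below. The names `glcm`, `glcm_stats` and `texture_feature` are hypothetical, the co-occurrence matrices are left non-symmetric, the entropy uses the natural logarithm, and the output order (four means followed by four standard deviations) is an assumption, since the text does not fix any of these details.

```python
import math

# Offsets (row step, column step) at distance 1 for 0°, 45°, 90° and 135°.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

def glcm(gray, levels, d_row, d_col):
    """Normalized co-occurrence matrix of a 2-D list of gray levels in [0, levels)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(gray), len(gray[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[gray[r][c]][gray[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]

def glcm_stats(p):
    """Energy, entropy, moment of inertia and correlation of one normalized matrix."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    inertia = sum((i - j) ** 2 * v for i, j, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mu_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mu_j) ** 2 * v for _, j, v in cells)
    denom = math.sqrt(var_i * var_j)
    cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
    return [energy, entropy, inertia, cov / denom if denom else 0.0]

def texture_feature(image_0_255, levels=16):
    """8-D vector: means, then standard deviations, of the 4 statistics over 4 angles."""
    # Compress gray values from [0, 255] to `levels` equal-width levels.
    gray = [[px * levels // 256 for px in row] for row in image_0_255]
    stats = [glcm_stats(glcm(gray, levels, dr, dc)) for dr, dc in OFFSETS]
    means = [sum(s[k] for s in stats) / 4 for k in range(4)]
    stds = [math.sqrt(sum((s[k] - means[k]) ** 2 for s in stats) / 4)
            for k in range(4)]
    return means + stds

flat = texture_feature([[100] * 4 for _ in range(4)])   # uniform image
checker = texture_feature([[0, 255], [255, 0]])         # 2x2 checkerboard
```

A uniform image concentrates all probability in one matrix cell, so its energy is 1 and its inertia is 0, while the checkerboard has high inertia along the horizontal and vertical directions only.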
Step 2.1.3: extract the overall contour features of specific objects in the email image and the region features of the email image using the shape invariant moment method, generating the shape features of the email image from Hu invariant moments;
Wherein extracting shape features mainly involves the following three steps:
Step 2.1.3A: define the function representing the email image as f(x, y);
Step 2.1.3B: define the standard moments and the central moments of the email image;
Step 2.1.3C: finally, construct the Hu invariant moments from the normalized second-order and third-order central moments, compute the 7 invariant moments of each email image, and concatenate these 7 invariant moments into a one-dimensional vector, which is the shape feature of the email image;
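The three steps above can be sketched as follows, using the standard definitions of Hu's seven invariants, which the text names but does not write out. The function name `hu_moments` and the small test shapes are illustrative assumptions; the image is treated as a 2-D list of non-negative intensities f(x, y).

```python
def hu_moments(image):
    """Seven Hu invariants of a 2-D list of non-negative intensities f(x, y)."""
    points = [(x, y, v) for y, row in enumerate(image)
              for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in points)
    cx = sum(x * v for x, _, v in points) / m00   # centroid from raw moments
    cy = sum(y * v for _, y, v in points) / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by m00^(1 + (p + q) / 2).
        mu = sum((x - cx) ** p * (y - cy) ** q * v for x, y, v in points)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]

# A small L-shaped object and the same object translated within a 6x6 grid.
shape = [[0] * 6 for _ in range(6)]
for x, y in [(0, 0), (1, 0), (0, 1), (0, 2)]:
    shape[y][x] = 1
shifted = [[0] * 6 for _ in range(6)]
for x, y in [(2, 3), (3, 3), (2, 4), (2, 5)]:
    shifted[y][x] = 1

hu_a = hu_moments(shape)
hu_b = hu_moments(shifted)
```

Because the moments are taken about the centroid, `hu_a` and `hu_b` agree up to floating-point error, which is the translation invariance that makes these values usable as a shape feature.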
Step 2.2: select through comparative experiments the image features and classifiers suited to image classification: the HSV color histogram feature and the texture feature, which give higher classification precision and recall, are chosen, and spam and regular email are classified with the K-NN classifier and the ensemble learning classifier;
Step 2.2.1: input the color features, texture features and shape features separately into the K-NN algorithm, the naive Bayes algorithm, the ensemble learning algorithm, the discriminant analysis algorithm, the SVM algorithm and the random forest algorithm, giving six groups of experiments, and classify the email images with the algorithms that prove most stable in the experimental results;
Wherein the six groups of experiments measure the precision and recall of email image classification for the color moments, the HSV color histogram, the texture features and the shape features, and the six groups are also compared with one another as a whole. The results show that, among the image features, the HSV color histogram feature and the texture feature perform best and most stably, and that, among the classifiers, the K-NN classifier and the ensemble learning classifier perform best and most stably. Accordingly, the HSV color histogram and the texture feature are taken as the main image features to apply, and the K-NN classifier and the ensemble learning classifier as the main classifiers to apply;
Wherein the ensemble learning classifier trains multiple individual classifiers on the training set and iteratively optimizes the combination of these independently trained classifiers until a strong classifier is obtained; specifically comprising the following four sub-steps:
Step 2.2.1A: assign a weight to the classification data of each individual classifier;
Step 2.2.1B: run the individual classifiers on the training set and obtain the classification accuracy of each individual classifier under the current configuration;
Step 2.2.1C: adjust the weights: raise the weights of the samples classified correctly in the previous round and lower the weights of the samples classified incorrectly in the previous round;
Step 2.2.1D: repeat steps 2.2.1B and 2.2.1C until the difference in accuracy between two successive rounds converges to a preset value;
Step 2.2.2: three groups of experiments further determine the average classification accuracy of the various classifiers when the HSV color histogram has 16, 32 and 64 dimensions; the results show that every classifier has its highest average classification accuracy at 16 dimensions, so the HSV color histogram feature used for classification is fixed at 16 dimensions;
Step 2.2.3: three further groups of experiments determine the average classification accuracy of the K-NN classifier for K = 3, 5, 7 and 9 when the HSV color histogram has 16, 32 and 64 dimensions; classification accuracy is highest at K = 5;
Step 2.2.4: finally, determine the average classification accuracy of the K-NN classifier on the texture feature for K = 3, 5, 7 and 9; classification accuracy is again highest at K = 5;
Based on the above experimental results, the image features selected for image classification are the 16-dimensional HSV color histogram feature and the texture feature, and the classifiers selected are the K-NN classifier with K = 5 and the ensemble learning classifier;
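To make the role of K concrete, here is a minimal plain K-NN sketch with K = 5. It assumes Euclidean distance and majority voting, neither of which is stated in the text, and it does not reproduce the rough set attribute reduction mentioned in step 3; `knn_predict` and the toy 2-D data are hypothetical stand-ins for real feature vectors. It relies on `math.dist`, available from Python 3.8.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=5):
    """Majority label among the k training vectors nearest to the query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda idx: math.dist(train_vectors[idx], query))
    votes = Counter(train_labels[idx] for idx in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors standing in for histogram or texture features.
vectors = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["spam"] * 4 + ["regular"] * 4

near_spam = knn_predict(vectors, labels, (0.5, 0.5))
near_regular = knn_predict(vectors, labels, (5.5, 5.5))
```

An odd K such as 5 avoids ties between the two classes, which is one practical reason the candidate values in the experiments (3, 5, 7 and 9) are all odd.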
Step 3: input the two email image features, the HSV color histogram feature and the texture feature, separately into the K-NN classifier based on rough set attribute reduction to obtain two classification results, then input the same two features separately into the ensemble learning classifier to obtain two more classification results, giving four combined classification results in total; by voting over the classification labels, run experiments on the test set, verify the results and evaluate their performance, ultimately improving the precision, recall and overall F value of image spam filtering; specifically comprising the following sub-steps:
Step 3.1: run verification experiments on the test set and evaluate the K-NN classifier and the ensemble learning classifier on precision, recall and F value;
Wherein the precision is calculated with formula (1):
Wherein Precision is the precision, which reflects the filtering system's ability to identify spam correctly; A is the number of spam emails correctly classified, and B is the number of non-spam emails misjudged as spam;
The recall is calculated with formula (2):
Wherein Recall is the recall, which reflects the filtering system's ability to find spam; A is the number of spam emails correctly classified, and C is the number of spam emails misjudged as non-spam;
The misjudgment rate is calculated with formula (3):
Wherein FailureRate is the misjudgment rate, which indicates the probability that a non-spam email is judged to be spam; A is the number of spam emails correctly classified, and B is the number of non-spam emails misjudged as spam;
The F value is calculated with formula (4):
Wherein the F value is a balanced index combining recall and precision and reflects the overall effectiveness of spam filtering;
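The bodies of formulas (1) to (4) do not appear in this text, so the sketch below reconstructs them from the definitions of A, B and C and should be read as an assumption. Precision = A / (A + B) and Recall = A / (A + C) are the standard forms matching those definitions; FailureRate = B / (A + B) is a guess based on the fact that only A and B are named for formula (3); and the F value is taken as the usual harmonic mean 2 · Precision · Recall / (Precision + Recall). The function name `filter_metrics` is hypothetical.

```python
def filter_metrics(a, b, c):
    """a: spam classified correctly; b: non-spam judged spam; c: spam judged non-spam."""
    precision = a / (a + b)            # assumed form of formula (1)
    recall = a / (a + c)               # assumed form of formula (2)
    failure_rate = b / (a + b)         # assumed form of formula (3)
    f_value = 2 * precision * recall / (precision + recall)  # assumed formula (4)
    return precision, recall, failure_rate, f_value

# Illustrative counts only; these are not results reported in the patent.
precision, recall, failure_rate, f_value = filter_metrics(a=97, b=3, c=3)
```

Under these assumed forms the misjudgment rate equals 1 minus the precision, which is consistent with the text reporting 97% precision alongside a misjudgment rate of about 3%.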
Step 3.2: after evaluating the performance of the classifiers, classify each email image under test as follows: the HSV color histogram feature passed through the K-NN classifier gives classification result 1; the HSV color histogram feature passed through the ensemble learning classifier gives classification result 2; the texture feature passed through the K-NN classifier gives classification result 3; and the texture feature passed through the ensemble learning classifier gives classification result 4. A label vote is then taken over the four classification results: if more than 2 of them judge the email to be spam, the email is finally judged to be spam.
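The label vote in step 3.2 can be sketched in a few lines. The function name `vote` and the label strings are illustrative assumptions; the threshold follows the text, which requires more than 2 of the 4 results to say spam.

```python
def vote(results, threshold=2):
    """Return 'spam' when more than `threshold` of the results say 'spam'."""
    spam_votes = sum(1 for r in results if r == "spam")
    return "spam" if spam_votes > threshold else "regular"

# Results 1 to 4: histogram/K-NN, histogram/ensemble, texture/K-NN, texture/ensemble.
decision_majority = vote(["spam", "spam", "spam", "regular"])
decision_tie = vote(["spam", "regular", "spam", "regular"])
```

Because the rule is strictly "more than 2", a 2-to-2 tie is treated as regular email, which biases the filter against discarding legitimate messages.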
Beneficial effects
Compared with the prior art, the image spam filtering method based on machine learning has the following beneficial effects:
1. The precision, recall and F value of image spam filtering are raised to 97% simultaneously, and the misjudgment rate is reduced to below 3%;
2. Machine learning algorithms from artificial intelligence are applied in combination to form a combined filtering method, establishing a new, reliable and fairly accurate technical route for solving the image spam filtering problem.
Description of the drawings
Fig. 1 is a structural diagram of the image spam filtering method based on machine learning of the invention;
Fig. 2 is a schematic diagram of the implementation flow of the combined filtering module based on result-label voting in Fig. 1;
Fig. 3 is a schematic diagram of the performance of the image spam filtering method based on machine learning of the invention.
Detailed description of the embodiments
The invention is further described in detail below with reference to the drawings and an embodiment.
Embodiment 1
This embodiment describes a specific implementation of the image spam filtering method based on machine learning of the invention. Fig. 1 is the structural diagram of the invention and of this embodiment, and Fig. 2 is a schematic diagram illustrating the specific implementation flow of the combined filtering module based on result-label voting in Fig. 1.
As can be seen from Fig. 1, the invention and this embodiment are implemented in the following steps:
Step A: apply machine learning algorithms from artificial intelligence to image spam filtering to obtain an effective filtering model and filtering method, specifically comprising the following sub-steps:
Step A.1: build the email image database, forming the training set image database and the test set image database for machine learning;
In this embodiment, 80% of the spam image database and 80% of the regular email image database are used as the training set, and the remaining 20% is used as the test set;
Step A.2: extract the color features (HSV color histogram and color moments), texture features and shape features of the images, identify among these basic image features the HSV color histogram feature and the texture feature as suited to spam filtering, and form a feature dictionary;
Step A.3: apply the training set image feature data to machine learning algorithms such as the K-NN algorithm, the naive Bayes algorithm, the discriminant analysis algorithm, the SVM algorithm and the random forest algorithm to construct the individual classifiers, and evaluate the precision and recall of each classifier one by one; on the basis of this evaluation, the classifiers selected as best suited to image spam filtering are the K-NN classifier based on rough set attribute reduction and the ensemble learning classifier, the latter being an algorithm that combines the strengths of the individual algorithms;
Step A.4: combine each of the two image features determined in step A.2 (HSV color histogram feature and texture feature) with each of the two classifiers determined in step A.3 (K-NN classifier and ensemble learning classifier) to form a combined filtering model based on result-label voting, filter the email images under test, and obtain the final email filtering result.
Step B: implement the combined filter based on result-label voting as shown in Fig. 2 to obtain image spam classification results with higher discrimination accuracy, specifically comprising the following sub-steps:
Step B.1: first extract the HSV color histogram feature and the texture feature of the email image under test; pass the HSV color histogram feature through the K-NN classifier to obtain classification result 1; pass the HSV color histogram feature through the ensemble learning classifier to obtain classification result 2; pass the texture feature through the K-NN classifier to obtain classification result 3; pass the texture feature through the ensemble learning classifier to obtain classification result 4; take a label vote over these four classification results, and the label receiving more than 2 votes is the final classification result, which completes the classification;
Step B.2: through repeated filtering tests on email images under test, with a 16-dimensional HSV color histogram and K = 5 for the K-NN classifier, the performance of the image spam filtering method based on machine learning shown in Fig. 3 is obtained; the performance chart shows that the method raises the precision, recall and F value of image spam filtering to 97% simultaneously and reduces the misjudgment rate to below 3%.
The above is a preferred embodiment of the invention, and the invention should not be limited to the content disclosed in the embodiment and the drawings. Any equivalent or modification made without departing from the spirit disclosed by the invention falls within the scope of protection of the invention.

Claims (1)

1. An image spam filtering method based on machine learning, characterized in that the relevant definitions in the method are as follows:
Definition 1. Image spam: image mail, composed in image form, that the recipient did not request or agree to receive and cannot refuse, and that carries any of the following kinds of content: biased information with improper political motives, false information or concealed fraud, information related to pornography, gambling or drugs, or advertising. Such mail is referred to as image spam;
Definition 2. Image-type regular mail: mail containing images that the recipient intends to receive and read, that has practical significance and value to the recipient, and that contains no harmful information is referred to as image-type regular mail;
Image-type regular mail and image spam are collectively referred to as image-type mail;
The method comprises the following steps:
Step 1: Collect a large number of spam images and regular mail images through Internet channels and from mailbox recipients, obtaining a comprehensive spam image database and a regular mail image database, and generate a training set and a test set from these two databases;
Wherein, X% of the collected spam image database and X% of the regular mail image database are used as the training set; Y% of the spam image database and Y% of the regular mail image database are used as the test set; X% and Y% sum to 1;
Step 1 specifically includes the following sub-steps:
Step 1.1: Register personal mailboxes on the official websites of mail providers;
Wherein, the official websites mainly include Netease, Sohu, Sina, google and QQ;
Step 1.2: Collect all spam images and regular mail images from the inboxes of the personal mailboxes registered in step 1.1 and establish a mail image database;
Step 1.3: According to Definition 1 and Definition 2, i.e. the definition of image spam and the definition of image-type regular mail, distinguish image spam from image regular mail in the mail image database established in step 1.2 and label them, forming two data sets: spam images and regular mail images;
Spam images and regular mail images are collectively referred to as mail images;
Wherein, X% of the spam images and X% of the regular mail images form the training set, and the remaining Y% of the spam images and Y% of the regular mail images form the test set, with X% + Y% = 1;
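The per-class X%/Y% split of step 1 might look like the sketch below. The claim does not fix X, so the 70% default, the shuffling, the fixed seed and the name `split_dataset` are all illustrative assumptions.

```python
import random


def split_dataset(spam_images, regular_images, train_fraction=0.7, seed=0):
    """Split each class separately so both sets keep the same class ratio.

    train_fraction plays the role of X% (assumed 0.7 here); the remaining
    1 - train_fraction plays the role of Y%. Labels: 1 = spam, 0 = regular.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, items in ((1, spam_images), (0, regular_images)):
        items = list(items)
        rng.shuffle(items)  # assumption: random assignment within each class
        cut = round(len(items) * train_fraction)
        train += [(item, label) for item in items[:cut]]
        test += [(item, label) for item in items[cut:]]
    return train, test


# Hypothetical usage with integers standing in for image records.
train_set, test_set = split_dataset(range(10), range(100, 120))
```

Splitting each class on its own keeps the spam-to-regular ratio identical in the training set and the test set, which matches the wording that X% is taken from each database.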
Step 2: Analyze the image features of the images in the training set output by step 1, extracting the color features, texture features and shape features of the images; then select, by experimental comparison, the image features and classifiers suited to image classification for classifying spam and regular mail. This specifically includes the following sub-steps:
Step 2.1: Through experiments, analyze the color features of the image (the HSV color histogram and the color moments), the texture features of the image and the shape features of the image, and extract the relevant feature values;
Wherein, the HSV color histogram includes the color histogram of the H channel, the color histogram of the S channel and the color histogram of the V channel;
Step 2.1 in turn includes the following sub-steps:
Step 2.1.1: Divide the color space into several subintervals, which are the bins of the histogram; the value in each bin is a feature statistic computed from the image color data. Build the histogram and convert it into a one-dimensional color histogram, generating a one-dimensional vector;
Wherein, dividing the color space specifically means: quantize the values of the color space and count the number of pixels whose color falls in each bin to obtain the color histogram; then quantize the values of the V channel, H channel and S channel in the color histogram, i.e. divide the value range of each channel into equal parts;
Wherein, when the histogram is built, the lightness information of the image, i.e. the value of the V channel, is not used; only the H channel and the S channel are used for the statistics. This specifically includes the following sub-steps:
Step 2.1.1A: Divide the data of the H channel and of the S channel into levels; this is equivalent to building, for the H channel and the S channel, histograms over given interval ranges;
Wherein, the data of the H channel and the S channel are rather dispersed: H channel values lie between 0 and 360, and S channel values lie between 0 and 1;
Step 2.1.1B: Merge the H-channel and S-channel histograms obtained in step 2.1.1A to obtain the one-dimensional color histogram representation;
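Steps 2.1.1A and 2.1.1B can be illustrated with the standard-library `colorsys` module. This is a sketch under assumptions: the text does not say how a 16-dimensional histogram is divided between the two channels, so 8 bins for H and 8 for S are assumed; `colorsys` scales H to the range 0 to 1 rather than 0 to 360; and dividing the counts by the pixel count is an added normalization.

```python
import colorsys


def hs_histogram(pixels, h_bins=8, s_bins=8):
    """Build a merged H + S histogram from (r, g, b) pixels with values 0..255.

    The V channel is ignored, as in step 2.1.1. Returns a list of
    h_bins + s_bins values; each half sums to 1 (assumed normalization).
    """
    h_hist = [0] * h_bins
    s_hist = [0] * s_bins
    count = 0
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Equal-width quantization; min() keeps the value 1.0 in the last bin.
        h_hist[min(int(h * h_bins), h_bins - 1)] += 1
        s_hist[min(int(s * s_bins), s_bins - 1)] += 1
        count += 1
    if count == 0:
        return [0.0] * (h_bins + s_bins)
    # Merge the two channel histograms into one one-dimensional vector.
    return [c / count for c in h_hist + s_hist]


# Two pixels: pure red and pure green.
feature = hs_histogram([(255, 0, 0), (0, 255, 0)])
```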
Wherein, the color moment is a lightweight, quickly computed feature that represents color distribution. Representing image information with color moments requires only 9 components. Color moments apply to both the HSV and the RGB channels, because each contains 3 color components, and only 3 low-order moments need to be computed for each component: the first moment is the mean of the image pixels, the second moment is their variance, and the third moment is their skewness, which together represent the color distribution of the image fairly comprehensively;
Wherein, extracting the color moments mainly involves the following three steps:
Step 2.1.1C: Convert the spam images and regular mail images from the RGB channels to the HSV channels, and compute the mean, variance and skewness of the image data in the HSV channels;
Step 2.1.1D: Normalize the mean, variance and skewness obtained in step 2.1.1C to obtain the normalized data;
Step 2.1.1E: Finally, convert the normalized data into vector form and concatenate them into a one-dimensional vector;
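A minimal sketch of the 9-component color moment vector follows. Several details are assumptions, since the claim does not specify them: the skewness is taken as the signed cube root of the third central moment (a common convention for color moments), and the normalization of step 2.1.1D is approximated by working on channel values that `colorsys` already scales to the range 0 to 1.

```python
import colorsys
import math


def color_moments(pixels):
    """Return [mean, variance, skewness] for H, S and V: 9 values in total.

    pixels: non-empty list of (r, g, b) tuples with values 0..255.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in pixels]
    n = len(hsv)
    features = []
    for channel in range(3):
        values = [p[channel] for p in hsv]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the sign of the third central moment.
        skewness = math.copysign(abs(third) ** (1.0 / 3.0), third)
        features += [mean, variance, skewness]
    return features


# A uniformly gray patch: only the V mean is non-zero.
moments = color_moments([(128, 128, 128)] * 4)
```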
Step 2.1.2: Extract the texture features of the image: first convert the true-color image into a grayscale image, then compress the grayscale image, compute the gray-level co-occurrence matrices, and compute the mean and standard deviation of four quantities derived from them (energy, entropy, moment of inertia and correlation), so that the texture features of the image are represented by 8-dimensional data;
Wherein, the true-color image refers to the mail image; step 2.1.2 mainly includes the following sub-steps:
Step 2.1.2A: Convert the true-color image into a grayscale image and use the statistical method to extract the gray-level co-occurrence matrix as the texture feature of the mail image. Specifically: gray-level co-occurrence matrices of the image are built in the horizontal, vertical, diagonal and anti-diagonal directions, with direction angles of 0°, 45°, 90° and 135°. For a pixel (x, y) in the mail image and another pixel (x+a, y+b) offset from it, the gray values of the point pair are (i, j). Moving the point (x, y) over the mail image yields different (i, j) values. With the number of gray levels L taken as 256, there are $L^2$ combinations of i and j. Count the number of occurrences of each (i, j) value and normalize the counts to occurrence probabilities $P_{ij}$; the resulting square matrix $[P_{ij}]_{L \times L}$ is the gray-level co-occurrence matrix;
Step 2.1.2B: Compress the grayscale image produced in step 2.1.2A: the gray values of the image lie in the interval [0, 255], and this interval is divided into 16 levels, giving the compressed grayscale image;
Step 2.1.2C: Compute four co-occurrence matrices P from the compressed grayscale image output by step 2.1.2B;
Wherein, the distance is taken as 1 and the angles are 0°, 45°, 90° and 135°;
Step 2.1.2D: Normalize each of the four gray-level co-occurrence matrices generated in step 2.1.2C; then compute the energy, entropy, moment of inertia and correlation of each normalized matrix, and compute the mean and standard deviation of each of these four quantities, giving 8-dimensional data in total to represent the texture features of the image;
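Steps 2.1.2B to 2.1.2D can be sketched in plain Python as below. The function names are hypothetical, and a few choices are assumptions not stated in the text: the co-occurrence matrices are not symmetrized, the entropy uses the natural logarithm, the standard deviation is the population form, and the correlation is set to 0 when a marginal variance is 0.

```python
import math

# (row, column) offsets for 0°, 45°, 90° and 135° at distance 1.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]


def glcm(gray, levels, d_row, d_col):
    """Normalized gray-level co-occurrence matrix for one offset."""
    matrix = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(gray), len(gray[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[gray[r][c]][gray[r2][c2]] += 1
                total += 1
    if total == 0:
        return matrix
    return [[v / total for v in row] for row in matrix]


def glcm_stats(p):
    """Energy, entropy, moment of inertia and correlation of one matrix."""
    idx = range(len(p))
    energy = sum(v * v for row in p for v in row)
    entropy = -sum(v * math.log(v) for row in p for v in row if v > 0)
    inertia = sum((i - j) ** 2 * p[i][j] for i in idx for j in idx)
    mu_i = sum(i * p[i][j] for i in idx for j in idx)
    mu_j = sum(j * p[i][j] for i in idx for j in idx)
    var_i = sum((i - mu_i) ** 2 * p[i][j] for i in idx for j in idx)
    var_j = sum((j - mu_j) ** 2 * p[i][j] for i in idx for j in idx)
    cov = sum((i - mu_i) * (j - mu_j) * p[i][j] for i in idx for j in idx)
    corr = cov / math.sqrt(var_i * var_j) if var_i > 0 and var_j > 0 else 0.0
    return energy, entropy, inertia, corr


def texture_features(gray256, levels=16):
    """8-dimensional texture vector: mean and std of the four statistics."""
    # Step 2.1.2B: compress gray values 0..255 into `levels` levels.
    quantized = [[v * levels // 256 for v in row] for row in gray256]
    stats = [glcm_stats(glcm(quantized, levels, dr, dc)) for dr, dc in OFFSETS]
    features = []
    for k in range(4):
        values = [s[k] for s in stats]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        features += [mean, std]
    return features


# Hypothetical 4x4 grayscale patch with a single gray value.
flat_features = texture_features([[100] * 4 for _ in range(4)])
```

For a patch with a single gray value, every co-occurrence matrix has one entry equal to 1, so the energy is 1 and the other statistics are 0.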
Step 2.1.3: Using the shape invariant moment method, extract the overall contour features of specific targets in the mail image and the region features of the mail image, and use the Hu invariant moments to generate the shape features of the mail image;
Wherein, extracting the shape features mainly involves the following three steps:
Step 2.1.3A: Define the representative function of the mail image as f(x, y);
Step 2.1.3B: Then define the standard moments and the central moments of the mail image;
Step 2.1.3C: Finally, construct the Hu invariant moments from the normalized second-order and third-order central moments, compute the 7 invariant moments of each mail image, convert these 7 invariant moments into a one-dimensional vector and concatenate them to obtain the shape feature of the mail image;
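The seven Hu invariant moments of step 2.1.3C can be computed from the normalized central moments as in the sketch below. It uses the standard textbook definitions of the Hu moments; the function name `hu_moments` and the choice of a 2-D list of intensities as the image function f(x, y) are illustrative assumptions.

```python
def hu_moments(img):
    """Seven Hu invariant moments of a 2-D list of non-negative intensities."""
    cells = [(x, y, v) for y, row in enumerate(img)
             for x, v in enumerate(row) if v]
    m00 = float(sum(v for _, _, v in cells))
    if m00 == 0:
        return [0.0] * 7
    # Centroid from the first-order standard moments.
    x_bar = sum(x * v for x, _, v in cells) / m00
    y_bar = sum(y * v for _, y, v in cells) / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by m00 ** (1 + (p + q) / 2).
        mu = sum((x - x_bar) ** p * (y - y_bar) ** q * v for x, y, v in cells)
        return mu / m00 ** (1 + (p + q) / 2.0)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a * a - 3 * b * b) + d * b * (3 * a * a - b * b),
        (n20 - n02) * (a * a - b * b) + 4 * n11 * a * b,
        d * a * (a * a - 3 * b * b) - c * b * (3 * a * a - b * b),
    ]


def l_shape(offset_x, offset_y, size=10):
    """Hypothetical binary test image: an L-shaped blob at a given offset."""
    img = [[0] * size for _ in range(size)]
    for x, y in [(0, 0), (1, 0), (2, 0), (0, 1), (0, 2), (0, 3)]:
        img[y + offset_y][x + offset_x] = 1
    return img


shape_feature = hu_moments(l_shape(1, 1))
```

Because the moments are taken about the centroid, the same blob placed elsewhere in the image gives the same seven values up to floating-point rounding.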
Step 2.2: Select, by experimental comparison, the image features and classifiers suited to image classification: the HSV color histogram feature and the texture feature, which give a higher classification accuracy rate and recall rate, are chosen and used with the K-NN classifier and the ensemble learning classifier to classify spam and regular mail;
Step 2.2.1: Input the color features, texture features and shape features respectively into the K-NN algorithm, the naive Bayes algorithm, the ensemble learning algorithm, the discriminant analysis algorithm, the SVM algorithm and the random forest algorithm in six groups of experiments, and classify mail images with the algorithms that prove most stable in the experimental results;
Wherein, the six groups of experiments measure the accuracy rate and recall rate of mail image classification for the color moments, the HSV color histogram, the texture features and the shape features respectively, and an overall longitudinal comparison is made across the six groups. The experimental results show that, among the image features, the HSV color histogram feature and the texture feature perform best and most stably; the longitudinal comparison shows that, among the classifiers, the K-NN classifier and the ensemble learning classifier perform best and most stably. Accordingly, the two image features best suited to spam filtering are the HSV color histogram and the texture features, and the two classifiers with the best classification effect are the K-NN classifier and the ensemble learning classifier;
Wherein, the ensemble learning classifier trains multiple individual classifiers on the training set and iteratively combines these independently trained classifiers in an optimized way until a strong classifier is obtained. This specifically includes the following four sub-steps:
Step 2.2.1A: Assign a weight to the classification data of each individual classifier;
Step 2.2.1B: Run the individual classifiers on the training set to obtain the classification accuracy rate of each individual classifier under the current structure;
Step 2.2.1C: Adjust the weights: raise the weights of the samples classified correctly in the previous round and lower the weights of the samples misclassified in the previous round;
Step 2.2.1D: Repeat step 2.2.1B and step 2.2.1C until the difference between the accuracy rates of two successive classifications converges to the desired value;
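The iterative reweighting loop of steps 2.2.1A to 2.2.1D resembles boosting. The sketch below swaps in standard AdaBoost with decision stumps as a stand-in, and it should be read as an assumption rather than the claimed procedure: standard AdaBoost raises the weights of misclassified samples (the opposite direction to the wording of step 2.2.1C), and it runs for a fixed number of rounds instead of checking convergence of the accuracy rate. All names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Best one-feature threshold rule under sample weights w (labels +1/-1)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best


def stump_predict(stump, row):
    _err, f, t, sign = stump
    return sign if row[f] <= t else -sign


def adaboost(X, y, rounds=10):
    """Standard AdaBoost: returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n  # initial weights (compare step 2.2.1A)
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)  # run a weak classifier (step 2.2.1B)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Reweight (compare step 2.2.1C): misclassified samples gain weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def ensemble_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Hypothetical one-dimensional features; +1 = spam, -1 = regular mail.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [1, 1, -1, -1]
model = adaboost(X, y, rounds=5)
```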
Step 2.2.2: Three groups of experiments further determine the average classification accuracy rate of the various classifiers when the HSV color histogram has 16, 32 and 64 dimensions. The results show that each classifier achieves its highest average classification accuracy rate when the HSV color histogram has 16 dimensions, so the dimension of the HSV color histogram feature used for classification is set to 16;
Step 2.2.3: Three further groups of experiments determine the average classification accuracy rate of the K-NN classifier for K values of 3, 5, 7 and 9 when the HSV color histogram has 16, 32 and 64 dimensions; the classification accuracy rate is highest when K = 5;
Step 2.2.4: Finally, determine the average classification accuracy rate of the K-NN classifier on the texture features for K values of 3, 5, 7 and 9; the results likewise show that the classification accuracy rate is highest when K = 5;
Based on the above experimental results, the image features finally selected as suited to image classification are the 16-dimensional HSV color histogram feature and the texture feature, and the classifiers suited to image classification are the K-NN classifier with K = 5 and the ensemble learning classifier;
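A K-NN classifier with K = 5 can be sketched in a few lines. The Euclidean distance and the function name `knn_predict` are assumptions; the text does not name a distance measure, and the rough set attribute reduction mentioned in step 3 is not modeled here.

```python
import math
from collections import Counter


def knn_predict(train, query, k=5):
    """Majority label among the k training vectors closest to query.

    train: list of (feature_vector, label) pairs.
    Assumption: Euclidean distance (math.dist, Python 3.8+).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors.
train = [
    ((0.0, 0.0), "spam"), ((0.1, 0.0), "spam"), ((0.0, 0.1), "spam"),
    ((1.0, 1.0), "regular"), ((0.9, 1.0), "regular"), ((1.0, 0.9), "regular"),
]
label = knn_predict(train, (0.05, 0.05))
```

With K = 5 the query near the origin sees three spam neighbors and two regular ones, so the majority label is spam.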
Step 3: Input the two mail image features, the HSV color histogram feature and the texture feature, separately into the K-NN classifier based on rough set attribute reduction to obtain two classification results; then input the same two features separately into the ensemble learning classifier to obtain two more classification results, giving four combined classification results in total. Using classification-label voting, run experiments on the test set, verify the experimental results and analyze their performance, ultimately improving the accuracy rate, recall rate and overall performance (F value) of image spam filtering. This specifically includes the following sub-steps:
Step 3.1: Carry out experimental verification on the test set and evaluate the K-NN classifier and the ensemble learning classifier in terms of accuracy rate, recall rate and F value;
Wherein, the accuracy rate evaluation index is calculated with formula (1):
\[ \mathrm{Precision} = \frac{A}{A + B} \tag{1} \]
Wherein, Precision is the accuracy rate, which reflects the filtering system's ability to identify spam correctly; A is the number of spam mails classified correctly, and B is the number of non-spam mails misjudged as spam;
The recall rate evaluation index is calculated with formula (2):
\[ \mathrm{Recall} = \frac{A}{A + C} \tag{2} \]
Wherein, Recall is the recall rate, which reflects the filtering system's ability to find spam; A is the number of spam mails classified correctly, and C is the number of spam mails misjudged as non-spam;
The misjudgment rate (False Rate) evaluation index is calculated with formula (3):
Wherein, FailureRate is the misjudgment rate, i.e. the probability that non-spam mail is judged to be spam; A is the number of spam mails classified correctly, and B is the number of non-spam mails misjudged as spam;
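The F value evaluation index is calculated with formula (4):
\[ F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4} \]
Wherein, the F value is an overall balance index between the recall rate and the accuracy rate; it reflects the overall effect of spam filtering;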
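The evaluation indices can be computed from the counts A, B and C as sketched below. The formulas themselves do not survive in this text, so the code rests on assumptions: the accuracy rate and recall rate follow directly from the definitions of A, B and C; the F value is assumed to be the usual harmonic mean of the two; and the misjudgment rate is assumed to be B / (A + B), the only ratio that uses just the two symbols named for formula (3).

```python
def filter_metrics(a, b, c):
    """Evaluation indices from the counts defined in step 3.1.

    a: spam classified correctly
    b: non-spam misjudged as spam
    c: spam misjudged as non-spam
    """
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    # Assumption: misjudgment rate uses only A and B, as in the symbol list.
    failure_rate = b / (a + b) if a + b else 0.0
    # Assumption: F value is the harmonic mean of precision and recall.
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "failure_rate": failure_rate,
        "f_value": f_value,
    }


# Hypothetical counts: 97 spam caught, 3 false alarms, 3 spam missed.
metrics = filter_metrics(97, 3, 3)
```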
Step 3.2: After the performance of the classifiers has been evaluated, for the mail image under test the HSV color histogram feature is passed through the K-NN classifier to obtain classification result 1, the HSV color histogram feature is passed through the ensemble learning classifier to obtain classification result 2, the texture feature is passed through the K-NN classifier to obtain classification result 3, and the texture feature is passed through the ensemble learning classifier to obtain classification result 4. Label voting is then performed on the classification results; if more than 2 of the results judge the mail to be spam, the mail is finally determined to be spam.
CN201811053556.1A 2018-09-11 2018-09-11 Image spam filtering method based on machine learning Active CN109347719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811053556.1A CN109347719B (en) 2018-09-11 2018-09-11 Image spam filtering method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811053556.1A CN109347719B (en) 2018-09-11 2018-09-11 Image spam filtering method based on machine learning

Publications (2)

Publication Number Publication Date
CN109347719A true CN109347719A (en) 2019-02-15
CN109347719B CN109347719B (en) 2021-01-15

Family

ID=65305130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811053556.1A Active CN109347719B (en) 2018-09-11 2018-09-11 Image spam filtering method based on machine learning

Country Status (1)

Country Link
CN (1) CN109347719B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781812A (en) * 2019-10-24 2020-02-11 谷琛 Method for automatically identifying target object by security check instrument based on machine learning
CN111461199A (en) * 2020-03-30 2020-07-28 华南理工大学 Security attribute selection method based on distributed junk mail classified data
CN113768452A (en) * 2021-09-16 2021-12-10 重庆金山医疗技术研究院有限公司 Intelligent timing method and device for electronic endoscope
CN115424278A (en) * 2022-08-12 2022-12-02 中国电信股份有限公司 Mail detection method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520848A (en) * 2008-02-27 2009-09-02 中国科学院自动化研究所 Method for filtering image-based junk mails
CN103020645A (en) * 2013-01-06 2013-04-03 深圳市彩讯科技有限公司 System and method for junk picture recognition
WO2015054666A1 (en) * 2013-10-10 2015-04-16 Board Of Regents, The University Of Texas System Systems and methods for quantitative analysis of histopathology images using multi-classifier ensemble schemes

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781812A (en) * 2019-10-24 2020-02-11 谷琛 Method for automatically identifying target object by security check instrument based on machine learning
CN111461199A (en) * 2020-03-30 2020-07-28 华南理工大学 Security attribute selection method based on distributed junk mail classified data
CN111461199B (en) * 2020-03-30 2023-04-28 华南理工大学 Safety attribute selection method based on distributed junk mail classified data
CN113768452A (en) * 2021-09-16 2021-12-10 重庆金山医疗技术研究院有限公司 Intelligent timing method and device for electronic endoscope
CN115424278A (en) * 2022-08-12 2022-12-02 中国电信股份有限公司 Mail detection method and device and electronic equipment
CN115424278B (en) * 2022-08-12 2024-05-03 中国电信股份有限公司 Mail detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN109347719B (en) 2021-01-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant