CN106096005A - Spam filtering method and system based on deep learning - Google Patents
Spam filtering method and system based on deep learning
- Publication number
- CN106096005A, CN201610464120.6A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a spam filtering method and system based on deep learning. The method comprises: step A: processing mail samples to generate a first vector space model, and building a deep belief network; step B: processing a test mail to generate a second vector space model; step C: using the built deep belief network to detect the second vector space model; step D: outputting the detection result. By building a deep belief network and using it to detect test mail, the method improves the accuracy and stability of spam identification while saving the time and labor needed to label large numbers of samples.
Description
Technical field
The present invention relates to the field of spam filtering, and in particular to a spam filtering method and system based on deep learning.
Background art
With the rapid development of Internet technology, e-mail has become an indispensable part of people's lives, work, and study. It brings great convenience to our lives, but the trouble caused by spam is growing accordingly.
The core problem of mail filtering is how to use a known e-mail text data set to build a text classification model, and then use that model to determine the type of an e-mail so that spam can be filtered out. Commonly used algorithms include the k-nearest neighbor (KNN) algorithm, the naive Bayes algorithm, decision tree algorithms, and support vector machines. However, each of these algorithms has its own limitations.
For the naive Bayes algorithm, whatever probability model is chosen, the model can only compute the probability that a mail belongs to the spam class conditioned on a given text, and it presupposes that the features are pairwise independent. For the KNN algorithm, the choice of k is critical because it determines the correctness of the final classification, yet so far there is no good method for determining a reasonable value of k.
Spam filtering is in essence a binary classification problem, so traditional classification methods can serve the purpose, but their results are unsatisfactory. The method mainly used for mail filtering at present is rule-based filtering, which depends heavily on the rules: if the rules are well chosen, the filtering results are correspondingly good. But the characteristics of spam change constantly, which requires the rules to be adjusted constantly; this is passive and troublesome.
Therefore, the prior art still needs to be improved and developed.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a spam filtering method and system based on deep learning that can improve the accuracy and stability of spam filtering while saving the time and labor needed to label large numbers of samples.
The technical solution of the present invention is as follows:
A spam filtering method based on deep learning, wherein the method comprises:
Step A: processing mail samples to generate a first vector space model, and building a deep belief network;
Step B: processing a test mail to generate a second vector space model;
Step C: using the built deep belief network to detect the second vector space model;
Step D: outputting the detection result.
In the spam filtering method based on deep learning, step A specifically comprises:
Step A1: training on the mail samples;
Step A2: preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
Step A3: generating the first vector space model from the constructed feature set;
Step A4: building the deep belief network from the generated first vector space model.
In the spam filtering method based on deep learning, step A2 specifically comprises:
Step A21: segmenting the trained mail samples into words;
Step A22: constructing a dictionary from all the segmented entries;
Step A23: counting the word frequency of the entries that remain in the constructed dictionary after stop words are removed.
In the spam filtering method based on deep learning, step A3 specifically comprises:
Step A31: vectorizing all features in the constructed feature set and storing them in vector space form;
Step A32: normalizing the generated feature vectors.
In the spam filtering method based on deep learning, step A4 comprises:
Step A41: fully training the n-th RBM to obtain the weights of that RBM;
Step A42: fixing the weights and offsets of the n-th RBM, and using the states of its hidden neurons as the input vector of the next RBM;
Step A43: training the next RBM, until all RBMs have been trained.
A spam filtering system based on deep learning, wherein the system comprises:
a training module for processing mail samples to generate a first vector space model and building a deep belief network;
a test module for processing a test mail to generate a second vector space model;
a detection module for using the built deep belief network to detect the second vector space model;
an output module for outputting the detection result.
In the spam filtering system based on deep learning, the training module specifically comprises:
a training submodule for training on the mail samples;
a preprocessing submodule for preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
a model construction submodule for generating the first vector space model from the constructed feature set;
a DBN building submodule for building the deep belief network from the generated first vector space model.
In the spam filtering system based on deep learning, the preprocessing submodule specifically comprises:
a word segmentation unit for segmenting the trained mail samples into words;
a computing unit for computing the global factor corresponding to each segmented entry;
a dictionary construction unit for constructing a dictionary from all the segmented entries and the computed global factors;
a word frequency statistics unit for counting the word frequency of the entries that remain in the constructed dictionary after stop words are removed.
In the spam filtering system based on deep learning, the model construction submodule specifically comprises:
a feature processing unit for vectorizing all features in the constructed feature set and storing them in vector space form;
a normalization unit for normalizing the generated feature vectors.
In the spam filtering system based on deep learning, the DBN building submodule specifically comprises:
a training unit for fully training the n-th RBM to obtain the weights of that RBM;
an RBM processing unit for fixing the weights and offsets of the n-th RBM and using the states of its hidden neurons as the input vector of the next RBM.
By building a deep belief network and using it to detect test mail, the spam filtering method based on deep learning provided by the present invention improves the accuracy and stability of spam identification while saving the time and labor needed to label large numbers of samples.
Brief description of the drawings
Fig. 1 is a schematic main flow diagram of the spam filtering method based on deep learning of the present invention;
Fig. 2 is a schematic structural diagram of the spam filtering system based on deep learning of the present invention.
Detailed description of the invention
The present invention provides a spam filtering method and system based on deep learning. To make the purpose, technical solution, and effects of the invention clearer, the invention is described in more detail below with reference to the drawings and examples. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
The spam filtering method based on deep learning provided by the invention relies on the self-learning ability of the deep belief network and, combined with the advantages of big data, uses the large number of samples available on the network to improve classification ability. On the one hand, it can improve the accuracy and stability of spam filtering. On the other hand, the deep belief network is a semi-supervised learning model that can be trained on large-scale unlabeled sample sets, which saves the time and labor of labeling large numbers of samples compared with traditional supervised learning models.
As shown in Fig. 1, a spam filtering method based on deep learning comprises:
S100: processing mail samples to generate a first vector space model, and building a deep belief network;
In the embodiment of the present invention, the mail samples are preferably a training mail set, that is, a set consisting of a large number of mails of known class, also called the training set. The characteristics of each mail class can be summarized by training on the mail samples.
The concept of deep learning comes from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of the data.
The vector space model (VSM) reduces the processing of text content to vector operations in a vector space and expresses semantic similarity as spatial similarity, which is intuitive and easy to understand. Once documents are represented as vectors in the document space, the similarity between documents can be measured by computing the similarity between their vectors.
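The text does not name a particular similarity measure. As an illustration only, the sketch below uses cosine similarity, a common choice for vector space models; the function name `cosine_similarity` and the sample vectors are assumptions made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document vectors over a three-term vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, so maximally similar
doc_c = [0, 0, 3]   # shares no terms with doc_a

similar = cosine_similarity(doc_a, doc_b)
unrelated = cosine_similarity(doc_a, doc_c)
```

Documents that use the same terms in the same proportions score close to 1, and documents with no terms in common score 0.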
In the fields of information filtering and retrieval, the vector space model is commonly used to represent text for ease of computation. This model first selects from the text the feature items that are representative of it.
A deep belief network (DBN) can serve both as a generative model and as a discriminative model. By training the weights between its neurons, the whole neural network is made to generate the training data with maximum probability. It can be used to recognize features, classify data, and even generate data.
A DBN consists of multiple layers of neurons, divided into visible neurons (visible units) and hidden neurons (hidden units, also called feature detectors). Visible units receive input, and hidden units extract features. The connections between the top two layers are undirected and form an associative memory; the other, lower layers are linked by directed connections between upper and lower layers. The bottom layer represents the data vector, with each neuron representing one dimension of the data vector.
In the embodiment of the present invention, a deep belief network composed of feedforward neural networks with a deep architecture is preferably used as the network model for training mail classification; it can approximate complex functions with relatively few parameters.
S200: processing a test mail to generate a second vector space model;
Processing the test mail means representing it as a vector space model, that is, representing a text (a mail) as an n-dimensional vector. Because natural text cannot be processed directly by the constructed classification algorithm, the text must first be converted into a form the classifier can recognize. Suppose the values of the n feature items of a document are w1, w2, ..., wn. Because they come from the same mail to be filtered, they are treated as a whole and form a feature vector d; that is, each text is regarded as a vector in n-dimensional space, written d(w1, w2, ..., wn), where wi is the weight of the i-th feature item and n is the number of feature items. A feature item can be a character, a word, a phrase, or a concept; words are preferred because they give higher classification accuracy. Text representation thus becomes: first segment the text into words, then represent the text with these words as the dimensions of the vector.
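The representation d(w1, w2, ..., wn) can be sketched as follows. This is a minimal illustration, assuming the text has already been segmented into tokens and that raw occurrence counts stand in for the weights wi; the names `to_vector` and `vocabulary` are hypothetical.

```python
def to_vector(tokens, vocabulary):
    """Map a tokenized text to d = (w1, ..., wn), one weight per vocabulary term.

    Assumption: the weight of a term is simply its number of occurrences.
    """
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical four-term vocabulary; each term is one dimension of the vector.
vocabulary = ["free", "meeting", "offer", "report"]
d = to_vector(["free", "offer", "free"], vocabulary)
```

Tokens that are not in the vocabulary are ignored, so every mail maps to a vector of the same length n.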
In the embodiment of the present invention, a document refers to a mail or a fragment of a mail, such as a paragraph, a group of sentences, or a sentence.
Weight is a relative concept defined with respect to a particular index. The weight of an index is its relative importance in the overall evaluation. Weighting distinguishes the importance of the individual evaluation indices, and the weights corresponding to a set of evaluation indices form a weight system.
S300: using the built deep belief network to detect the second vector space model;
Using the built deep belief network to detect the second vector space model means using the trained deep belief network to process the mail to be filtered and classify it, that is, to check whether it is spam or normal mail. This step can also be stated as: using the built deep belief network to classify the mail to be filtered, represented as the second vector space model, where the classes include spam and normal mail.
S400: outputting the detection result.
Outputting the detection result means outputting the result of the above steps, such as whether the filtered mail is spam or which class of the training mail set it belongs to, so that the recipient or the system knows the class of the mail; further processing may follow. For example, after confirmation by the recipient, the class or the sender address of the mail may be added to a blacklist, graylist, or whitelist.
By building a deep belief network and using it to detect test mail, the spam filtering method based on deep learning provided by the present invention improves the accuracy and stability of spam identification while saving the time and labor needed to label large numbers of samples.
Further, in the spam filtering method based on deep learning, S100 specifically comprises:
S110: training on the mail samples;
S120: preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
Vector space models are of two types, Boolean and numeric. In a numeric vector space model, feature weights are computed from the term frequency (TF, the number of times the feature word occurs in the text) or by methods such as TF-IDF (term frequency-inverse document frequency); the latter combines TF and DF (document frequency).
When text is represented by a vector space model, the dimension of the vector space is determined by the number of words in the text set and is therefore very large, yet much of the information in the text is highly redundant, so dimensionality reduction and feature extraction are needed. The specific steps are: preprocess the text, removing stop words and words that occur too rarely in the text; select features from the words using a particular feature selection method. A further step may be included: adding other features as needed, in order to improve classification performance.
The Boolean vector space model is a plain-text representation model in which a feature item has only two states, 0 or 1: 0 means the feature item does not appear in the text, and 1 means the text contains the feature item. The Boolean vector space model thus represents a text as a 0/1 sequence. Its advantages are a relatively simple design and high classification efficiency.
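A Boolean representation of this kind might look like the following sketch. The helper name `to_boolean_vector` and the sample vocabulary are assumptions for illustration; the text is assumed to be tokenized already.

```python
def to_boolean_vector(tokens, vocabulary):
    """Return a 0/1 sequence: 1 if the vocabulary term occurs in the text, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical vocabulary and tokenized mail.
vocabulary = ["free", "meeting", "offer", "report"]
bits = to_boolean_vector(["free", "offer", "free"], vocabulary)
```

Repeated occurrences of a term do not change the result, which is what distinguishes this model from the numeric (TF or TF-IDF) model.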
S130: generating the first vector space model from the constructed feature set;
Generating the first vector space model is the process of vectorizing all features in the feature set and storing them in vector space form.
S140: building the deep belief network from the generated first vector space model.
Further, in the spam filtering method based on deep learning, S120 specifically comprises:
S121: segmenting the trained mail samples into words;
Chinese word segmentation methods fall into three broad categories: segmentation by dictionary-based string matching, understanding-based segmentation, and statistics-based segmentation.
The dictionary-based string matching method, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds. By scanning direction, string matching is divided into forward matching and reverse matching; by which length is matched first, it is divided into maximum matching and minimum matching. Two commonly used segmentation methods are as follows:
(1) Forward maximum matching. The aim of forward maximum matching is to split off the longest compound word. Its basic idea is: let n be the number of Chinese characters in the longest entry of the segmentation dictionary; take the first n characters of the current string of the document being processed as the matching field and look it up in the dictionary. If the dictionary contains such a word, the match succeeds and the matching field is split off as a word. If no such word is found, the match fails; the last character of the matching field is removed and the remaining string is matched again, and so on, until a match succeeds and a word is split off, or the length of the remaining string is zero. This completes one round of matching; the next n characters are then taken and matched, until the whole document has been scanned.
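Forward maximum matching as described above can be sketched in a few lines. This is an illustrative version, not the patent's own implementation: the function name, the tiny dictionary, and the choice to emit an unmatched single character as its own token are assumptions.

```python
def forward_max_match(text, dictionary):
    """Segment text left to right, always taking the longest dictionary entry."""
    max_len = max(len(word) for word in dictionary)  # n: length of the longest entry
    words = []
    start = 0
    while start < len(text):
        # Try the longest matching field first, then drop its last character on failure.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # Assumption: a single character that matches nothing is kept as a token.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                start += size
                break
    return words

# Hypothetical dictionary for illustration.
dictionary = {"研究", "研究生", "生命", "起源"}
segments = forward_max_match("研究生命起源", dictionary)
# segments == ["研究生", "命", "起源"]
```

The example shows a known weakness of the method: because the longest entry 研究生 is taken first, the intended word 生命 is broken up.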
(2) Reverse maximum matching. The basic principle of reverse maximum matching is the same as that of forward maximum matching, except that segmentation proceeds in the opposite direction and a different segmentation dictionary is used. In practice, the document is first reversed to produce a reversed document, which is then processed by forward maximum matching against a reversed dictionary.
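The sketch below illustrates reverse maximum matching. As a simplification, it scans from the end of the text instead of literally reversing the document and the dictionary; the two approaches give the same segmentation. The function name and the dictionary are assumptions made for the example.

```python
def reverse_max_match(text, dictionary):
    """Segment text right to left, always taking the longest dictionary entry."""
    max_len = max(len(word) for word in dictionary)
    words = []
    end = len(text)
    while end > 0:
        # Try the longest matching field ending at `end`, then drop its first character.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumption: a single character that matches nothing is kept as a token.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words

# Hypothetical dictionary for illustration.
dictionary = {"研究", "研究生", "生命", "起源"}
segments = reverse_max_match("研究生命起源", dictionary)
# segments == ["研究", "生命", "起源"]
```

On this input the reverse scan recovers 生命 as a word, whereas a forward scan with the same dictionary would split off 研究生 first.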
The understanding-based segmentation method lets the computer simulate human understanding of sentences in order to recognize words. Its basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. It usually comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence.
The statistics-based segmentation method: formally, a word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability with which characters co-occur is counted, and the combinations with the highest counts are the most likely to form words. Using these frequency statistics to assist segmentation can therefore be effective. The method only needs to count character-group frequencies in a corpus and requires no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction.
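The counting step behind statistics-based segmentation can be sketched as follows. This only shows how co-occurrence counts of adjacent characters are gathered; how the counts are turned into segmentation decisions is not specified in the text. The function name and the three-sentence corpus are assumptions.

```python
from collections import Counter

def adjacent_pair_counts(corpus):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        for first, second in zip(sentence, sentence[1:]):
            counts[first + second] += 1
    return counts

# Hypothetical miniature corpus.
corpus = ["垃圾邮件过滤", "过滤垃圾邮件", "邮件分类"]
pair_counts = adjacent_pair_counts(corpus)
most_frequent_pair = pair_counts.most_common(1)[0]
# most_frequent_pair == ("邮件", 3)
```

Pairs such as 邮件 that recur across sentences score higher than accidental neighbours such as 件过, which is the signal the method relies on.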
S122: constructing a dictionary from all the segmented entries;
While constructing the dictionary, the global factor of every entry may also be computed and the computed value stored in the dictionary, so that it can be called directly in subsequent steps.
S123: counting the word frequency of the entries that remain in the constructed dictionary after stop words are removed.
Stop words are characters or words that are automatically filtered out before or after natural language data (text) is processed. In the present invention, they are preferably words that occur frequently in the text but contribute little to its classification.
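Step S123 amounts to counting the entries that survive stop-word removal. A minimal sketch, assuming the mail has already been segmented into tokens and that a stop-word list is available; the names `term_frequencies`, `tokens`, and `stop_words` are hypothetical.

```python
from collections import Counter

def term_frequencies(tokens, stop_words):
    """Count occurrences of each token that is not a stop word."""
    return Counter(token for token in tokens if token not in stop_words)

# Hypothetical tokenized mail and stop-word list.
tokens = ["free", "offer", "the", "free", "a", "meeting"]
stop_words = {"the", "a"}
frequencies = term_frequencies(tokens, stop_words)
# frequencies == Counter({"free": 2, "offer": 1, "meeting": 1})
```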
From S121 to S123 and from the statement above that "the specific steps of dimensionality reduction and feature extraction are: preprocess the text, removing stop words and words that occur too rarely in the text; select features from the words using a particular feature selection method; a further step may be included: adding other features as needed", it can be seen that steps S122 and S123 may be performed in either order.
Further, in the spam filtering method based on deep learning, S130 specifically comprises:
S131: vectorizing all features in the constructed feature set and storing them in vector space form;
Vectorizing all features in the constructed feature set means converting each of them into a feature vector.
S132: normalizing the generated feature vectors.
Normalization is a way of simplifying computation: an expression with dimensions is transformed into a dimensionless expression and becomes a scalar.
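The text does not say which normalization is applied to the feature vectors. As one common possibility, the sketch below scales a vector to unit Euclidean length; the helper name `l2_normalize` is an assumption, and other schemes such as min-max scaling would fit the description equally well.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

unit = l2_normalize([3, 4])   # approximately [0.6, 0.8]
```

After this step, long and short mails with the same proportions of terms map to the same vector.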
After S132, a further step may be included: assigning different weights to the obtained feature vectors. The weight is the weight of the original feature and may be chosen as the TF-IDF of the word in the preprocessed text; the global factor stored in the dictionary can be called directly. The weight is computed as shown in formula (1):
$\text{TF-IDF} = \frac{TF}{N_i} \cdot \lg \frac{N}{DF}$ (1);
where Ni is the total number of words in the mail; TF is the term frequency of the given word in the document; IDF is the inverse document frequency, a measure of the importance of a word; N is the total number of documents; and DF is the number of documents that contain the word.
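Formula (1) can be evaluated directly once the four counts are known. The sketch below is illustrative only: the function name `tf_idf` and the sample counts are assumptions, and lg is taken to be the base-10 logarithm.

```python
import math

def tf_idf(tf, n_i, n_docs, df):
    """Formula (1): (TF / Ni) * lg(N / DF), with lg read as log base 10."""
    return (tf / n_i) * math.log10(n_docs / df)

# Hypothetical counts: the word occurs 3 times in a 100-word mail
# and appears in 10 of 1000 documents.
weight = tf_idf(tf=3, n_i=100, n_docs=1000, df=10)   # 0.03 * 2 = 0.06

# A word that appears in every document gets weight 0.
common_word_weight = tf_idf(tf=5, n_i=100, n_docs=1000, df=1000)
```

The second call shows why the IDF factor matters: a word found in every document carries no information for classification, however often it occurs.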
Further, in the spam filtering method based on deep learning, S140 comprises:
S141: fully training the n-th RBM to obtain the weights of that RBM;
A restricted Boltzmann machine (RBM) is a stochastic generative neural network that can learn a probability distribution over its input data set. It is the building block of a DBN, and each RBM can also be used on its own as a clusterer. An RBM has a visible layer and a hidden layer: the visible layer consists of visible units and receives the training data; the hidden layer consists of hidden units and serves as a feature detector. The visible units within the visible layer are independent of one another and are connected to the hidden units of the hidden layer; likewise, the hidden units within the hidden layer are independent of one another and are connected to the visible units of the visible layer.
An RBM is defined mainly by an energy function, shown in formula (2):
$E(v, h \mid \theta) = -b^{\top} v - c^{\top} h - h^{\top} W v$ (2);
From formula (2) it follows that the visible units and the hidden units of the RBM satisfy the probability distributions shown in formulas (3) and (4), respectively, where σ is the logistic sigmoid function:
$P(v_i = 1 \mid h) = \sigma\left(b_i + \sum_j w_{ji} h_j\right)$ (3);
$P(h_j = 1 \mid v) = \sigma\left(c_j + \sum_i w_{ji} v_i\right)$ (4);
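Formulas (2) to (4) translate almost line for line into code. The sketch below uses plain Python lists, with `W[j][i]` holding the weight between hidden unit j and visible unit i; all function names and the sample parameter values are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(v, h, W, b, c):
    """Formula (2): E(v, h) = -b.v - c.h - h.W.v"""
    visible_term = sum(b_i * v_i for b_i, v_i in zip(b, v))
    hidden_term = sum(c_j * h_j for c_j, h_j in zip(c, h))
    interaction = sum(h[j] * W[j][i] * v[i]
                      for j in range(len(h)) for i in range(len(v)))
    return -visible_term - hidden_term - interaction

def p_visible_given_hidden(h, W, b):
    """Formula (3): P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def p_hidden_given_visible(v, W, c):
    """Formula (4): P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(c))]

# Hypothetical RBM with two visible units and one hidden unit.
W = [[1.0, 2.0]]
b = [0.5, 0.0]
c = [1.0]
e = energy([1, 1], [1], W, b, c)              # -0.5 - 1.0 - 3.0 = -4.5
p_h = p_hidden_given_visible([1, 1], W, c)    # [sigmoid(4.0)]
```

Because there are no connections within a layer, each conditional probability depends only on the units of the other layer, which is why both can be computed unit by unit.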
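Using the log-likelihood function, the parameter update formulas are obtained as formulas (5), (6), and (7):
$\Delta w_{ji} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right)$ (5);
$\Delta b_i = \eta \left( \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{recon}} \right)$ (6);
$\Delta c_j = \eta \left( \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{recon}} \right)$ (7).
Here η is the learning rate, and the subscripts data and recon denote averages over the training data and over the model's reconstructions of it, respectively.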
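One way to realize formulas (5) to (7) is a single step of contrastive divergence (CD-1) on one training vector. The text gives only the update formulas, so the sampling scheme here is an assumption: hidden states are sampled once from the data, the visible layer is reconstructed as probabilities, and hidden probabilities are used for the statistics. The function name `cd1_step` and the parameter names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cd1_step(v0, W, b, c, lr, rng):
    """Apply one CD-1 update to W (hidden x visible), b (visible) and c (hidden) in place."""
    n_hidden, n_visible = len(c), len(b)
    # Data phase: hidden probabilities given the data, and one sampled hidden state.
    ph0 = [sigmoid(c[j] + sum(W[j][i] * v0[i] for i in range(n_visible)))
           for j in range(n_hidden)]
    h0 = [1 if rng.random() < p else 0 for p in ph0]
    # Reconstruction phase: rebuild the visible layer, then recompute hidden probabilities.
    v1 = [sigmoid(b[i] + sum(W[j][i] * h0[j] for j in range(n_hidden)))
          for i in range(n_visible)]
    ph1 = [sigmoid(c[j] + sum(W[j][i] * v1[i] for i in range(n_visible)))
           for j in range(n_hidden)]
    for j in range(n_hidden):
        for i in range(n_visible):
            W[j][i] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])   # formula (5)
    for i in range(n_visible):
        b[i] += lr * (v0[i] - v1[i])                            # formula (6)
    for j in range(n_hidden):
        c[j] += lr * (ph0[j] - ph1[j])                          # formula (7)
    return W, b, c

# Hypothetical RBM with two visible units and one hidden unit, all parameters zero.
W, b, c = cd1_step([1, 0], [[0.0, 0.0]], [0.0, 0.0], [0.0], lr=0.1, rng=random.Random(0))
```

Starting from all-zero parameters, the update raises the offset of the visible unit that was on in the data and lowers the offset of the one that was off.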
DBN training can use a greedy method to train the RBM of each layer one at a time; that is, step S140 is specifically: first fully train the first RBM; fix the weights and offsets of the first RBM, and use the states of its hidden neurons as the input vector of the second RBM; after the second RBM is fully trained, stack it on top of the first RBM; repeat these steps until all RBMs have been trained.
S142: fixing the weights and offsets of the n-th RBM, and using the states of its hidden neurons as the input vector of the next RBM;
S143: training the next RBM, until all RBMs have been trained.
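Steps S141 to S143 can be sketched as a loop over layers. This is a toy illustration under several assumptions that the text does not fix: each RBM is trained with one-step contrastive divergence, hidden-unit probabilities serve as the "state" passed to the next RBM, and the layer sizes, epoch count, and learning rate are arbitrary. All function and parameter names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(w * x for w, x in zip(W[j], v))) for j in range(len(c))]

def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with CD-1 and return its weights and offsets (W, b, c)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
    b, c = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, c)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            v1 = [sigmoid(b[i] + sum(W[j][i] * h0[j] for j in range(n_hidden)))
                  for i in range(n_visible)]
            ph1 = hidden_probs(v1, W, c)
            for j in range(n_hidden):
                for i in range(n_visible):
                    W[j][i] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                c[j] += lr * (ph0[j] - ph1[j])
            for i in range(n_visible):
                b[i] += lr * (v0[i] - v1[i])
    return W, b, c

def train_dbn(data, hidden_sizes, epochs=5, lr=0.1, seed=0):
    """Greedy layer-wise training: train an RBM, freeze it, feed its output to the next."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, epochs, lr, rng)   # S141
        layers.append((W, b, c))                                      # S142: now fixed
        layer_input = [hidden_probs(v, W, c) for v in layer_input]    # input of next RBM
    return layers, layer_input                                        # S143: all trained

# Hypothetical binary mail vectors over a four-term vocabulary.
mails = [[1, 0, 1, 0], [0, 1, 0, 1]]
layers, top_features = train_dbn(mails, hidden_sizes=[3, 2])
```

Each entry of `layers` is trained only on the output of the layers below it and is never revisited, which is what makes the procedure greedy.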
After this step, a further step may be included: fine-tuning the whole network by error backpropagation as in traditional neural networks. This step can eliminate the error accumulated by training the RBMs layer by layer with the greedy method.
Mail filtering is a binary classification problem. When such problems are handled with a neural network, the number of top-layer neurons generally represents the number of classes, so to implement spam filtering the output layer of the final BP network can be set to contain two neurons, and the number of input-layer neurons is the size of the vocabulary obtained after preprocessing. In the embodiment of the present invention, because RBMs generally operate on binary input data, the RBM preferably uses binary vectors.
The specific training process of the DBN is as follows. First, an unsupervised greedy layer-by-layer method pre-trains the network to obtain the weights of the generative model. In this training stage, a vector v is produced at the visible layer and passes its values to the hidden layer. In turn, the visible-layer input is randomly sampled in an attempt to reconstruct the original input signal. Finally, these new visible neuron activations are passed forward to reconstruct the hidden-layer activations. That is, during training the visible vector is first mapped to the hidden units; the visible units are then reconstructed from the hidden units; and these new visible units are mapped to the hidden units again, yielding new hidden units. Training time is thereby reduced significantly, because a single step suffices to approximate maximum likelihood learning. Each layer added to the network improves the log probability of the training data.
After pre-training, the DBN can be adjusted for discriminative performance using labeled data and the BP algorithm. Here, a label set is attached to the top layer, and a classification surface for the network is obtained from the bottom-up recognition weights that have been learned. This performs better than a network trained by the BP algorithm alone.
Specifically, the first layer is trained on unlabeled data. During training, the parameters of the first layer are learned first; this layer can be regarded as the hidden layer of a three-layer neural network that minimizes the difference between output and input. Because of the limit on model capacity and the sparsity constraint, the resulting model learns the structure of the data itself and thus obtains features with more expressive power than the input. After layer n-1 has been learned, its output is used as the input of layer n to train layer n, so that the parameters of each layer are obtained in turn.
Based on the layer parameters obtained in the first step, the parameters of the whole multilayer model are then further adjusted; this step is a supervised training process. The first step is similar to the random initialization of a neural network, except that in deep learning the initial values are not random but are obtained by learning the structure of the input data, so they are closer to the global optimum and yield better results. Once the trained deep belief network is obtained, the vector space generated from a test sample can be fed in as input to obtain the class of the mail.
As shown in Fig. 2, a spam filtering system based on deep learning includes:
Training module 100, configured to process mail samples to generate a first vector space model and to build a deep belief network, as detailed above;
Test module 200, configured to process the test mail to generate a second vector space model, as detailed above;
Detection module 300, configured to detect the second vector space model using the constructed deep belief network, as detailed above;
Output module 400, configured to output the detection result, as detailed above.
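One way to picture how the four modules fit together is the sketch below. It shows the wiring only: the vectorizer and the model are trivial stand-ins rather than a vector space model or a deep belief network, and every name in it (`SpamFilterPipeline`, `keyword_vectorize`, `fit_stub`, the 0.5 threshold) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Model = Callable[[Vector], float]


@dataclass
class SpamFilterPipeline:
    vectorize: Callable[[str], Vector]                       # mail text -> vector
    fit: Callable[[Sequence[Vector], Sequence[int]], Model]  # builds the network
    threshold: float = 0.5
    model: Optional[Model] = None

    def train(self, samples: Sequence[str], labels: Sequence[int]) -> None:
        """Training module 100: vectorize the samples and build the model."""
        vectors = [self.vectorize(s) for s in samples]
        self.model = self.fit(vectors, labels)

    def prepare(self, mail: str) -> Vector:
        """Test module 200: vectorize the mail under test."""
        return self.vectorize(mail)

    def detect(self, vector: Vector) -> float:
        """Detection module 300: score the vector with the trained model."""
        if self.model is None:
            raise RuntimeError("train() must be called before detect()")
        return self.model(vector)

    def output(self, score: float) -> str:
        """Output module 400: report the detection result."""
        return "spam" if score >= self.threshold else "ham"


def keyword_vectorize(text: str) -> Vector:
    # Stand-in feature: does the mail contain the word "free"?
    return [1.0 if "free" in text.lower() else 0.0]


def fit_stub(vectors: Sequence[Vector], labels: Sequence[int]) -> Model:
    # Stand-in model: spam rate among training mails that show the feature.
    hits = [y for v, y in zip(vectors, labels) if v[0] > 0]
    rate = sum(hits) / len(hits) if hits else 0.0
    return lambda v: rate if v[0] > 0 else 0.0


pipeline = SpamFilterPipeline(keyword_vectorize, fit_stub)
pipeline.train(["Free prize inside", "Lunch at noon", "FREE offer"], [1, 0, 1])
result = pipeline.output(pipeline.detect(pipeline.prepare("Claim your free gift")))
```

Keeping the vectorizer shared between `train` and `prepare` reflects the requirement that training mail and test mail be mapped into vector space models in the same way.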
Further, in the spam filtering system based on deep learning, the training module 100 specifically includes:
Training submodule, configured to train the mail samples, as detailed above;
Preprocessing submodule, configured to preprocess the trained mail samples, determine the features of spam, and construct a feature set, as detailed above;
Model construction submodule, configured to generate the first vector space model from the constructed feature set, as detailed above;
DBN construction submodule, configured to build the deep belief network from the generated first vector space model, as detailed above.
Further, in the spam filtering system based on deep learning, the preprocessing submodule specifically includes:
Word segmentation unit, configured to segment the trained mail samples into words, as detailed above;
Computing unit, configured to calculate the global factor corresponding to every segmented term, as detailed above;
Dictionary construction unit, configured to construct a dictionary from all segmented terms and the calculated global factors, as detailed above;
Word frequency statistics unit, configured to count the word frequencies of the terms that remain after stop words are removed from the constructed dictionary, as detailed above.
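The preprocessing units listed above can be sketched with the standard library alone. The sketch assumes English text split by a regular expression (Chinese mail would need a real word segmenter), a hand-written stop-word list, and a dictionary that simply assigns an index to each term. The global factor is not defined in this passage and is therefore left out. All names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS = {"a", "and", "is", "of", "the", "to", "you"}


def tokenize(text):
    """Word segmentation stand-in: lower-case alphanumeric runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_dictionary(samples):
    """Map every segmented term to an index, in first-seen order."""
    dictionary = {}
    for text in samples:
        for term in tokenize(text):
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def term_frequencies(text, dictionary, stop_words=STOP_WORDS):
    """Count dictionary terms in one mail after dropping stop words."""
    return Counter(term for term in tokenize(text)
                   if term in dictionary and term not in stop_words)


samples = ["Win a free prize", "Meeting moved to the afternoon"]
dictionary = build_dictionary(samples)
counts = term_frequencies("free free prize for you", dictionary)
```

In this example `counts` holds `free` twice and `prize` once; `for` is dropped because it is not in the dictionary, and `you` because it is a stop word.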
Further, in the spam filtering system based on deep learning, the model construction submodule specifically includes:
Feature processing unit, configured to vectorize all features in the constructed feature set and store them in vector space form, as detailed above;
Normalization unit, configured to normalize the generated feature vectors, as detailed above.
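Vectorization and normalization can be illustrated as follows. The passage does not say which normalization is used, so this sketch assumes division by the Euclidean length, which keeps non-negative counts in the range [0, 1]. `to_vector` and `normalize` are hypothetical names.

```python
import math


def to_vector(term_counts, feature_set):
    """Lay out term counts in the fixed order given by the feature set."""
    return [float(term_counts.get(term, 0)) for term in feature_set]


def normalize(vector):
    """Scale a vector to unit Euclidean length; leave an all-zero vector as is."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


feature_set = ["free", "meeting", "prize"]
raw = to_vector({"free": 3, "prize": 4}, feature_set)
unit = normalize(raw)
```

Because every mail is mapped onto the same ordered feature set, vectors from training mail and test mail have the same length and can be fed to the same network.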
Further, in the spam filtering system based on deep learning, the DBN construction submodule specifically includes:
Training unit, configured to fully train the n-th RBM to obtain the weights of that RBM, as detailed above;
RBM processing unit, configured to fix the weights and biases of the n-th RBM and to use the states of its hidden neurons as the input vector of the next RBM, as detailed above.
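The two units above describe greedy stacking, which can be sketched as below. The sketch takes the stacked units to be restricted Boltzmann machines and trains each with one step of contrastive divergence (CD-1), a common choice that the text does not itself name. It passes hidden probabilities, rather than sampled binary states, to the next layer. The class `RBM`, the function `train_dbn`, and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)]
                  for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.b_visible = [0.0] * n_visible

    def hidden_probs(self, v):
        return [sigmoid(sum(w * x for w, x in zip(row, v)) + b)
                for row, b in zip(self.W, self.b_hidden)]

    def visible_probs(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h)))
                        + self.b_visible[i])
                for i in range(len(self.b_visible))]

    def train(self, data, epochs=50, lr=0.1):
        """CD-1: compare data statistics with one-step reconstruction statistics."""
        for _ in range(epochs):
            for v0 in data:
                p_h0 = self.hidden_probs(v0)
                h0 = [1.0 if self.rng.random() < p else 0.0 for p in p_h0]
                v1 = self.visible_probs(h0)
                p_h1 = self.hidden_probs(v1)
                for j in range(len(self.W)):
                    for i in range(len(v0)):
                        self.W[j][i] += lr * (p_h0[j] * v0[i] - p_h1[j] * v1[i])
                    self.b_hidden[j] += lr * (p_h0[j] - p_h1[j])
                for i in range(len(v0)):
                    self.b_visible[i] += lr * (v0[i] - v1[i])


def train_dbn(data, hidden_sizes, seed=0):
    """Train the RBMs one after another, freezing each before moving on."""
    rng = random.Random(seed)
    rbms, inputs = [], data
    for n_hidden in hidden_sizes:
        rbm = RBM(len(inputs[0]), n_hidden, rng)
        rbm.train(inputs)       # fully train the n-th RBM
        rbms.append(rbm)        # its weights and biases are not changed again
        inputs = [rbm.hidden_probs(v) for v in inputs]   # input of the next RBM
    return rbms


data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
rbms = train_dbn(data, [3, 2])
```

The list returned by `train_dbn` holds one trained unit per hidden layer; a supervised step can then start from these weights instead of from random values.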
It should be understood that the application of the present invention is not limited to the examples above. Those of ordinary skill in the art can make improvements or modifications based on the above description, for example to the order in which the vector space model features are processed, and all such improvements and modifications shall fall within the protection scope of the appended claims.
Claims (10)
1. A spam filtering method based on deep learning, characterized in that the spam filtering method based on deep learning includes:
Step A: processing mail samples to generate a first vector space model, and building a deep belief network;
Step B: processing the test mail to generate a second vector space model;
Step C: detecting the second vector space model using the constructed deep belief network;
Step D: outputting the detection result.
2. The spam filtering method based on deep learning according to claim 1, characterized in that step A specifically includes:
Step A1: training the mail samples;
Step A2: preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
Step A3: generating the first vector space model from the constructed feature set;
Step A4: building the deep belief network from the generated first vector space model.
3. The spam filtering method based on deep learning according to claim 2, characterized in that step A2 specifically includes:
Step A21: segmenting the trained mail samples into words;
Step A22: constructing a dictionary from all segmented terms;
Step A23: counting the word frequencies of the terms that remain after stop words are removed from the constructed dictionary.
4. The spam filtering method based on deep learning according to claim 2, characterized in that step A3 specifically includes:
Step A31: vectorizing all features in the constructed feature set and storing them in vector space form;
Step A32: normalizing the generated feature vectors.
5. The spam filtering method based on deep learning according to claim 2, characterized in that step A4 includes:
Step A41: fully training the n-th RBM to obtain the weights of that RBM;
Step A42: fixing the weights and biases of the n-th RBM, and using the states of its hidden neurons as the input vector of the next RBM;
Step A43: training the next RBM, until all RBMs have been trained.
6. A spam filtering system based on deep learning, characterized in that the spam filtering system based on deep learning includes:
Training module, configured to process mail samples to generate a first vector space model and to build a deep belief network;
Test module, configured to process the test mail to generate a second vector space model;
Detection module, configured to detect the second vector space model using the constructed deep belief network;
Output module, configured to output the detection result.
7. The spam filtering system based on deep learning according to claim 6, characterized in that the training module specifically includes:
Training submodule, configured to train the mail samples;
Preprocessing submodule, configured to preprocess the trained mail samples, determine the features of spam, and construct a feature set;
Model construction submodule, configured to generate the first vector space model from the constructed feature set;
DBN construction submodule, configured to build the deep belief network from the generated first vector space model.
8. The spam filtering system based on deep learning according to claim 7, characterized in that the preprocessing submodule specifically includes:
Word segmentation unit, configured to segment the trained mail samples into words;
Computing unit, configured to calculate the global factor corresponding to every segmented term;
Dictionary construction unit, configured to construct a dictionary from all segmented terms and the calculated global factors;
Word frequency statistics unit, configured to count the word frequencies of the terms that remain after stop words are removed from the constructed dictionary.
9. The spam filtering system based on deep learning according to claim 7, characterized in that the model construction submodule specifically includes:
Feature processing unit, configured to vectorize all features in the constructed feature set and store them in vector space form;
Normalization unit, configured to normalize the generated feature vectors.
10. The spam filtering system based on deep learning according to claim 7, characterized in that the DBN construction submodule specifically includes:
Training unit, configured to fully train the n-th RBM to obtain the weights of that RBM;
RBM processing unit, configured to fix the weights and biases of the n-th RBM, and to use the states of its hidden neurons as the input vector of the next RBM.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610464120.6A CN106096005A (en) | 2016-06-23 | 2016-06-23 | Spam filtering method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106096005A (en) | 2016-11-09 |
Family
ID=57252230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610464120.6A Pending CN106096005A (en) | Spam filtering method and system based on deep learning | 2016-06-23 | 2016-06-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106096005A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1614607A (en) * | 2004-11-25 | 2005-05-11 | 中国科学院计算技术研究所 | Filtering method and system for e-mail refuse |
CN101106539A (en) * | 2007-08-03 | 2008-01-16 | 浙江大学 | Filtering method for spam based on supporting vector machine |
CN101227435A (en) * | 2008-01-28 | 2008-07-23 | 浙江大学 | Method for filtering Chinese junk mail based on Logistic regression |
US20130138670A1 (en) * | 2011-11-28 | 2013-05-30 | Hans-Martin Ludwig | Automatic tagging between structured/unstructured data |
Non-Patent Citations (1)
Title |
---|
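Sun Jinguang (孙劲光): "Application of deep belief networks in spam filtering" (深度置信网络在垃圾邮件过滤中的应用), Journal of Computer Applications (《计算机应用》) * |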
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108108184A (en) * | 2017-03-07 | 2018-06-01 | 北京理工大学 | A kind of source code writer identification method based on depth belief network |
CN108108184B (en) * | 2017-03-07 | 2020-12-04 | 北京理工大学 | Source code author identification method based on deep belief network |
CN108694202A (en) * | 2017-04-10 | 2018-10-23 | 上海交通大学 | Configurable Spam Filtering System based on sorting algorithm and filter method |
CN110019773A (en) * | 2017-08-14 | 2019-07-16 | 中国移动通信有限公司研究院 | A kind of refuse messages detection method, terminal and computer readable storage medium |
WO2019051704A1 (en) * | 2017-09-14 | 2019-03-21 | 深圳传音通讯有限公司 | Method and device for identifying junk file |
CN108199953A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | A kind of spam filtering method and system |
CN108199953B (en) * | 2018-01-31 | 2020-09-29 | 湖北工业大学 | Junk mail identification method and system |
CN108805132A (en) * | 2018-06-01 | 2018-11-13 | 华中科技大学 | A kind of rubbish text filter method based on deep learning |
CN108805132B (en) * | 2018-06-01 | 2021-08-20 | 华中科技大学 | Rubbish text filtering method based on deep learning |
CN110149266A (en) * | 2018-07-19 | 2019-08-20 | 腾讯科技(北京)有限公司 | Spam filtering method and device |
CN110149266B (en) * | 2018-07-19 | 2022-06-24 | 腾讯科技(北京)有限公司 | Junk mail identification method and device |
CN109034246A (en) * | 2018-07-27 | 2018-12-18 | 中国矿业大学(北京) | A kind of the determination method and determining system of roadbed saturation state |
CN112688852A (en) * | 2019-10-18 | 2021-04-20 | 上海越力信息科技有限公司 | E-mail management system and method based on deep learning |
CN111079427A (en) * | 2019-12-20 | 2020-04-28 | 北京金睛云华科技有限公司 | Junk mail identification method and system |
CN111970251A (en) * | 2020-07-28 | 2020-11-20 | 西安万像电子科技有限公司 | Data processing method and server |
CN113011503A (en) * | 2021-03-17 | 2021-06-22 | 彭黎文 | Data evidence obtaining method of electronic equipment, storage medium and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161109 |