CN106096005A - Spam filtering method and system based on deep learning - Google Patents

Spam filtering method and system based on deep learning

Info

Publication number
CN106096005A
Authority
CN
China
Prior art keywords
degree
mail
training
depth
rmb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610464120.6A
Other languages
Chinese (zh)
Inventor
杨卫国
邹伟
何震宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Konka Group Co Ltd
Original Assignee
Konka Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Konka Group Co Ltd
Priority to CN201610464120.6A
Publication of CN106096005A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a spam filtering method and system based on deep learning. The method includes: step A, processing mail samples to generate a first vector space model and building a deep belief network; step B, processing a test mail to generate a second vector space model; step C, using the constructed deep belief network to detect the second vector space model; and step D, outputting the detection result. Because the method builds a deep belief network and uses it to examine the test mail, it improves the accuracy and stability of spam identification while saving the time and labor otherwise needed to label a large number of samples.

Description

Spam filtering method and system based on deep learning
Technical field
The present invention relates to the field of spam filtering, and in particular to a spam filtering method and system based on deep learning.
Background
With the rapid development of Internet technology, e-mail has become an indispensable part of people's life, work, and study. It brings us great convenience, but the trouble that spam causes in people's lives keeps growing.
The key problem in mail filtering is how to use a known e-mail text data set to build a text classification model and then use that model to determine the type of an e-mail, so that spam can be filtered out. Common algorithms include the k-nearest-neighbor algorithm (KNN), the naive Bayes algorithm, decision tree algorithms, and support vector machines. Each of these algorithms, however, has its own limitations.
For the naive Bayes algorithm, whichever probability model is selected, the model can only compute the probability that a mail belongs to the spam class given a text, and it presupposes that the features are pairwise independent. For the KNN algorithm, the choice of k is particularly important because it determines the correctness of the final classification, yet so far there is no good method for determining a reasonable value of k.
Because spam filtering is in essence a binary classification problem, traditional classification methods can achieve the goal, but their results are poor. The method mainly used for mail filtering at present is rule-based filtering. This method depends heavily on the rules: if the rules are well chosen, the filtering results are correspondingly good. But the characteristics of spam keep changing, which requires the rules to be adjusted constantly; this is both passive and troublesome.
Therefore, the prior art still needs to be improved and developed.
Summary of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a spam filtering method and system based on deep learning that improves the accuracy and stability of spam filtering while saving the time and labor needed to label a large number of samples.
The technical solution of the present invention is as follows:
A spam filtering method based on deep learning, wherein the method includes:
Step A: process mail samples to generate a first vector space model, and build a deep belief network;
Step B: process a test mail to generate a second vector space model;
Step C: use the constructed deep belief network to detect the second vector space model;
Step D: output the detection result.
In the spam filtering method based on deep learning, step A specifically includes:
Step A1: train the mail samples;
Step A2: preprocess the trained mail samples, determine the features of spam, and construct a feature set;
Step A3: generate the first vector space model from the constructed feature set;
Step A4: build the deep belief network from the generated first vector space model.
In the spam filtering method based on deep learning, step A2 specifically includes:
Step A21: segment the trained mail samples into words;
Step A22: construct a dictionary from all the segmented entries;
Step A23: count the term frequencies of the entries that remain in the constructed dictionary after stop words are removed.
In the spam filtering method based on deep learning, step A3 specifically includes:
Step A31: vectorize all features in the constructed feature set and store them in vector space form;
Step A32: normalize the generated feature vectors.
In the spam filtering method based on deep learning, step A4 includes:
Step A41: fully train the n-th RBM to obtain its weights;
Step A42: fix the weights and offsets of the n-th RBM, and use the states of its hidden neurons as the input vector of the next RBM;
Step A43: train the next RBM, and repeat until all RBMs have been trained.
A spam filtering system based on deep learning, wherein the system includes:
Training module, configured to process mail samples to generate a first vector space model and to build a deep belief network;
Test module, configured to process a test mail to generate a second vector space model;
Detection module, configured to use the constructed deep belief network to detect the second vector space model;
Output module, configured to output the detection result.
In the spam filtering system based on deep learning, the training module specifically includes:
Training submodule, configured to train the mail samples;
Preprocessing submodule, configured to preprocess the trained mail samples, determine the features of spam, and construct a feature set;
Model construction submodule, configured to generate the first vector space model from the constructed feature set;
DBN construction submodule, configured to build the deep belief network from the generated first vector space model.
In the spam filtering system based on deep learning, the preprocessing submodule specifically includes:
Segmentation unit, configured to segment the trained mail samples into words;
Calculation unit, configured to calculate the global factors corresponding to all the segmented entries;
Dictionary construction unit, configured to construct a dictionary from all the segmented entries and the calculated global factors;
Term frequency statistics unit, configured to count the term frequencies of the entries that remain in the constructed dictionary after stop words are removed.
In the spam filtering system based on deep learning, the model construction submodule specifically includes:
Feature processing unit, configured to vectorize all features in the constructed feature set and store them in vector space form;
Normalization unit, configured to normalize the generated feature vectors.
In the spam filtering system based on deep learning, the DBN construction submodule specifically includes:
Training unit, configured to fully train the n-th RBM to obtain its weights;
RBM processing unit, configured to fix the weights and offsets of the n-th RBM and to use the states of its hidden neurons as the input vector of the next RBM.
Because the spam filtering method based on deep learning provided by the present invention builds a deep belief network and uses it to examine the test mail, it improves the accuracy and stability of spam identification while saving the time and labor needed to label a large number of samples.
Brief description of the drawings
Fig. 1 is a schematic main flowchart of the spam filtering method based on deep learning of the present invention;
Fig. 2 is a schematic structural diagram of the spam filtering system based on deep learning of the present invention.
Detailed description
The present invention provides a spam filtering method and system based on deep learning. To make the purpose, technical solution, and effects of the present invention clearer and more definite, the present invention is described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The present invention provides a spam filtering method based on deep learning. It uses the self-learning ability of the deep belief network, combined with the advantages of big data, to improve classification ability by learning from the large number of samples available on the network. On the one hand, this improves the accuracy and stability of spam filtering. On the other hand, the deep belief network is a semi-supervised learning model that can be trained on large sample sets without class labels, so compared with traditional supervised learning models it saves the time and labor needed to label a large number of samples.
As shown in Fig. 1, a spam filtering method based on deep learning includes:
S100: process mail samples to generate a first vector space model, and build a deep belief network;
In the embodiment of the present invention, the mail samples are preferably a training mail set, that is, a set composed of a large number of mails of known class, also called a training set. The characteristics of each mail class can be summarized by training the mail samples.
The concept of deep learning comes from research on artificial neural networks; a multilayer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features into more abstract high-level representations of attribute classes or features, in order to discover distributed feature representations of data.
The vector space model (VSM) reduces the processing of text content to vector operations in a vector space and expresses semantic similarity as spatial similarity, which is intuitive and easy to understand. Once documents are represented as vectors in document space, the similarity between documents can be measured by computing the similarity between the vectors.
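The patent does not say which similarity measure is used between document vectors. As a minimal sketch, assuming cosine similarity (a common choice for vector space models, not something the source specifies), two term-weight vectors could be compared as follows. The function name and the sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)


# Hypothetical document vectors over a three-term vocabulary.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # no terms in common with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Vectors that point in the same direction score close to 1, and vectors with no shared terms score 0, regardless of document length.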
In information filtering and retrieval, vector space models are commonly used to represent text for ease of computation. The model first selects, from the text, feature items that are representative of it.
A deep belief network (DBN) can serve both as a generative model and as a discriminative model. By training the weights between its neurons, the whole neural network is made to generate the training data with maximum probability. It can be used to recognize features, classify data, and even generate data.
A DBN is made up of multiple layers of neurons, divided into visible neurons (visible units) and hidden neurons (hidden units, also called feature detectors). Visible units receive the input, and hidden units extract features. The connections between the top two layers are undirected and form an associative memory; the other, lower layers are joined by top-down directed connections. The bottom layer represents the data vector, and each neuron represents one dimension of the data vector.
In the embodiment of the present invention, a deep belief network built as a feedforward neural network with a deep architecture is preferably used as the network model for training mail classification, since it can approximate complex functions with relatively few parameters.
S200: process a test mail to generate a second vector space model;
Processing the test mail means expressing it in vector space model form, that is, representing a text (a mail) as an n-dimensional vector. Because natural text cannot be processed directly by the constructed classification algorithm, the text must first be converted into a form the classifier can recognize. Suppose the values of the n feature items of a document are w1, w2, ..., wn. Because they come from the same mail to be filtered, they are treated as a whole and made to form a feature vector d; that is, each text is regarded as a vector in n-dimensional space, written d(w1, w2, ..., wn), where wi is the weight of the i-th feature item and n is the number of feature items. A feature item can be a character, a word, a phrase, or some concept; words are preferred because they give higher classification accuracy. Text representation thus becomes: first segment the text into words, then represent the text with these words as the dimensions of the vector.
In the embodiment of the present invention, a document refers to a mail or a fragment of a mail, such as a paragraph, a group of sentences, or a sentence.
Weight is a relative concept, defined with respect to a particular index. The weight of an index is its relative importance in the overall evaluation. Weighting distinguishes the importance of the several evaluation indexes, and the weights corresponding to a set of evaluation indexes form a weight system.
S300: use the constructed deep belief network to detect the second vector space model;
Using the constructed deep belief network to detect the second vector space model means processing the mail to be filtered with the trained deep belief network and classifying it as spam or normal mail. In other words, the constructed deep belief network classifies the mail to be filtered, represented as the second vector space model, where the classes include spam and normal mail.
S400: output the detection result.
Outputting the detection result means outputting whether the mail filtered through the above steps is spam, or which class of the training mail set it belongs to, so that the mail recipient or the system knows the class of the mail. Further processing can follow; for example, after the mail recipient confirms, the class or the mail's source address is added to a blacklist, graylist, or whitelist.
Because the spam filtering method based on deep learning provided by the present invention builds a deep belief network and uses it to examine the test mail, it improves the accuracy and stability of spam identification while saving the time and labor needed to label a large number of samples.
Further, in the spam filtering method based on deep learning, S100 specifically includes:
S110: train the mail samples;
S120: preprocess the trained mail samples, determine the features of spam, and construct a feature set;
Vector space models are of two kinds, Boolean and numeric. In a numeric vector space model, feature weights are computed by methods such as term frequency (TF, the number of times the feature word occurs in the text) or TF-IDF (term frequency-inverse document frequency); the latter combines TF and DF (document frequency).
When text is represented with a vector space model, the dimension of the vector space is determined by the number of words in the text set and is therefore very large, yet much of the information in the text is highly redundant, so dimensionality reduction and feature extraction are needed. The specific steps are: preprocess the text, removing stop words and words that occur too rarely in the text; and select features from the words using a specific feature selection method. A further step can be included: add other features as needed, with the aim of improving classification performance.
The Boolean vector space model is a simple text representation in which each feature item has only two states, 0 or 1: 0 means the feature item does not occur in the text, and 1 means the text contains it. The Boolean vector space model thus represents a text as a sequence of 0s and 1s. Its advantages are a simple design and high classification efficiency.
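The Boolean representation described above can be sketched in a few lines. The vocabulary, the tokens, and the function name below are hypothetical and serve only to illustrate the 0/1 encoding.

```python
def boolean_vector(tokens, vocabulary):
    """Return a 0/1 list: 1 where the vocabulary term occurs in the mail, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabulary (one vector dimension per term) and segmented mail.
vocabulary = ["free", "invoice", "meeting", "prize"]
mail_tokens = ["free", "prize", "free", "click"]

vector = boolean_vector(mail_tokens, vocabulary)
print(vector)
```

Repeated words and words outside the vocabulary do not change the result, which is why this representation is cheap to compute but discards frequency information.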
S130: generate the first vector space model from the constructed feature set;
Generating the first vector space model means vectorizing all the features in the feature set and storing them in vector space form.
S140: build the deep belief network from the generated first vector space model.
Further, in the spam filtering method based on deep learning, S120 specifically includes:
S121: segment the trained mail samples into words;
Chinese word segmentation methods fall into three broad categories: dictionary-based string matching, understanding-based segmentation, and statistics-based segmentation.
Dictionary-based string matching, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to some strategy; if a string is found in the dictionary, the match succeeds. By scanning direction, string matching divides into forward matching and reverse matching; by which length is matched first, it divides into maximum matching and minimum matching. Two common segmentation methods are as follows:
(1) Forward maximum matching. The aim of forward maximum matching is to split off the longest compound word. Its basic idea is: assume the longest entry in the segmentation dictionary contains n Chinese characters; take the first n characters of the current string of the document being processed as the matching field and look it up in the dictionary. If the dictionary contains such an n-character word, the match succeeds and the matching field is split off as a word. If it does not, the match fails; the last character of the matching field is removed and the remaining string is matched again, and so on, until a match succeeds and a word is split off, or the length of the remaining string is zero. This completes one round of matching; the next n-character string is then taken and matched, until the whole document has been scanned.
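A minimal sketch of forward maximum matching follows. It makes one assumption the text leaves open: when no dictionary entry matches, the single leading character is emitted as a word so that the scan always advances. The dictionary and sentence are illustrative.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by repeatedly taking the longest dictionary word at the front."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Start from the longest candidate and drop the last character on each failure.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: an unmatched single character is emitted as its own word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical segmentation dictionary.
dictionary = {"研究", "研究生", "生命", "起源"}
segments = forward_max_match("研究生命起源", dictionary)
print(segments)
```

On this input the longest front match "研究生" is taken first, which leaves "命" stranded as a single character. This kind of greedy error is one reason the reverse variant is also used.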
(2) Reverse maximum matching. The basic principle is the same as for forward maximum matching; the differences are that the segmentation direction is the opposite and the segmentation dictionary used is different. In practice, the document is first reversed to produce a reverse-order document. The reverse-order document is then processed with forward maximum matching against a reverse-order dictionary.
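The procedure just described (reverse the document, reverse every dictionary entry, then run forward maximum matching) can be sketched directly. As in any such sketch, the single-character fallback for unmatched text is an assumption, and the dictionary and sentence are illustrative.

```python
def forward_max_match(text, dictionary):
    """Forward maximum matching; unmatched single characters become words."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, dictionary):
    """Reverse maximum matching built from forward matching on reversed input."""
    reversed_dictionary = {word[::-1] for word in dictionary}  # reverse-order dictionary
    reversed_words = forward_max_match(text[::-1], reversed_dictionary)
    # Undo both reversals: restore word order, then character order inside each word.
    return [word[::-1] for word in reversed(reversed_words)]


dictionary = {"研究", "研究生", "生命", "起源"}  # hypothetical dictionary
segments = reverse_max_match("研究生命起源", dictionary)
print(segments)
```

Scanning from the end, the same sentence is split into three dictionary words, with no stranded character.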
Understanding-based segmentation has the computer simulate human understanding of sentences in order to recognize words. Its basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. It usually has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the way a person understands a sentence.
Statistics-based segmentation: formally, a word is a stable combination of characters, so the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur is counted, and the combinations with the highest counts are the most likely to form words. Using term frequency statistics to help segmentation can therefore be reasonably effective. This method only needs to count the frequencies of character groups in the corpus and needs no segmentation dictionary, so it is also called dictionary-free segmentation or statistical word extraction.
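The counting step behind statistics-based segmentation can be illustrated with adjacent character pairs. This is only a sketch of the co-occurrence statistic; the source does not specify a scoring rule or threshold, and the tiny corpus is made up.

```python
from collections import Counter


def adjacent_pair_counts(sentences):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    return counts


# Hypothetical three-sentence corpus.
corpus = ["垃圾邮件", "过滤垃圾", "垃圾分类"]
counts = adjacent_pair_counts(corpus)

# Pairs that co-occur most often are the strongest word candidates.
print(counts.most_common(1))
```

Here "垃圾" occurs in every sentence, so it stands out as a likely word without any dictionary being consulted.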
S122: construct a dictionary from all the segmented entries;
While the dictionary is constructed, the global factor of every entry can also be calculated and the resulting values placed in the dictionary, so that they can be called directly in later steps.
S123: count the term frequencies of the entries that remain in the constructed dictionary after stop words are removed.
Before or after natural language data (or text) is processed, certain characters or words are filtered out automatically; these are called stop words. In the present invention, stop words are preferably words that occur often in the text but contribute little to its classification.
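Step S123 can be sketched as a filter followed by a count. The stop word list, the `min_count` threshold for dropping rare words, and the sample tokens are all hypothetical; the patent does not give concrete values.

```python
from collections import Counter

# Hypothetical stop word list; a real system would load a much larger one.
STOP_WORDS = {"the", "a", "to", "is", "of"}


def term_frequencies(tokens, stop_words=STOP_WORDS, min_count=1):
    """Count the remaining terms after stop words are removed.

    Terms seen fewer than min_count times are dropped as too rare.
    """
    counts = Counter(token for token in tokens if token not in stop_words)
    return {term: n for term, n in counts.items() if n >= min_count}


tokens = "the prize is free claim the free prize".split()
tf = term_frequencies(tokens)
frequent_only = term_frequencies(tokens, min_count=2)
print(tf)
print(frequent_only)
```

Raising `min_count` is one simple way to apply the "remove words that occur too rarely" part of the dimensionality reduction described earlier.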
From S121 to S123 and the dimensionality reduction and feature extraction steps described above (preprocess the text, removing stop words and words that occur too rarely in the text; select features from the words using a specific feature selection method; optionally add other features as needed), it can be seen that the order of steps S122 and S123 can be exchanged.
Further, in the spam filtering method based on deep learning, S130 specifically includes:
S131: vectorize all features in the constructed feature set and store them in vector space form;
Vectorizing all features in the constructed feature set means converting each of them into a feature vector.
S132: normalize the generated feature vectors.
Normalization is a way of simplifying computation: a dimensional expression is transformed into a dimensionless expression, becoming a scalar.
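The patent does not state which normalization is applied in S132. As one possible reading, assuming min-max scaling into the range [0, 1], a feature vector of raw counts could be made dimensionless as follows; the function name and sample values are illustrative.

```python
def min_max_normalize(vector):
    """Scale the values of a feature vector linearly into [0, 1]."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # A constant vector has no spread; map every entry to 0.0.
        return [0.0 for _ in vector]
    return [(x - lo) / (hi - lo) for x in vector]


normalized = min_max_normalize([2, 4, 6, 10])
print(normalized)
```

Other choices, such as dividing by the vector's Euclidean length, would serve the same purpose of removing the dependence on mail length.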
A further step can be included after S132: assign different weights to the resulting feature vectors. The weight value is the weight of the original feature and can be chosen as the TF-IDF of the word in the preprocessed text, which can directly call the global factor stored in the dictionary. It is calculated as shown in formula (1):
$$\text{TF-IDF} = \frac{TF}{N_i} \cdot \lg\frac{N}{DF} \qquad (1)$$
where Ni is the total number of words in the mail; TF is the frequency of the given word in the document; IDF is the inverse document frequency, a measure of the importance of a word; N is the total number of documents; and DF is the number of documents that contain the word.
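Formula (1) translates directly into code. The sketch assumes that lg denotes the base-10 logarithm, which is the usual meaning of that notation; the parameter names and the sample numbers are illustrative.

```python
import math


def tf_idf(tf, words_in_mail, total_docs, docs_with_word):
    """Formula (1): (TF / Ni) * lg(N / DF), with lg taken as log base 10."""
    return (tf / words_in_mail) * math.log10(total_docs / docs_with_word)


# A word that occurs 3 times in a 100-word mail and appears in 10 of 1000 mails.
weight = tf_idf(tf=3, words_in_mail=100, total_docs=1000, docs_with_word=10)

# A word that appears in every mail carries no weight, because lg(1) = 0.
common = tf_idf(tf=3, words_in_mail=100, total_docs=1000, docs_with_word=1000)
print(weight, common)
```

The second call shows why the inverse document frequency matters: a word present in every document cannot help separate spam from normal mail.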
Further, in the spam filtering method based on deep learning, S140 includes:
S141: fully train the n-th RBM to obtain its weights;
A restricted Boltzmann machine (RBM) is a stochastic generative neural network that can learn a probability distribution over its input data set. It is the building block of a DBN, and each RBM can also be used on its own as a clusterer. An RBM has a visible layer and a hidden layer. The visible layer consists of visible units and takes the training data as input; the hidden layer consists of hidden units and acts as a feature detector. The visible units within the visible layer are independent of one another and are connected to the hidden units of the hidden layer; likewise, the hidden units within the hidden layer are independent of one another and are connected to the visible units of the visible layer.
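An RBM is defined mainly by an energy function, shown in formula (2):
$$E(v, h \mid \theta) = -b^{T} v - c^{T} h - h^{T} W v \qquad (2)$$
From formula (2) it follows that the state vectors of the visible layer and the hidden layer of the RBM satisfy the probability distributions shown in formula (3) and formula (4), respectively:
$$P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j w_{ji} h_j\Big) \qquad (3)$$
$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i w_{ji} v_i\Big) \qquad (4)$$
Using the log-likelihood function, the parameter update rules are obtained as formula (5), formula (6), and formula (7):
$$\Delta W_{ji} = \eta\big(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{confabula}}\big) \qquad (5)$$
$$\Delta b_i = \eta\big(\langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{confabula}}\big) \qquad (6)$$
$$\Delta c_j = \eta\big(\langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{confabula}}\big) \qquad (7)$$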
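To make formulas (3) to (7) concrete, here is a minimal pure-Python sketch of one parameter update for a single binary training vector. It assumes the second expectation in (5) to (7) is approximated by one Gibbs step (one-step contrastive divergence), which the text does not name explicitly; the function names, layer sizes, learning rate, and seed are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probabilities, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


def hidden_probs(v, W, c):
    # Formula (4): P(h_j = 1 | v) = sigma(c_j + sum_i w_ji * v_i)
    return [sigmoid(c[j] + sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    # Formula (3): P(v_i = 1 | h) = sigma(b_i + sum_j w_ji * h_j)
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_update(v0, W, b, c, eta, rng):
    """Apply formulas (5)-(7) in place, using a one-step reconstruction."""
    ph0 = hidden_probs(v0, W, c)                     # statistics from the data
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)        # reconstructed visible state
    ph1 = hidden_probs(v1, W, c)                     # statistics from the reconstruction
    for j in range(len(c)):
        for i in range(len(b)):
            W[j][i] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])  # formula (5)
        c[j] += eta * (ph0[j] - ph1[j])                         # formula (7)
    for i in range(len(b)):
        b[i] += eta * (v0[i] - v1[i])                           # formula (6)
    return v1


rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
b = [0.0] * n_visible   # visible offsets
c = [0.0] * n_hidden    # hidden offsets
reconstruction = cd1_update([1, 0, 1, 0], W, b, c, eta=0.1, rng=rng)
```

Using hidden probabilities rather than sampled states in the update is a common variance-reducing choice and is an assumption of this sketch, not something the source prescribes.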
During DBN training, a greedy method can be used to train the RBM of each layer in turn. Specifically, step S140 is: first fully train the first RBM; fix the weights and offsets of the first RBM, then use the states of its hidden neurons as the input vector of the second RBM; after the second RBM is fully trained, stack it on top of the first RBM; repeat these steps until all RBMs have been trained.
S142: fix the weights and offsets of the n-th RBM, and use the states of its hidden neurons as the input vector of the next RBM;
S143: train the next RBM, and repeat until all RBMs have been trained.
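Steps S141 to S143 amount to a loop over layers. The sketch below shows that loop with a small pure-Python RBM. The class and function names, the number of epochs standing in for "fully train", the learning rate, and the toy data are all assumptions made for illustration, and each RBM is trained with a one-step reconstruction as a simplification.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_visible  # visible offsets
        self.c = [0.0] * n_hidden   # hidden offsets

    def hidden_probs(self, v):
        return [sigmoid(cj + sum(w * x for w, x in zip(row, v)))
                for row, cj in zip(self.W, self.c)]

    def visible_probs(self, h):
        return [sigmoid(bi + sum(self.W[j][i] * h[j] for j in range(len(h))))
                for i, bi in enumerate(self.b)]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def train(self, data, epochs=5, eta=0.1):
        """One-step reconstruction updates; epochs is a hypothetical stopping rule."""
        for _ in range(epochs):
            for v0 in data:
                ph0 = self.hidden_probs(v0)
                v1 = self.sample(self.visible_probs(self.sample(ph0)))
                ph1 = self.hidden_probs(v1)
                for j, row in enumerate(self.W):
                    for i in range(len(row)):
                        row[i] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
                    self.c[j] += eta * (ph0[j] - ph1[j])
                for i in range(len(self.b)):
                    self.b[i] += eta * (v0[i] - v1[i])


def train_dbn(data, hidden_sizes, seed=0):
    """Greedy layer-wise training: one RBM per entry in hidden_sizes."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        rbm.train(layer_input)   # S141: train this RBM
        layers.append(rbm)       # S142: its weights and offsets are not touched again
        # S142: hidden states of this RBM become the input vectors of the next one
        layer_input = [rbm.sample(rbm.hidden_probs(v)) for v in layer_input]
    return layers                # S143: loop ends when every RBM is trained


# Hypothetical binary mail vectors over a six-term vocabulary.
data = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]
layers = train_dbn(data, hidden_sizes=[4, 2])
```

Each RBM only ever sees the output of the layer below it, which is what makes the procedure greedy: no layer is revisited until the optional fine-tuning pass.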
A further step can follow: fine-tune the whole network with the error back-propagation procedure of traditional neural networks. This step can eliminate the error accumulated by training the RBMs layer by layer with the greedy method.
Mail filtering is a binary classification problem. When such a problem is handled with a neural network, the number of top-layer neurons generally equals the number of classes, so to implement spam filtering the output layer of the final BP network can be set to contain two neurons, and the number of input-layer neurons is the size of the vocabulary obtained after preprocessing. In the embodiment of the present invention, because RBMs generally operate on binary input data, binary vectors are preferably used for the RBMs.
The concrete DBN training process first uses an unsupervised greedy layer-by-layer method to pre-train the weights of the generative model. In this training stage, a vector v is produced at the visible layer and passes its values to the hidden layer. In turn, the visible-layer input is randomly selected in an attempt to reconstruct the original input signal. Finally, these new visible neuron activations are passed forward to reconstruct the hidden-layer activations. In the training process, the visible vector values are first mapped to the hidden units; the visible units are then reconstructed from the hidden units; these new visible units are mapped to the hidden units again, giving new hidden units. Training time is thereby reduced significantly, because a single step is enough to approximate maximum-likelihood learning. Each layer added to the network improves the log probability of the training data.
After pre-training, the DBN's discriminative performance can be tuned with the BP algorithm on labeled data. Here a label set is attached to the top layer, and the recognition weights learned bottom-up give the network its classification surface. The resulting performance is better than that of a network trained with the BP algorithm alone.
Specifically, the first layer is trained first on unlabeled data. Training learns the parameters of the first layer; this layer can be regarded as the hidden layer of a three-layer neural network that minimizes the difference between output and input. Because of the limit on model capacity and the sparsity constraint, the resulting model learns the structure of the data itself and thus obtains features with more expressive power than the input. After layer n-1 has been learned, the output of layer n-1 is used as the input of layer n to train layer n, so that the parameters of each layer are obtained in turn.
The parameters of the whole multilayer model are then adjusted further on the basis of the per-layer parameters obtained in the first step; this step is a supervised training process. The first step resembles the random initialization of a neural network, but because the first step of deep learning is not random initialization and instead learns the structure of the input data, the initial values are closer to the global optimum and better results can be obtained. Once the trained deep belief network is obtained, the vector space generated from a test sample can be fed in as input to obtain the class of the mail.
As shown in Fig. 2, a spam filtering system based on deep learning includes:
Training module 100, for processing mail samples to generate a first vector space model and for building a deep belief network, as described above;
Test module 200, for processing test mail to generate a second vector space model, as described above;
Detection module 300, for checking the second vector space model with the constructed deep belief network, as described above;
Output module 400, for outputting the detection result, as described above.
Further, in the spam filtering system based on deep learning, the training module 100 specifically includes:
Training submodule, for training mail samples, as described above;
Preprocessing submodule, for preprocessing the trained mail samples, determining the features of spam, and constructing a feature set, as described above;
Model construction submodule, for generating the first vector space model from the constructed feature set, as described above;
DBN construction submodule, for building the deep belief network from the generated first vector space model, as described above.
Further, in the spam filtering system based on deep learning, the preprocessing submodule specifically includes:
Word segmentation unit, for segmenting the trained mail samples into words, as described above;
Computing unit, for calculating the global factor corresponding to each segmented entry, as described above;
Dictionary construction unit, for building a dictionary from all segmented entries and the calculated global factors, as described above;
Word frequency statistics unit, for counting the word frequency of the entries that remain after stop words are removed from the constructed dictionary, as described above.
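The segmentation, dictionary construction, stop-word removal, and word-frequency counting performed by these units can be sketched as below. Several simplifications are assumed here: a regular-expression tokenizer for English text stands in for a Chinese word segmenter, the stop-word list is hypothetical, and the global factor is left out because this passage does not define how it is computed.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "to", "and", "of"}

def tokenize(text):
    """Split a mail body into lowercase word tokens (stand-in for a segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(tokenized_mails):
    """Collect every distinct entry seen in the training mails."""
    return sorted({tok for mail in tokenized_mails for tok in mail})

def term_frequencies(tokens, dictionary, stop_words=STOP_WORDS):
    """Count the entries that remain after stop words are removed."""
    kept = set(dictionary) - set(stop_words)
    return Counter(tok for tok in tokens if tok in kept)

# Toy usage on two sample mails.
mails = ["Win a FREE prize now", "Meeting moved to the afternoon"]
tokenized = [tokenize(m) for m in mails]
dictionary = build_dictionary(tokenized)
freqs = [term_frequencies(t, dictionary) for t in tokenized]
```

Keeping the dictionary separate from the per-mail counts means the same dictionary can later be reused to process test mail consistently.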
Further, in the spam filtering system based on deep learning, the model construction submodule specifically includes:
Feature processing unit, for vectorizing all features in the constructed feature set and storing them in vector space form, as described above;
Normalization unit, for normalizing the generated feature vectors, as described above.
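Vectorization and normalization can be illustrated with a short sketch. The patent does not say which normalization is used; scaling each vector by its largest component, so that all values fall in [0, 1], is an assumption made here, and the names `vectorize` and `normalize` are hypothetical.

```python
import numpy as np

def vectorize(freqs, features):
    """Map a word-frequency table onto a fixed, ordered feature list."""
    return np.array([float(freqs.get(f, 0)) for f in features])

def normalize(vec):
    """Scale a vector into [0, 1] by its largest component (assumed scheme)."""
    peak = vec.max() if vec.size else 0.0
    return vec / peak if peak > 0 else vec

# Toy usage: three features, one mail.
features = ["free", "meeting", "prize"]
raw = vectorize({"free": 2, "prize": 1}, features)
scaled = normalize(raw)
```

Values in [0, 1] are convenient because they can be fed directly to the visible layer of a network with sigmoid units.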
Further, in the spam filtering system based on deep learning, the DBN construction submodule specifically includes:
Training unit, for fully training the n-th RBM to obtain the weights of that RBM, as described above;
RBM processing unit, for fixing the weights and offsets of the n-th RBM and using the states of its hidden neurons as the input vector of the next RBM, as described above.
It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or variations based on the above description, for example to the processing order of the vector space model features, and all such improvements and variations shall fall within the protection scope of the appended claims.

Claims (10)

1. A spam filtering method based on deep learning, characterized in that the method comprises:
Step A: processing mail samples to generate a first vector space model, and building a deep belief network;
Step B: processing test mail to generate a second vector space model;
Step C: checking the second vector space model with the constructed deep belief network;
Step D: outputting the detection result.
2. The spam filtering method based on deep learning according to claim 1, characterized in that step A specifically comprises:
Step A1: training mail samples;
Step A2: preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
Step A3: generating the first vector space model from the constructed feature set;
Step A4: building the deep belief network from the generated first vector space model.
3. The spam filtering method based on deep learning according to claim 2, characterized in that step A2 specifically comprises:
Step A21: segmenting the trained mail samples into words;
Step A22: building a dictionary from all segmented entries;
Step A23: counting the word frequency of the entries that remain after stop words are removed from the constructed dictionary.
4. The spam filtering method based on deep learning according to claim 2, characterized in that step A3 specifically comprises:
Step A31: vectorizing all features in the constructed feature set and storing them in vector space form;
Step A32: normalizing the generated feature vectors.
5. The spam filtering method based on deep learning according to claim 2, characterized in that step A4 comprises:
Step A41: fully training the n-th RBM to obtain the weights of that RBM;
Step A42: fixing the weights and offsets of the n-th RBM, and using the states of its hidden neurons as the input vector of the next RBM;
Step A43: training the next RBM, until all RBMs have been trained.
6. A spam filtering system based on deep learning, characterized in that the system comprises:
Training module, for processing mail samples to generate a first vector space model and for building a deep belief network;
Test module, for processing test mail to generate a second vector space model;
Detection module, for checking the second vector space model with the constructed deep belief network;
Output module, for outputting the detection result.
7. The spam filtering system based on deep learning according to claim 6, characterized in that the training module specifically comprises:
Training submodule, for training mail samples;
Preprocessing submodule, for preprocessing the trained mail samples, determining the features of spam, and constructing a feature set;
Model construction submodule, for generating the first vector space model from the constructed feature set;
DBN construction submodule, for building the deep belief network from the generated first vector space model.
8. The spam filtering system based on deep learning according to claim 7, characterized in that the preprocessing submodule specifically comprises:
Word segmentation unit, for segmenting the trained mail samples into words;
Computing unit, for calculating the global factor corresponding to each segmented entry;
Dictionary construction unit, for building a dictionary from all segmented entries and the calculated global factors;
Word frequency statistics unit, for counting the word frequency of the entries that remain after stop words are removed from the constructed dictionary.
9. The spam filtering system based on deep learning according to claim 7, characterized in that the model construction submodule specifically comprises:
Feature processing unit, for vectorizing all features in the constructed feature set and storing them in vector space form;
Normalization unit, for normalizing the generated feature vectors.
10. The spam filtering system based on deep learning according to claim 7, characterized in that the DBN construction submodule specifically comprises:
Training unit, for fully training the n-th RBM to obtain the weights of that RBM;
RBM processing unit, for fixing the weights and offsets of the n-th RBM and using the states of its hidden neurons as the input vector of the next RBM.
CN201610464120.6A 2016-06-23 2016-06-23 Spam filtering method and system based on deep learning Pending CN106096005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610464120.6A CN106096005A (en) 2016-06-23 2016-06-23 Spam filtering method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN106096005A (en) 2016-11-09

Family

ID=57252230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610464120.6A Pending CN106096005A (en) 2016-06-23 2016-06-23 Spam filtering method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN106096005A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161109
