CN105306296A - Data filter processing method based on LTE (Long Term Evolution) signaling - Google Patents
- Publication number
- CN105306296A CN105306296A CN201510694999.9A CN201510694999A CN105306296A CN 105306296 A CN105306296 A CN 105306296A CN 201510694999 A CN201510694999 A CN 201510694999A CN 105306296 A CN105306296 A CN 105306296A
- Authority
- CN
- China
- Prior art keywords
- data
- text
- classification
- algorithm
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/02—Capturing of monitoring data
- H04L43/028—Capturing of monitoring data by filtering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a data filtering and processing method based on LTE (Long Term Evolution) signaling. In particular, big-data filtering is performed on an LTE mobile-core high-speed network system, adopting a hybrid filtering mode in which simple filtering and deep content filtering are combined. The method comprises the following steps: firstly, performing simple data preprocessing through the five-tuple; secondly, determining the nature of the data source with a KNN (K-Nearest Neighbor) text classification method; thirdly, implementing content-based third-stage information filtering through an optimized AdaBoost algorithm; and finally, completing the full data filtering process. Compared with conventional filtering methods, the method solves the problems of missed and erroneous screening in data filtering in conventional LTE systems, achieves high stability and accuracy, and provides very high robustness in data filtering. The method can be directly applied to fields such as network security, network information data processing and big data analysis.
Description
Technical field
The present invention relates to a data filtering and processing method based on LTE signaling, and belongs to the technical field of data filtering and processing.
Background technology
Five-tuple simple data filtering: first, define variables SIP, DIP, SP, DP and PT to be, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type in the five-tuple filter; they constitute the basic elements of the five-tuple. Within a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined according to the filtering policy, composing the PCL (Policy Control List), and simple information filtering is carried out on demand.
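The mask-controlled five-tuple match described above can be sketched as follows. All names (`Rule`, `matches`, `filter_packets`) and the sample rule and packets are illustrative, not taken from the patent's actual implementation; a field is compared only when its mask flag is set.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    sip: str
    dip: str
    sp: int
    dp: int
    pt: str
    # Mask flags: a field participates in matching only when its mask is set.
    sip_mask: bool = True
    dip_mask: bool = True
    sp_mask: bool = True
    dp_mask: bool = True
    pt_mask: bool = True

def matches(rule, pkt):
    """Return True if the packet's five-tuple hits the rule."""
    checks = [
        (rule.sip_mask, rule.sip, pkt["sip"]),
        (rule.dip_mask, rule.dip, pkt["dip"]),
        (rule.sp_mask,  rule.sp,  pkt["sp"]),
        (rule.dp_mask,  rule.dp,  pkt["dp"]),
        (rule.pt_mask,  rule.pt,  pkt["pt"]),
    ]
    return all((not masked) or rule_v == pkt_v for masked, rule_v, pkt_v in checks)

def filter_packets(pcl, packets):
    """Keep packets that hit at least one rule in the policy control list."""
    return [p for p in packets if any(matches(r, p) for r in pcl)]

# Source port is unmasked here, so it is ignored during matching.
pcl = [Rule("10.0.0.1", "10.0.0.2", 1234, 80, "TCP", sp_mask=False)]
pkts = [
    {"sip": "10.0.0.1", "dip": "10.0.0.2", "sp": 5555, "dp": 80, "pt": "TCP"},
    {"sip": "10.0.0.9", "dip": "10.0.0.2", "sp": 1234, "dp": 80, "pt": "TCP"},
]
kept = filter_packets(pcl, pkts)
```

Only the first packet survives: its source port differs from the rule's, but SP_MASK is cleared, so that field is not compared; the second packet fails on the source IP address.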
The KNN (K-Nearest Neighbor) algorithm is a statistics-based pattern recognition algorithm, mainly used in text classification. Its basic idea is: given a new text, consider the texts in the training text set that are nearest to it (i.e. most similar), and judge the class of the new text from the classes those texts belong to. That is, each text is regarded as an N-dimensional vector; the distances between the new text and the texts in the training set are calculated, and the class of the new text is determined by these distances.
The optimized AdaBoost algorithm is a minimum-risk Bayes deep-filtering algorithm based on the AdaBoost algorithm. It uses the AdaBoost algorithm as the training framework of the classifier, and replaces the weak classifier in the AdaBoost algorithm with the minimum-risk Bayes classification algorithm as AdaBoost's classifier, finally achieving the combination of the two algorithms. The minimum-risk Bayes classification algorithm addresses the error-rate problem on the basis of Bayes and naive Bayes, and is an optimization in the minimum-error-rate sense. A Bayes classification algorithm starts from the prior probability model of an object and uses Bayes' formula to compute its posterior probability, thereby obtaining the topic of the object source (selecting the class with the maximum posterior probability as the topic to which the object source belongs). From the set of training source data, the probability of each data message belonging to each class is obtained by the Bayes classification algorithm, and a Bayes classification model is constructed; naive Bayes is the Bayes classification model with the minimum error rate, requires few estimated parameters, and is very simple to implement. The AdaBoost algorithm is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers to finally form a strongest final classifier (strong classifier). The main features of the algorithm are:
1. The five-tuple simple filtering algorithm, the KNN text classification method and the improved AdaBoost method carry out three-stage deep content information filtering, effectively guaranteeing the filtering performance and robustness of the system;
2. High speed and accuracy. The KNN text classification algorithm can select suitable documents according to the user's own needs, filter out useless documents, and quickly and efficiently classify a large amount of network data automatically; it is suitable for information screening in mass data;
3. The optimized AdaBoost algorithm can exclude some unnecessary training data features, focus on the key training data, and filter data according to different topic-screening strategies; the possibility of every kind of classification error is taken into account, greatly reducing the risk of misjudgment;
4. Reduced system load and improved system running efficiency. A clustering method is adopted to organize and classify the sample library automatically;
5. Stability: through three-stage filtering, the processing capability of the filtering function can be significantly increased.
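The automatic organization of the sample library in feature 4 could, under the assumption of a simple k-means-style grouping, look like the following toy one-dimensional sketch; the function name, the initial centers and the sample points are all hypothetical and only illustrate the clustering idea:

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means: assign each point to its nearest center, then
    recompute each center as the mean of its group."""
    groups = {}
    for _ in range(iters):
        groups = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in groups.items()]
    return centers, groups

# Two obvious clusters around ~1 and ~8.
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers, groups = kmeans_1d(points, [0.0, 10.0])
```

After convergence the two centers sit at the group means, and each sample-library entry belongs to exactly one cluster.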
However, the KNN text classification algorithm also has defects. When the sample sizes are imbalanced — for example, one class has a very large number of samples while the other classes have very few — it is possible that when a new sample is input, the samples of the large-capacity class dominate among the K neighbors of this sample. Because the algorithm only considers the "nearest" neighboring samples, when the sample size of a certain class is very large, the new sample may fail to match the correct class. This can be improved by changing the weights (increasing the weights of the neighbors at small distance from the sample), but this increases the complexity of the algorithm.
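The weighting fix mentioned above (giving near neighbors larger weights so a distant majority class cannot swamp a close minority class) can be sketched as an inverse-distance vote; the helper name and the sample neighbor list are illustrative:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """neighbors: list of (distance, label); nearer neighbors get larger weight."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + 1e-9)   # inverse-distance weight
    return max(scores, key=scores.get)

# The majority class dominates by count, but the minority neighbor is much closer.
neighbors = [(0.1, "minority"), (2.0, "majority"), (2.1, "majority"), (2.2, "majority")]
label = weighted_knn_vote(neighbors)
```

A plain majority vote over these four neighbors would return "majority"; the distance weighting lets the single very close neighbor win, which is exactly the imbalance correction the text describes.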
Summary of the invention
In view of the above problems, the object of the present invention is to provide an efficient and stable data filtering and processing method based on LTE signaling. It adopts the five-tuple simple filtering method for data preprocessing, then adopts the KNN text classification method to determine the nature of the source information, and finally obtains data through the optimized AdaBoost method and performs feature matching with the samples obtained by cluster analysis, finally realizing deep content filtering.
Its concrete steps comprise:
1. Five-tuple simple filtering.
First, simple single filtering is carried out on the network data: five-tuple filtering. Define variables SIP, DIP, SP, DP and PT to represent, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type in the five-tuple filter; they constitute the basic elements of the five-tuple. Within a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined according to the filtering policy, composing the PCL (Policy Control List), and the information is thus single-filtered.
2. The KNN text classification method determines the source character.
The data after five-tuple filtering are called the new text. KNN text classification is computed between the new text and the texts in the given training text set: for the newly input example, i.e. the new text, the K examples in the training text set closest to it are found, and the class to which the majority of these K examples belong is the class of the new text. That is, the new text and the training texts are all regarded as N-dimensional vectors; the similarity between the new text and each text in the training set is calculated, the K most similar samples are found, and the class of the new text is determined by the weighted distances and the classes of the training texts.
The KNN algorithm proceeds as follows:
1) For the new text and the training texts, form the new text vector and the training text vectors according to the feature words.
According to the traditional vector space model, the text information is formalized as a weighted feature vector in feature space, i.e. D = D(T1, W1; T2, W2; …; Tn, Wn), and the vector representations of the new text and the training texts are determined according to the feature words.
2) Calculate the text similarity between the new text and each text in the training set. The computing formula is:

Sim(d_i, d_j) = (Σ_{k=1}^{M} W_ik · W_jk) / (sqrt(Σ_{k=1}^{M} W_ik²) · sqrt(Σ_{k=1}^{M} W_jk²))

where d_i is the feature vector of the new text, d_j is the center vector of the j-th class, M is the dimension of the feature vectors, and W_k is the k-th component of a vector.
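Assuming the text-similarity computation takes the standard cosine form, a minimal sketch is the following; the two vectors are illustrative:

```python
import math

def cosine_sim(di, dj):
    """Cosine similarity between a new-text vector and a class center vector."""
    dot = sum(wi * wj for wi, wj in zip(di, dj))
    norm = math.sqrt(sum(w * w for w in di)) * math.sqrt(sum(w * w for w in dj))
    return dot / norm if norm else 0.0

d_new = [1.0, 2.0, 0.0]
d_class = [2.0, 4.0, 0.0]   # parallel to d_new, so similarity is maximal
sim = cosine_sim(d_new, d_class)
```

Because the class center vector here is a scalar multiple of the new-text vector, the similarity comes out as 1.0, the maximum of the cosine measure.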
As for the value of k: since the KNN method can be viewed as estimating the posterior probability P(w_i|x) from samples, a reliable estimate calls for k to be as large as possible, which improves the accuracy of the estimation; on the other hand, the k neighbors should be as close to the new text as possible. Writing the posterior probability of the new text as P(w_i|x_1), only when the k neighbors are close to the new text can P(w_i|x_1) approach P(w_i|x) as much as possible. In the past, the value of k was always determined by personal experience, so inaccurate estimates often occurred: if k is chosen too small, too few neighbors are obtained and classification precision drops; if k is chosen too large, noisy data are easily introduced and classification accuracy drops. It has now been shown through a large number of experiments that when k is taken as the number of all texts in the database, the classification result of the new text is the globally optimal solution.
3) Among the k neighbors of the new text, calculate the weight of each class in turn:

W(x, C_j) = Σ_{d_i ∈ kNN(x)} Sim(x, d_i) · y(d_i, C_j)

where x is the feature vector of the new text, Sim(x, d_i) is the similarity formula above, and y(d_i, C_j) indicates the data-source class, i.e. whether training text d_i belongs to class C_j.
4) Compare the class weights and assign the new text to the class with the largest weight.
In summary, the KNN text classification method performs second-stage filtering on the data messages and determines the character of the data source.
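Steps 1)–4) above can be sketched end to end. The feature vectors and class labels here are illustrative, the patent's feature-word extraction is not shown, and per the discussion above k is taken as the full size of the training set:

```python
import math
from collections import defaultdict

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(new_vec, training_set, k):
    """training_set: list of (vector, class); return the highest-weight class."""
    sims = sorted(((cosine_sim(new_vec, v), c) for v, c in training_set),
                  reverse=True)[:k]
    weights = defaultdict(float)
    for sim, cls in sims:
        weights[cls] += sim          # class weight = sum of similarities
    return max(weights, key=weights.get)

training = [([1, 0, 0], "benign"), ([0.9, 0.1, 0], "benign"),
            ([0, 1, 1], "garbage"), ([0, 0.8, 1], "garbage")]
result = knn_classify([1, 0.05, 0], training, k=len(training))  # k = all texts
```

The new vector is nearly parallel to the "benign" training vectors, so the summed similarity weight of that class dominates and the sample is classified accordingly.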
3. Deep content filtering by the optimized AdaBoost method.
The present invention proposes a minimum-risk Bayes deep-filtering algorithm based on the AdaBoost algorithm: using the AdaBoost algorithm as the training framework of the classifier, the weak classifier in the AdaBoost algorithm is replaced with the minimum-risk Bayes classification algorithm as AdaBoost's classifier, achieving the combination of the two algorithms, namely the minimum-risk Bayes deep-filtering algorithm based on AdaBoost.
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers to finally form a strongest final classifier (strong classifier). The algorithm works by changing the data distribution: according to whether each sample in each training set is classified correctly, together with the overall classification accuracy of the previous round, the weight of each sample is determined; the newly revised weights are passed to the lower-level classifier for training, and finally the classifiers obtained from each round of training are merged to output the final strongest classifier.
Assume the training sample set is S = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i)}, x_i ∈ X, y_i ∈ Y, where X and Y correspond respectively to the positive example samples and the negative example samples, M is the maximum number of training cycles, the error rate of the classifier is denoted ε_m, and the minimal error rate is denoted ε_min.
In the original AdaBoost algorithm, all the decisions are integrated by a weighted majority vote to produce the final decision:

P(x) = sign(Σ_{m=1}^{M} α_m · P_m(x))

where P_m(x) is the decision function of the m-th classifier and α_m its voting weight. The AdaBoost algorithm can appropriately integrate the errors of the learned weak classifiers: each iteration updates the weights, reducing the weights of the data the weak classifier classifies well and increasing the weights of the data it classifies poorly; the final classifier is a weighted average of the weak classifiers.
A Bayes classification algorithm starts from the prior probability model of an object and uses Bayes' formula to compute its posterior probability, i.e. which topic the object source belongs to, selecting the class with the maximum posterior probability as the topic of the object source. From the set of training source data, the probability of each data message belonging to each class is obtained by Bayes theory, and a Bayes model is constructed. Naive Bayes is the Bayes classification model with the minimum error rate; it requires few estimated parameters and is simple to implement. The minimum-risk Bayes classification algorithm addresses the error-rate problem on the basis of Bayes and naive Bayes, and is an optimization in the minimum-error-rate sense. In the present invention, if the system judges data to be "sensitive data", treats it as junk data and filters it out, yet it is exactly the content the user requires, a great loss is caused to the user. Determining the topic of the data source with the minimum-risk Bayes classification method and filtering according to different topic-screening strategies takes all classification errors into account and can greatly reduce the risk of misjudgment.
Given the prior probabilities P(ω_i) and the class-conditional probabilities P(X|ω_i), i = 1, 2, …, c, for a sample X to be identified (a network packet to be filtered), the posterior probability is computed by Bayes' formula:

P(ω_j|X) = P(X|ω_j) · P(ω_j) / Σ_{i=1}^{c} P(X|ω_i) · P(ω_i)

where P(ω_i) is the prior probability, obtained from analysis of the user's past demand for network data; P(ω_j|X) is the posterior probability, i.e. the probability corrected after the information X is obtained; and P(X|ω_i) is the probability, judged from the user's past demand for network data, that the received sample X is garbage network data.
Denote the data loss by α. The decision rules are defined as:
1) When garbage network data is judged as garbage data, the judgment causes no loss: α = 0;
2) When garbage network data is judged as valid data, the loss is negligible: α → 0;
3) When network data needed by the user is judged as garbage data, the loss caused is immeasurable: 0 < α < ∞.
According to the posterior probabilities computed above and the decision rules set, the conditional risk of taking decision d_i, i = 1, 2, …, a, is computed as:

R(d_i|X) = Σ_{j=1}^{c} α(d_i, ω_j) · P(ω_j|X)

Taking the loss of misjudged data into account and driving it toward a minimum, the a conditional risk values R(d_i|X), i = 1, 2, …, a, obtained above are compared, and the decision that minimizes the conditional risk is found among them and denoted d_k; d_k is the minimum-risk Bayes classification decision.
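The posterior computation and the minimum-conditional-risk decision above can be sketched together. The priors, likelihoods and the loss table below are illustrative numbers, not values from the patent; the loss table encodes decision rule 3 (dropping needed data is very costly) and rule 2 (keeping garbage costs little):

```python
priors = {"garbage": 0.6, "valid": 0.4}        # P(w_i)
likelihood = {"garbage": 0.2, "valid": 0.7}    # P(X | w_i) for the observed X

# Bayes' formula: posterior = likelihood * prior / evidence
evidence = sum(likelihood[c] * priors[c] for c in priors)
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}

# loss[d][w]: loss of taking decision d when the true class is w
loss = {
    "drop": {"garbage": 0.0, "valid": 10.0},   # dropping needed data is costly
    "keep": {"garbage": 0.1, "valid": 0.0},    # keeping garbage costs little
}

# Conditional risk R(d|X) = sum_w loss(d, w) * P(w|X); pick the minimum.
risk = {d: sum(loss[d][w] * posterior[w] for w in priors) for d in loss}
decision = min(risk, key=risk.get)
```

With these numbers the posterior favors "valid" (0.7 vs 0.3), so dropping the packet carries a large expected loss and the minimum-risk decision is to keep it — the asymmetric-loss behavior the decision rules describe.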
The optimized AdaBoost method of the present invention is as follows: input the network data in matrix form and initialize the weights ω_i = 1/n, i = 1, 2, …, n. Execute the loop m = 1, 2, …, M: substitute the values ω_i into the AdaBoost framework; train the minimum-risk Bayes classifier to obtain the hypothesis P: X → y_i; let the classifier traverse the whole data set and mark the samples P classifies correctly and the samples it classifies incorrectly; count the misjudged samples against the total number of samples to compute the classification error rate ε_m of P; update the weights of the training samples according to ε_m; then continue with the next round of the loop until the M loops end. Through repeated loops, the minimum-risk Bayes classification algorithm based on AdaBoost yields M classifiers P_m, and the algorithm obtains:

P(x) = sign(Σ_{m=1}^{M} α_m · P_m(x))

The final P(x) is the final classifier obtained after M rounds of learning in the content-based deep filtering algorithm.
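The training loop above can be sketched compactly. For brevity the weak learner here is a weighted threshold stump standing in for the patent's minimum-risk Bayes classifier; the data, function names and the choice of M are illustrative:

```python
import math

def train_stump(xs, ys, w):
    """Pick the threshold/polarity minimizing weighted error on 1-D data."""
    best = None
    for thr in xs:
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, M=5):
    n = len(xs)
    w = [1.0 / n] * n                      # initialize weights w_i = 1/n
    ensemble = []
    for _ in range(M):                     # loop m = 1, ..., M
        err, thr, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)              # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Reweight: misclassified samples up, correctly classified ones down.
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = pol if x >= thr else -pol
            w[i] *= math.exp(-alpha * y * pred)
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote: P(x) = sign(sum_m alpha_m * P_m(x))."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
preds = [predict(model, x) for x in xs]
```

The per-round weight update and the final sign-of-weighted-sum vote are the two mechanics the text describes; only the weak learner differs from the patent's.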
The present invention can obtain following beneficial effect:
In view of the above problems, the object of the present invention is to provide an efficient and stable data filtering and processing method based on LTE signaling. It adopts the five-tuple simple filtering method for data preprocessing, then adopts the KNN text classification method to determine the nature of the source information, and finally obtains data through the optimized AdaBoost method and performs feature matching with the samples obtained by cluster analysis, finally realizing a complete data filtering and processing method based on LTE signaling. The application scenario of the method is shown in Figure 1. The data processing is divided into three processes: five-tuple simple data filtering of the LTE signaling data; determining the character of the data source by the KNN text classification method; and deep content filtering of the signaling data by the optimized AdaBoost method, completing the data processing of the LTE data messages, as shown in Figure 2. The present invention has higher filtering accuracy and system robustness than previous information filtering methods, and can be directly applied to fields such as network security, network information data processing and big data analysis.
Brief description of the drawings
Fig. 1 Application scenario diagram of the data filtering and processing method for LTE signaling.
Fig. 2 Flow chart of the data filtering process.
Fig. 3 Schematic diagram of the filtering method unit.
Fig. 4 HASH rule configuration.
Fig. 5 Explanatory diagram of the KNN text classification method.
Embodiment
In order to better understand the present invention, it is described in detail below with reference to the drawings and specific embodiments. The present invention proposes a new three-stage data filtering and processing method based on LTE signaling. The key steps are described in detail below.
1. Five-tuple simple filtering
First, define variables SIP, DIP, SP, DP and PT to represent, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type of the five-tuple. According to the filtering policy, determine the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK, composing the PCL (Policy Control List). The PCL is an advanced function of the switching chip: a filtering rule list implemented on the switching chip.
The data message first passes through the ingress PCL (IPCL) engine. An IPCL table is generated according to the message type and PCL-ID, and a lookup match is performed in the TCAM using this table. The conditions for a successful match are: first, the PCL-IDs are identical; then a data structure is defined as a function parameter, and the PCL specified by the user layer is passed in; the fewer the rules, the fewer secondary matches need to be performed. The five-tuple filtering of the present invention uses masks, taking source IP address + mask SIP_MASK, destination IP address + mask DIP_MASK, source port number + mask SP_MASK, destination port number + mask DP_MASK, and transport protocol type + mask PT_MASK as the respective filter conditions. A message that hits a rule is saved and output; a message that misses all rules is discarded.
2. KNN text classification method
1) Form the new text vector and the training text vectors according to the feature words.
That is, D = D(T1, W1; T2, W2; …; Tn, Wn); the vector representations of the new text and the training texts are determined according to the feature words.
2) Calculate the text similarity between the new text and each text in the training set. The computing formula is:

Sim(d_i, d_j) = (Σ_{k=1}^{M} W_ik · W_jk) / (sqrt(Σ_{k=1}^{M} W_ik²) · sqrt(Σ_{k=1}^{M} W_jk²))

where d_i is the feature vector of the new text, d_j is the center vector of the j-th class, M is the dimension of the feature vectors, and W_k is the k-th component of a vector.
3) Choose k as the number of all texts in the database, then sort by the value of Sim from high to low and take the top k entries.
4) Among the k neighbors of the new text, calculate the weight of each class in turn:

W(x, C_j) = Σ_{d_i ∈ kNN(x)} Sim(x, d_i) · y(d_i, C_j)

where x is the feature vector of the new text, Sim(x, d_i) is the similarity formula above, and y(d_i, C_j) indicates the data-source class, i.e. whether training text d_i belongs to class C_j.
5) Compare the class weights: the class to which the most of the k neighbors belong is the class of the new text, and the message is assigned to the class with the largest weight.
The above second-stage filtering by the KNN text classification method determines the character of the data source.
3. Deep content filtering by the optimized AdaBoost method.
The optimized AdaBoost method of the present invention is the minimum-risk Bayes deep-filtering algorithm based on the AdaBoost algorithm. It uses the AdaBoost algorithm as the training framework of the classifier, replaces the weak classifier in the AdaBoost algorithm with the minimum-risk Bayes classification algorithm as AdaBoost's classifier, and achieves the combination of the two algorithms, namely the minimum-risk Bayes deep-filtering algorithm based on AdaBoost.
1) Convert the network data after the previous two filtering stages into matrix form as input, and compute the rectangular feature set under the given rectangular feature prototypes.
2) Initialize the weights ω_i = 1/n, i = 1, 2, …, n; execute the loop m = 1, 2, …, M, substituting the values ω_i into the AdaBoost framework.
3) With the feature set as input, train the minimum-risk Bayes classifier to obtain the hypothesis P: X → y_i; let the classifier traverse the whole data set, mark the samples P classifies correctly and the samples it classifies incorrectly, count the misjudged samples against the total number of samples, and compute the classification error rate ε_m of P.
4) Update according to the classification error rate ε_m to obtain the new weights of the training samples.
Continue with the next round of the loop until the M loops end.
5) Through repeated loops, the minimum-risk Bayes classification algorithm based on AdaBoost yields M classifiers P_m, and the algorithm obtains:

P(x) = sign(Σ_{m=1}^{M} α_m · P_m(x))

The final P(x) is the final classifier obtained after M rounds of learning in the content-based deep filtering algorithm.
Claims (1)
1. A data filtering and processing method based on LTE signaling, characterized in that:
the method adopts the five-tuple simple filtering method for data preprocessing, then adopts the KNN text classification method to determine the nature of the source information, and finally obtains data through the optimized AdaBoost method and performs feature matching with the samples obtained by cluster analysis, finally realizing deep content filtering;
Its concrete steps comprise:
1. Five-tuple simple filtering;
First, simple single filtering is carried out on the network data: five-tuple filtering; define variables SIP, DIP, SP, DP and PT to represent, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type in the five-tuple filter; they constitute the basic elements of the five-tuple; within a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined according to the filtering policy, composing the PCL, and the information is thus single-filtered;
2. The KNN text classification method determines the source character;
The data after five-tuple filtering are called the new text; KNN text classification is computed between the new text and the texts in the given training text set: for the newly input example, i.e. the new text, the K examples in the training text set closest to it are found, and the class to which the majority of these K examples belong is the class of the new text; that is, the new text and the training texts are all regarded as N-dimensional vectors, the similarity between the new text and each text in the training set is calculated, the K most similar samples are found, and the class of the new text is determined by the weighted distances and the classes of the training texts;
The KNN algorithm proceeds as follows:
1) For the new text and the training texts, form the new text vector and the training text vectors according to the feature words;
According to the traditional vector space model, the text information is formalized as a weighted feature vector in feature space, i.e. D = D(T1, W1; T2, W2; …; Tn, Wn), and the vector representations of the new text and the training texts are determined according to the feature words;
2) Calculate the text similarity between the new text and each text in the training set; the computing formula is:

Sim(d_i, d_j) = (Σ_{k=1}^{M} W_ik · W_jk) / (sqrt(Σ_{k=1}^{M} W_ik²) · sqrt(Σ_{k=1}^{M} W_jk²))

where d_i is the feature vector of the new text, d_j is the center vector of the j-th class, M is the dimension of the feature vectors, and W_k is the k-th component of a vector;
As for the value of k: since the KNN method can be viewed as estimating the posterior probability P(w_i|x) from samples, a reliable estimate calls for k to be as large as possible, which improves the accuracy of the estimation; on the other hand, the k neighbors should be as close to the new text as possible; writing the posterior probability of the new text as P(w_i|x_1), only when the k neighbors are close to the new text can P(w_i|x_1) approach P(w_i|x) as much as possible; in the past, the value of k was always determined by personal experience, so inaccurate estimates often occurred: if k is chosen too small, too few neighbors are obtained and classification precision drops; if k is chosen too large, noisy data are easily introduced and classification accuracy drops; it has now been shown through a large number of experiments that when k is taken as the number of all texts in the database, the classification result of the new text is the globally optimal solution;
3) Among the k neighbors of the new text, calculate the weight of each class in turn:

W(x, C_j) = Σ_{d_i ∈ kNN(x)} Sim(x, d_i) · y(d_i, C_j)

where x is the feature vector of the new text, Sim(x, d_i) is the similarity formula above, and y(d_i, C_j) indicates the data-source class, i.e. whether training text d_i belongs to class C_j;
4) Compare the class weights and assign the new text to the class with the largest weight;
In summary, the KNN text classification method performs second-stage filtering on the data messages and determines the character of the data source;
3. Deep content filtering by the optimized AdaBoost method;
This method proposes a minimum-risk Bayes deep-filtering algorithm based on the AdaBoost algorithm: using the AdaBoost algorithm as the training framework of the classifier, the weak classifier in the AdaBoost algorithm is replaced with the minimum-risk Bayes classification algorithm as AdaBoost's classifier, achieving the combination of the two algorithms, namely the minimum-risk Bayes deep-filtering algorithm based on AdaBoost;
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers to finally form a strongest final classifier (strong classifier); the algorithm works by changing the data distribution: according to whether each sample in each training set is classified correctly, together with the overall classification accuracy of the previous round, the weight of each sample is determined; the newly revised weights are passed to the lower-level classifier for training, and finally the classifiers obtained from each round of training are merged to output the final strongest classifier;
Let the training sample set be S = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i)}, x_i ∈ X, y_i ∈ Y, where X and Y correspond respectively to the positive example samples and the negative example samples, M is the maximum number of training cycles, the error rate of the classifier is denoted ε_m, and the minimal error rate is denoted ε_min;
In the original AdaBoost algorithm, all the decisions are integrated by a weighted majority vote to produce the final decision:

P(x) = sign(Σ_{m=1}^{M} α_m · P_m(x))

where P_m(x) is the decision function of the m-th classifier and α_m its voting weight; the AdaBoost algorithm can appropriately integrate the errors of the learned weak classifiers: each iteration updates the weights, reducing the weights of the data the weak classifier classifies well and increasing the weights of the data it classifies poorly; the final classifier is a weighted average of the weak classifiers;
A Bayes classification algorithm starts from the prior probability model of an object and uses Bayes' formula to compute its posterior probability, i.e. which topic the object source belongs to, selecting the class with the maximum posterior probability as the topic of the object source; from the set of training source data, the probability of each data message belonging to each class is obtained by Bayes theory, and a Bayes model is constructed; naive Bayes is the Bayes classification model with the minimum error rate, requires few estimated parameters, and is simple to implement; the minimum-risk Bayes classification algorithm addresses the error-rate problem on the basis of Bayes and naive Bayes, and is an optimization in the minimum-error-rate sense; in the method, if the system judges data to be "sensitive data", treats it as junk data and filters it out, yet it is exactly the content the user requires, a great loss is caused to the user; determining the topic of the data source with the minimum-risk Bayes classification method and filtering according to different topic-screening strategies takes all classification errors into account and can greatly reduce the risk of misjudgment;
Given the prior probabilities P(ω_i) and the class-conditional probabilities P(X|ω_i), i = 1, 2, …, c, for a sample X to be identified (a network packet to be filtered), the posterior probability is computed by Bayes' formula:

P(ω_j|X) = P(X|ω_j) · P(ω_j) / Σ_{i=1}^{c} P(X|ω_i) · P(ω_i)

where P(ω_i) is the prior probability, obtained from analysis of the user's past demand for network data; P(ω_j|X) is the posterior probability, i.e. the probability corrected after the information X is obtained; and P(X|ω_i) is the probability, judged from the user's past demand for network data, that the received sample X is garbage network data;
Denote the data loss by α; the decision rules are defined as:
1) When garbage network data is judged as garbage data, the judgment causes no loss: α = 0;
2) When garbage network data is judged as valid data, the loss is negligible: α → 0;
3) When network data needed by the user is judged as garbage data, the loss caused is immeasurable: 0 < α < ∞;
According to the posterior probabilities computed above and the decision rules just set, the conditional risk of taking decision d_i, i = 1, 2, …, a, is computed as

R(d_i|X) = Σ_{j=1}^{c} α_ij P(ω_j|X)

Since the goal is to drive the loss from misjudged data to its minimum (α → 0), the conditional risk values R(d_i|X), i = 1, 2, …, a, are compared and the decision that minimises the conditional risk, denoted d_k, is selected; d_k is the minimum-risk Bayes classification decision.
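A minimal sketch of the minimum-risk decision step: the conditional risk of each decision is the loss matrix applied to the posterior vector, and the decision with the smallest risk wins. The loss matrix and α value below are hypothetical, chosen only to mirror the asymmetry in the decision rules above:

```python
import numpy as np

def min_risk_decision(posteriors, loss):
    """R(d_i | X) = sum_j loss[i][j] * P(w_j | X); return the index k of
    the decision d_k minimising the conditional risk, plus all risks."""
    risks = np.asarray(loss) @ np.asarray(posteriors)
    return int(np.argmin(risks)), risks

# Hypothetical loss matrix: rows = decisions (0: filter as junk, 1: keep),
# columns = true class (0: junk, 1: data the user needs).
# Filtering data the user needs carries a large loss alpha.
alpha = 10.0
loss = [[0.0, alpha],   # decide "filter": no loss if truly junk, alpha if needed
        [1.0, 0.0]]     # decide "keep": small loss if junk slips through
k, risks = min_risk_decision(posteriors=[0.6, 0.4], loss=loss)
```

With these numbers the risk of filtering (0·0.6 + 10·0.4 = 4.0) exceeds the risk of keeping (1·0.6 + 0 = 0.6), so the packet is kept even though "junk" has the higher posterior, which is exactly the asymmetry the decision rules encode.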
The optimized AdaBoost procedure of this method is as follows. The network data are input in matrix form and the weights are initialized to ω_i = 1/n, i = 1, 2, …, n. A loop is executed for m = 1, 2, …, M: the values ω_i are substituted into the AdaBoost framework and a minimum-risk Bayes classifier is trained, yielding the hypothesis P: X ∈ y_i. The classifier traverses the whole data set and marks the samples that P classifies correctly and those it misclassifies; from the number of misclassified samples relative to the total number of samples, the classification error rate α_m of P is computed. The error rate α_m is then used to update the training-sample weights in the standard AdaBoost form,

ω_i ← ω_i · exp(β_m · I[P_m(x_i) ≠ y_i]) / Z_m, with β_m = ln((1 − α_m)/α_m),

where Z_m normalises the weights; the next round of the loop then begins, until all M rounds are finished. Through these repeated rounds, the minimum-risk Bayes classification algorithm based on AdaBoost accumulates M classifiers P_m, which the algorithm combines by weighted vote:

P(x) = arg max_y Σ_{m=1}^{M} β_m · I[P_m(x) = y]

The final P(x) is the final classifier obtained after the M rounds of learning in the content-based deep filtering algorithm.
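The boosting loop above can be sketched as follows. This is an illustrative implementation of the generic AdaBoost skeleton only: a 1-D threshold stump stands in for the patent's minimum-risk Bayes base classifier, and the data are made up:

```python
import numpy as np

def adaboost(X, y, M=10):
    """AdaBoost sketch. A threshold stump is the (hypothetical) base
    learner; labels y take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # initial weights w_i = 1/n
    stumps, betas = [], []
    for m in range(M):
        # base learner: pick the threshold/polarity with lowest weighted error
        best = None
        for thr in X:
            for pol in (1, -1):
                pred = np.where(pol * (X - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, pol, pred)
        err, thr, pol, pred = best
        err = max(err, 1e-12)            # avoid log(0) on a perfect round
        if err >= 0.5:                   # weak-learner condition violated
            break
        beta = 0.5 * np.log((1 - err) / err)  # classifier weight from error rate
        w *= np.exp(-beta * y * pred)         # up-weight misclassified samples
        w /= w.sum()                          # normalise
        stumps.append((thr, pol))
        betas.append(beta)
    def predict(x):
        s = sum(b * (1 if p * (x - t) >= 0 else -1)
                for (t, p), b in zip(stumps, betas))
        return 1 if s >= 0 else -1       # weighted vote of the M classifiers
    return predict

# Toy 1-D data: negatives below 0.5, positives above.
X = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
y = np.array([-1, -1, -1, 1, 1, 1])
clf = adaboost(X, y, M=5)
```

Each round re-weights the training set so the next base classifier concentrates on the samples the previous one got wrong, which is the mechanism the patent relies on to sharpen the minimum-risk Bayes classifier over M rounds.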
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510694999.9A CN105306296B (en) | 2015-10-21 | 2015-10-21 | A kind of data filtering processing method based on LTE signalings |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510694999.9A CN105306296B (en) | 2015-10-21 | 2015-10-21 | A kind of data filtering processing method based on LTE signalings |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105306296A true CN105306296A (en) | 2016-02-03 |
CN105306296B CN105306296B (en) | 2018-10-12 |
Family
ID=55203076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510694999.9A Expired - Fee Related CN105306296B (en) | 2015-10-21 | 2015-10-21 | A kind of data filtering processing method based on LTE signalings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105306296B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107332704A (en) * | 2017-07-03 | 2017-11-07 | 南京华苏科技有限公司 | Assess the method and system that high-speed railway mobile subscriber uses LTE service quality |
CN107908720A (en) * | 2017-11-14 | 2018-04-13 | 河北工程大学 | A kind of patent data cleaning method and system based on AdaBoost algorithms |
CN108009249A (en) * | 2017-12-01 | 2018-05-08 | 北京中视广信科技有限公司 | For the comment spam filter method of the fusion user behavior rule of unbalanced data |
CN108091134A (en) * | 2017-12-08 | 2018-05-29 | 北京工业大学 | A kind of conventional data set creation method based on mobile phone signaling position track data |
CN112784910A (en) * | 2021-01-28 | 2021-05-11 | 武汉市博畅软件开发有限公司 | Deep filtering method and system for junk data |
WO2022087806A1 (en) * | 2020-10-27 | 2022-05-05 | Paypal, Inc. | Multi-phase training techniques for machine learning models using weighted training data |
CN116192997A (en) * | 2023-02-21 | 2023-05-30 | 上海兴容信息技术有限公司 | Event detection method and system based on network flow |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050556A (en) * | 2014-05-27 | 2014-09-17 | 哈尔滨理工大学 | Feature selection method and detection method of junk mails |
CN104598586A (en) * | 2015-01-18 | 2015-05-06 | 北京工业大学 | Large-scale text classifying method |
CN104750850A (en) * | 2015-04-14 | 2015-07-01 | 中国地质大学(武汉) | Feature selection method based on information gain ratio |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104050556A (en) * | 2014-05-27 | 2014-09-17 | 哈尔滨理工大学 | Feature selection method and detection method of junk mails |
CN104598586A (en) * | 2015-01-18 | 2015-05-06 | 北京工业大学 | Large-scale text classifying method |
CN104750850A (en) * | 2015-04-14 | 2015-07-01 | 中国地质大学(武汉) | Feature selection method based on information gain ratio |
Non-Patent Citations (1)
Title |
---|
ZHANG, Mingfeng: "A Survey of Bayesian Methods for Spam Filtering" (垃圾邮件过滤的贝叶斯方法综述), 《计算机应用与研究》 (Application Research of Computers) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107332704A (en) * | 2017-07-03 | 2017-11-07 | 南京华苏科技有限公司 | Assess the method and system that high-speed railway mobile subscriber uses LTE service quality |
CN107908720A (en) * | 2017-11-14 | 2018-04-13 | 河北工程大学 | A kind of patent data cleaning method and system based on AdaBoost algorithms |
CN108009249A (en) * | 2017-12-01 | 2018-05-08 | 北京中视广信科技有限公司 | For the comment spam filter method of the fusion user behavior rule of unbalanced data |
CN108009249B (en) * | 2017-12-01 | 2020-08-18 | 北京中视广信科技有限公司 | Spam comment filtering method for unbalanced data and fusing user behavior rules |
CN108091134A (en) * | 2017-12-08 | 2018-05-29 | 北京工业大学 | A kind of conventional data set creation method based on mobile phone signaling position track data |
CN108091134B (en) * | 2017-12-08 | 2020-09-25 | 北京市交通运行监测调度中心 | Universal data set generation method based on mobile phone signaling position track data |
WO2022087806A1 (en) * | 2020-10-27 | 2022-05-05 | Paypal, Inc. | Multi-phase training techniques for machine learning models using weighted training data |
AU2020474630B2 (en) * | 2020-10-27 | 2024-01-25 | Paypal, Inc. | Multi-phase training techniques for machine learning models using weighted training data |
CN112784910A (en) * | 2021-01-28 | 2021-05-11 | 武汉市博畅软件开发有限公司 | Deep filtering method and system for junk data |
CN116192997A (en) * | 2023-02-21 | 2023-05-30 | 上海兴容信息技术有限公司 | Event detection method and system based on network flow |
CN116192997B (en) * | 2023-02-21 | 2023-12-01 | 兴容(上海)信息技术股份有限公司 | Event detection method and system based on network flow |
Also Published As
Publication number | Publication date |
---|---|
CN105306296B (en) | 2018-10-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105306296A (en) | Data filter processing method based on LTE (Long Term Evolution) signaling | |
Nikolov et al. | Unsupervised learning of link discovery configuration | |
CN109492026B (en) | Telecommunication fraud classification detection method based on improved active learning technology | |
CN109299741B (en) | Network attack type identification method based on multi-layer detection | |
CN102346829A (en) | Virus detection method based on ensemble classification | |
CN110460605B (en) | Abnormal network flow detection method based on automatic coding | |
CN108614997B (en) | Remote sensing image identification method based on improved AlexNet | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN101996241A (en) | Bayesian algorithm-based content filtering method | |
CN104391835A (en) | Method and device for selecting feature words in texts | |
CN101021838A (en) | Text handling method and system | |
CN106228389A (en) | Network potential usage mining method and system based on random forests algorithm | |
JP6897749B2 (en) | Learning methods, learning systems, and learning programs | |
BaygIn | Classification of text documents based on Naive Bayes using N-Gram features | |
CN111835707B (en) | Malicious program identification method based on improved support vector machine | |
CN113408605A (en) | Hyperspectral image semi-supervised classification method based on small sample learning | |
WO2020024444A1 (en) | Group performance grade recognition method and apparatus, and storage medium and computer device | |
CN107483451B (en) | Method and system for processing network security data based on serial-parallel structure and social network | |
CN112733936A (en) | Recyclable garbage classification method based on image recognition | |
CN109933619A (en) | A kind of semisupervised classification prediction technique | |
CN115909011A (en) | Astronomical image automatic classification method based on improved SE-inclusion-v 3 network model | |
Villa-Blanco et al. | Feature subset selection for data and feature streams: a review | |
CN112990371B (en) | Unsupervised night image classification method based on feature amplification | |
CN107169020B (en) | directional webpage collecting method based on keywords | |
CN105337842B (en) | A kind of rubbish mail filtering method unrelated with content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20181012 Termination date: 20211021 |
|
CF01 | Termination of patent right due to non-payment of annual fee |