CN105306296A - Data filter processing method based on LTE (Long Term Evolution) signaling - Google Patents


Info

Publication number: CN105306296A (application CN201510694999.9A; granted publication CN105306296B)
Authority: CN (China)
Prior art keywords: data, text, classification, algorithm, training
Legal status: Granted; currently Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 窦慧晶, 卞婷婷
Original and current assignee: Beijing University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Beijing University of Technology; priority to CN201510694999.9A


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/02: Capturing of monitoring data
    • H04L 43/028: Capturing of monitoring data by filtering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L 63/0227: Filtering policies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data filtering method based on LTE (Long Term Evolution) signaling. Specifically, big-data filtering is performed in an LTE mobile core high-speed network system, adopting a four-stage hybrid filtering mode that combines simple filtering with deep content filtering. The method comprises the following steps: first, simple data preprocessing is performed through the five-tuple; second, the nature of the data source is determined with a KNN (K-Nearest Neighbor) text classification method; third, content-based third-stage information filtering is implemented through an optimized AdaBoost algorithm; finally, complete data filtering is achieved. Compared with conventional filtering methods, the proposed method avoids the omissions and screening errors that conventional LTE systems suffer in data filtering, achieves high stability and accuracy, and is highly robust. It can be applied directly in fields such as network security, network information data processing, and big-data analysis.

Description

A data filtering processing method based on LTE signaling
Technical field
The present invention relates to a data filtering processing method based on LTE signaling, and belongs to the technical field of data filtering.
Background technology
Five-tuple simple data filtering: variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter; together they form the basic elements of the five-tuple. Within a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List), according to which a single pass of information filtering is carried out on demand.
The KNN (K-Nearest Neighbor) algorithm is a statistics-based pattern recognition algorithm used mainly for text classification. Its basic idea is: given a new text, consider the texts in the training set that are nearest to it (i.e., most similar), and judge the category of the new text from the categories of those texts. That is, each text is regarded as an N-dimensional vector; the distances between the new text and the texts in the training set are computed, and the category of the new text is determined by those distances.
The optimized AdaBoost algorithm is a minimum-risk Bayes deep-filtering algorithm built on AdaBoost. AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by a minimum-risk Bayes classifier acting as AdaBoost's component classifier, so that the two algorithms are combined. The minimum-risk Bayes classifier addresses the error-rate problem on top of Bayes and naive Bayes classification; it is optimal in the minimum-error-rate sense. Bayesian classification takes a prior probability model of an object and uses Bayes' formula to compute its posterior probability, thereby obtaining the topic of the object source (the class with the maximum posterior probability is selected as the topic the source belongs to). From a set of training source data, the probability of each data item belonging to each class is obtained by Bayesian classification and a Bayesian classification model is constructed; naive Bayes minimizes the error rate within this model, needs few estimated parameters, and is very simple to implement. AdaBoost itself is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier). The main features of the algorithm are:
1. The five-tuple simple filtering algorithm, the KNN text classification method and the improved AdaBoost method together carry out three-stage deep content filtering, effectively guaranteeing the filtering performance and robustness of the system;
2. Speed and accuracy. The KNN text classification algorithm can select suitable documents according to the user's own needs, screen out useless documents, and classify large volumes of network data quickly and efficiently, making it suitable for information screening over massive data;
3. The optimized AdaBoost algorithm can discard unnecessary training-data features and concentrate on the key training data; it filters data according to different topic-screening strategies and takes the possibility of every classification error into account, greatly reducing the risk of misjudgment;
4. Reduced system load and improved operating efficiency. A clustering method is used to organize and classify the sample library automatically;
5. Stability. Through three-stage filtering, the processing capability of the filtering function is significantly increased.
The KNN text classification algorithm does have a defect, however. When the sample sizes are unbalanced, that is, when one class has very many samples while the other classes have very few, the K neighbors of a newly input sample may be dominated by samples of the large class. Because the algorithm considers only the "nearest" neighbor samples, a very large class can prevent the new sample from being matched to the correct class. This can be improved by changing the weights (increasing the weights of the neighbors at small distance from the sample), at the cost of added algorithmic complexity.
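The distance-weighting remedy just mentioned can be sketched as follows. This is an illustrative fragment rather than code from the patent, and the inverse-distance weight 1/(d + eps) is one common choice among several:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """Vote among k neighbors, weighting each vote by 1/(distance + eps).

    `neighbors` is a list of (distance, label) pairs.  The label with the
    largest accumulated inverse-distance weight wins, so a few very close
    minority-class samples can outvote many distant majority-class ones.
    """
    eps = 1e-9  # avoid division by zero for exact matches
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Majority class "A" contributes 3 distant neighbors, class "B" 2 close ones;
# a plain majority vote would pick "A", the weighted vote picks "B".
print(weighted_knn_vote([(5.0, "A"), (6.0, "A"), (7.0, "A"),
                         (0.5, "B"), (0.6, "B")]))  # B
```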
Summary of the invention
In view of the above problems, the object of the present invention is to provide an efficient and stable data filtering processing method based on LTE signaling. It uses five-tuple simple filtering for data preprocessing, then determines the nature of the source information with the KNN text classification method, and finally matches the data features against samples obtained by cluster analysis using the optimized AdaBoost method, realizing deep content filtering.
Its concrete steps comprise:
1. Five-tuple simple filtering.
First, a simple single-pass filter, five-tuple filtering, is applied to the network data. Variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter; together they form the basic elements of the five-tuple. In a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List), through which the information is filtered in a single pass.
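As an illustration of the masked five-tuple match described above. The field names, mask widths and rule values below are assumptions made for the sketch; in the patent this logic runs in the switching chip's PCL engine, not in software:

```python
def masked_match(value, rule_value, mask):
    """A field matches its rule when (value & mask) == (rule_value & mask)."""
    return (value & mask) == (rule_value & mask)

def five_tuple_filter(packet, rule):
    """Return True (keep) when every masked five-tuple field matches the rule.

    `packet` holds integer fields sip, dip, sp, dp, pt; `rule` holds the same
    fields plus a per-field mask (SIP_MASK etc. in the patent's notation).
    """
    return all(masked_match(packet[f], rule[f], rule[f + "_mask"])
               for f in ("sip", "dip", "sp", "dp", "pt"))

# Example rule: keep TCP traffic from 10.0.0.0/24 to any host on port 80.
rule = {"sip": 0x0A000000, "sip_mask": 0xFFFFFF00,   # 10.0.0.0/24
        "dip": 0x00000000, "dip_mask": 0x00000000,   # any destination
        "sp": 0,  "sp_mask": 0x0000,                 # any source port
        "dp": 80, "dp_mask": 0xFFFF,                 # destination port 80
        "pt": 6,  "pt_mask": 0xFF}                   # protocol 6 = TCP
pkt = {"sip": 0x0A00002A, "dip": 0xC0A80101, "sp": 51514, "dp": 80, "pt": 6}
print(five_tuple_filter(pkt, rule))  # True: hit, the packet is kept
```

A packet that misses the rule (wrong subnet, wrong port, or wrong protocol) would return False and be discarded, matching the hit/miss behaviour described in the embodiment.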
2. The KNN text classification method determines the nature of the source.
The data remaining after five-tuple filtering are called the new text. The new text is subjected to a KNN text classification computation against the texts in a given training set: for the newly input example (the new text), the K examples closest to it are found in the training set, and the class to which the majority of those K examples belong is the class of the new text. That is, the new text and each training text are regarded as N-dimensional vectors; the similarity between the new text and every text in the training set is computed, the K most similar samples are found, and the category of the new text is determined from the weighted distances and the categories of those training texts.
The KNN algorithm proceeds as follows:
1) For the new text and the training texts, form the new-text vector and the training-text vectors from the feature words.
Following the traditional vector space model, each text is formalized as a weighted feature vector in feature space, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words.
2) Compute the similarity between the new text and each text in the training set. The formula is:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$.
Regarding the value of k: the KNN method can be viewed as estimating the posterior probability $P(\omega_i \mid x)$ from samples, so for a reliable estimate a larger k is better, since it improves the accuracy of the estimate. On the other hand, the k neighbors should be as close to the new text as possible; writing the posterior estimated from the neighbors as $P(\omega_i \mid x_1)$, only when the k neighbors are close to the new text can $P(\omega_i \mid x_1)$ approach $P(\omega_i \mid x)$ as closely as possible. In the past, k was always chosen from personal experience, which often led to inaccurate estimates: if k is too small, the number of neighbors is too small and classification precision drops; if k is too large, noisy data are easily drawn in and classification accuracy drops. Extensive experiments have now shown that when k is taken as the number of all texts in the database, the classification result for the new text is the globally optimal solution.
3) Among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$.
4) Compare the class weights and assign the text to the class with the largest weight.
In summary, the KNN text classification method applies a second-stage filter to the data messages and determines the nature of the data source.
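Steps 1)-4) above can be sketched in a few lines. Cosine similarity plays the role of Sim and the summed similarity per class plays the role of the weight P(x, C_j); the vectors and class labels below are invented for illustration:

```python
import math
from collections import defaultdict

def cosine_sim(a, b):
    """Sim(d_i, d_j): dot product over the root of the product of squared norms."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def knn_classify(new_vec, training_set, k):
    """training_set: list of (feature_vector, label) pairs.

    Keeps the k most similar training texts, accumulates each class's summed
    similarity (the weight P(x, C_j) above), and returns the class with the
    largest weight.
    """
    nearest = sorted(((cosine_sim(new_vec, v), label)
                      for v, label in training_set), reverse=True)[:k]
    weight = defaultdict(float)
    for sim, label in nearest:
        weight[label] += sim
    return max(weight, key=weight.get)

train = [([1.0, 0.1], "signalling"), ([0.9, 0.0], "signalling"),
         ([0.0, 1.0], "junk")]
print(knn_classify([1.0, 0.0], train, k=3))  # signalling
```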
3. The optimized AdaBoost method performs deep content filtering.
The present invention proposes a minimum-risk Bayes deep-filtering algorithm based on AdaBoost: AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier. The two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm.
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier). The algorithm works by changing the data distribution: the weight of each sample is set according to whether it was classified correctly in each previous training round and the overall accuracy of the last round; the newly revised weights are passed to the next component classifier for training; finally the classifiers obtained from all training rounds are fused and the strongest final classifier is output.
Assume the training sample set is $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots\}$, $x_i \in X$, $y_i \in Y$, where $X$ and $Y$ correspond respectively to the positive-example and negative-example samples, $M$ is the maximum number of training cycles, the error rate of a classifier is denoted $\varepsilon_m$, and the minimum error rate is denoted $\varepsilon_{\min}$.
In the original AdaBoost algorithm, the individual decisions are integrated into a final decision by a weighted majority vote:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

where $P_m(x)$ is the decision function of the $m$-th classifier. AdaBoost appropriately integrates the mistakes of the learned weak classifiers: every iteration updates the weights, decreasing the weights of data the weak classifier classifies well and increasing the weights of data it classifies poorly; the final classifier is the weighted combination of the weak classifiers.
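The weighted majority vote can be written directly from the formula above. The threshold classifiers and weights below are invented for illustration:

```python
def adaboost_vote(x, classifiers, alphas):
    """P(x) = sign(sum_m alpha_m * P_m(x)): the weighted majority vote.

    classifiers: functions returning +1 or -1; alphas: their vote weights.
    """
    total = sum(a * h(x) for a, h in zip(alphas, classifiers))
    return 1 if total >= 0 else -1

# Three invented threshold classifiers with different weights.
h1 = lambda x: 1 if x >= 0 else -1
h2 = lambda x: 1 if x >= 2 else -1
h3 = lambda x: 1 if x >= 4 else -1
print(adaboost_vote(3, [h1, h2, h3], [0.5, 0.8, 0.4]))  # 0.5+0.8-0.4 > 0 -> 1
```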
Bayesian classification takes a prior probability model of an object and uses Bayes' formula to compute its posterior probability; that is, to decide which topic an object source belongs to, the class with the maximum posterior probability is selected as the source's topic. From the set of training source data, the probability of each data item belonging to each class is obtained via Bayesian theory and a Bayesian model is constructed. Naive Bayes minimizes the error rate within the Bayesian classification model, needs few estimated parameters, and is simple to implement. The minimum-risk Bayes classifier addresses the error-rate problem on top of Bayes and naive Bayes; it is optimal in the minimum-error-rate sense. In the present invention, if the system judged some data to be "sensitive data", treated it as junk and filtered it out, while it was in fact exactly the content the user required, the user would suffer a great loss. The minimum-risk Bayes classification determines the topic of the data source and filters according to different topic-screening strategies; by taking every possible classification error into account, it greatly reduces the risk of misjudgment.
Given $P(\omega_i)$ and $P(X \mid \omega_i)$, $i = 1, 2, \dots, c$, and the sample $X$ to be identified (the network packet to be filtered), the posterior probability is computed by Bayes' formula:

$$P(\omega_j \mid X) = \frac{P(X \mid \omega_j)\, P(\omega_j)}{\sum_{i=1}^{c} P(X \mid \omega_i)\, P(\omega_i)}, \quad j = 1, 2, \dots, c$$

where $P(\omega_i)$ is the prior probability, obtained from past analysis of the user's demand for network data; $P(\omega_j \mid X)$ is the posterior probability, i.e., the probability corrected after the information $X$ is obtained; and $P(X \mid \omega_i)$ is the probability, judged from past experience of the user's demand for network data, that the received sample $X$ is junk network data.
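Bayes' formula above simply normalizes the products of priors and likelihoods. A minimal sketch with invented numbers, class 0 standing for data the user needs and class 1 for junk:

```python
def posteriors(priors, likelihoods):
    """Bayes' formula: P(w_j|X) = P(X|w_j)P(w_j) / sum_i P(X|w_i)P(w_i).

    priors[j] = P(w_j); likelihoods[j] = P(X|w_j) for the observed packet X.
    Returns the list of posteriors P(w_j|X), which sums to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # the denominator of Bayes' formula
    return [j / evidence for j in joint]

# Invented numbers: prior 0.7/0.3 for useful/junk, likelihoods of X 0.1/0.6.
post = posteriors([0.7, 0.3], [0.1, 0.6])
print([round(p, 2) for p in post])  # [0.28, 0.72]: the packet is likely junk
```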
Denote the data loss by $\alpha$; the decision rule is defined as:
1) when junk network data is judged to be junk data, no loss whatsoever is caused: $\alpha = 0$;
2) when junk network data is judged to be valid data, the loss is likewise $\alpha = 0$;
3) when network data the user needs is judged to be junk data, the loss caused is immeasurable: $0 < \alpha < \infty$.
According to the computed posterior probabilities and the decision rule set above, the conditional risk of taking decision $d_i$, $i = 1, 2, \dots, a$, is computed as:

$$R(d_i \mid X) = \sum_{j=1}^{c} \alpha(d_i, \omega_j)\, P(\omega_j \mid X), \quad i = 1, 2, \dots, a$$

Since the loss after data are misjudged should be driven down to the minimum ($\alpha \to 0$), the $a$ conditional risk values $R(d_i \mid X)$, $i = 1, 2, \dots, a$, obtained above are compared, and the decision with the minimum conditional risk, denoted $d_k$, is found; $d_k$ is the minimum-risk Bayes classification decision.
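The conditional-risk computation and the minimum-risk choice can be sketched as follows. The loss matrix encodes the decision rules above under invented costs: zero loss for a correct call, a heavy loss (10 here, arbitrarily) for discarding data the user needs:

```python
def min_risk_decision(post, loss):
    """Conditional risk R(d_i|X) = sum_j loss[i][j] * P(w_j|X).

    post[j]: posterior of class w_j; loss[i][j]: cost of taking decision d_i
    when the true class is w_j.  Returns the index of the minimum-risk
    decision d_k together with all the risks.
    """
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    k = min(range(len(risks)), key=risks.__getitem__)
    return k, risks

# Decisions: 0 = keep the packet, 1 = discard as junk.  Classes: 0 = data the
# user needs, 1 = junk.  Discarding needed data is costed heavily (10.0);
# keeping a junk packet costs 1.0; correct calls cost 0.
k, risks = min_risk_decision([0.4, 0.6], [[0.0, 1.0], [10.0, 0.0]])
print(k, risks)  # 0 [0.6, 4.0]: keep, even though "junk" is more probable
```

This is precisely the asymmetry the patent aims at: the maximum-posterior rule would discard the packet, while the minimum-risk rule keeps it because discarding needed data is far more costly.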
The optimized AdaBoost method of the present invention is as follows. The network data are input in matrix form and the weights are initialized; the cycles $m = 1, 2, \dots, M$ are executed, substituting the weight values $\omega_i$ into the AdaBoost framework. The minimum-risk Bayes classifier is trained, yielding a hypothesis $P: X \to y_i$; the classifier traverses the whole data set and marks the samples that $P$ classifies correctly and incorrectly; the misjudged samples are counted against the total number of samples, the classification error rate $\varepsilon_m$ of $P$ is computed and from it the weight $\alpha_m$; $\alpha_m$ is then used to update the training-sample weights to $D_{m+1}(i)$, and the next cycle begins, until the $M$ cycles end. Through the repeated cycles, the AdaBoost-based minimum-risk Bayes classification algorithm accumulates $M$ classifiers $P_m$ and obtains:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

The final $P(x)$ is the classifier obtained after the $M$ rounds of learning in the content-based deep-filtering algorithm.
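The training loop just described can be sketched end to end, with one substitution: a simple threshold stump stands in for the minimum-risk Bayes weak classifier, purely to keep the sketch self-contained. The weight initialization, the error rate, the classifier weight and the distribution update follow the formulas given in the embodiment:

```python
import math

def train_adaboost(xs, ys, M=10):
    """AdaBoost on 1-D data with threshold stumps as weak classifiers.

    xs: floats; ys: labels in {-1, +1}.  Returns the final classifier
    P(x) = sign(sum_m alpha_m * P_m(x)).
    """
    n = len(xs)
    D = [1.0 / n] * n                 # initial sample weights D_1(i) = 1/N
    ensemble = []                     # [(alpha_m, stump_m), ...]
    for _ in range(M):
        # weak learner: pick the (threshold, polarity) stump of least
        # weighted error over the current distribution D
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(d for d, x, y in zip(D, xs, ys)
                          if (pol if x >= t else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        eps, t, pol = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)     # keep the log finite
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_m = (1/2)log((1-e)/e)
        stump = lambda x, t=t, pol=pol: pol if x >= t else -pol
        ensemble.append((alpha, stump))
        # D_{m+1}(i) = D_m(i) * exp(-alpha_m * y_i * P_m(x_i)) / Z_m
        D = [d * math.exp(-alpha * y * stump(x)) for d, x, y in zip(D, xs, ys)]
        z = sum(D)                                # normalizer Z_m
        D = [d / z for d in D]
    def classify(x):                  # P(x) = sign(sum_m alpha_m * P_m(x))
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return classify

clf = train_adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], M=5)
print(clf(0.5), clf(4.5))  # -1 1
```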
The present invention achieves the following beneficial effects:
Addressing the problems above, the present invention provides an efficient, stable data filtering processing method based on LTE signaling: five-tuple simple filtering preprocesses the data, the KNN text classification method then determines the nature of the source information, and finally the optimized AdaBoost method matches the data features against samples obtained by cluster analysis, realizing a complete data filtering processing method based on LTE signaling. The application scenario of the method is shown in Figure 1. The data processing comprises three stages: five-tuple simple data filtering of the LTE signaling data, determination of the nature of the data source by the KNN text classification method, and deep content filtering of the signaling data by the optimized AdaBoost method, which together complete the processing of the LTE data messages; the flow is shown in Figure 2. The invention attains higher filtering accuracy and system robustness than previous information filtering methods and can be applied directly in fields such as network security, network information data processing and big-data analysis.
Description of the drawings
Fig. 1: application scenario of the data filtering processing method based on LTE signaling.
Fig. 2: flow chart of the data filtering process.
Fig. 3: schematic diagram of the filtering processing unit.
Fig. 4: HASH rule setup.
Fig. 5: explanatory diagram of the KNN text classification method.
Embodiment
For a better understanding of the present invention, it is elaborated below with reference to the drawings and specific embodiments. The present invention proposes a new three-stage data filtering processing method based on LTE signaling. The main steps are described in detail below.
1. Five-tuple simple filtering
First, variables SIP, DIP, SP, DP and PT are defined to denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type in the five-tuple. The values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List). PCL is an advanced function of the switching chip: a filtering rule table implemented on the chip.
A data message first passes through the ingress PCL (IPCL) engine. An IPCL table is generated from the message type and the PCL-ID, and lookup matching is performed in TCAM with this table. The matching conditions are: first, the PCL-ID must be identical; then a data structure is defined as the function parameter and the PCL specified by the user layer is passed in; as many rules as there are, that many secondary matches must be performed. The five-tuple filtering of the present invention uses masks: source IP address + SIP_MASK, destination IP address + DIP_MASK, source port number + SP_MASK, destination port number + DP_MASK, and transport protocol type + PT_MASK each serve as a filter condition. A message that hits a rule is saved and output; a message that misses the rules is discarded.
2. KNN text classification method
1) Form the new-text vector and the training-text vectors from the feature words, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words.
2) Compute the similarity between the new text and each text in the training set. The formula is:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$.
3) Take k to be the number of all texts in the database, then sort the training texts by their Sim values from high to low; whatever k equals, that many are taken from the front of the ranking.
4) Among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$.
5) Compare the class weights: whichever class is most represented among the k neighbors is the class the new text belongs to. The message is assigned to the class with the largest weight.
The above second-stage filtering of the data by the KNN text classification method determines the nature of the data source.
3. The optimized AdaBoost method performs deep content filtering.
The optimized AdaBoost method of the present invention is the AdaBoost-based minimum-risk Bayes deep-filtering algorithm. AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier; the two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm.
1) The network data that passed the previous two filtering stages are converted into matrix form as input, and the rectangular feature set is computed under the given rectangular feature prototypes.
2) The weights are initialized and the cycles $m = 1, 2, \dots, M$ are executed, substituting the weight values $\omega_i$ into the AdaBoost framework.
3) With the feature set as input, the minimum-risk Bayes classifier is trained, yielding a hypothesis $P: X \to y_i$; the classifier traverses the whole data set and marks the samples that $P$ classifies correctly and incorrectly; the misjudged samples are counted against the total number of samples, the classification error rate $\varepsilon_m$ of $P$ is computed, and from it the classifier weight:

$$\alpha_m = \frac{1}{2} \log\left(\frac{1 - \varepsilon_m}{\varepsilon_m}\right)$$

4) Using $\alpha_m$, the training-sample weights are updated to

$$D_{m+1}(i) = \frac{D_m(i)\, \exp\left(-\alpha_m y_i P_m(x_i)\right)}{Z_m}$$

and the next cycle begins, until the $M$ cycles end.
5) Through the repeated cycles, the AdaBoost-based minimum-risk Bayes classification algorithm accumulates $M$ classifiers $P_m$ and obtains:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

The final $P(x)$ is the classifier obtained after the $M$ rounds of learning in the content-based deep-filtering algorithm.

Claims (1)

1. A data filtering processing method based on LTE signaling, characterized in that:
the method uses five-tuple simple filtering for data preprocessing, then determines the nature of the source information with the KNN text classification method, and finally matches the data features against samples obtained by cluster analysis using the optimized AdaBoost method, realizing deep content filtering;
Its concrete steps comprise:
1. five-tuple simple filtering;
first, a simple single-pass filter, five-tuple filtering, is applied to the network data; variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter, and together they form the basic elements of the five-tuple; in a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL, through which the information is filtered in a single pass;
2. the KNN text classification method determines the nature of the source;
the data remaining after five-tuple filtering are called the new text; the new text is subjected to a KNN text classification computation against the texts in a given training set: for the newly input example (the new text), the K examples closest to it are found in the training set, and the class to which the majority of those K examples belong is the class of the new text; that is, the new text and each training text are regarded as N-dimensional vectors, the similarity between the new text and every text in the training set is computed, the K most similar samples are found, and the category of the new text is determined from the weighted distances and the categories of those training texts;
the KNN algorithm proceeds as follows:
1) for the new text and the training texts, form the new-text vector and the training-text vectors from the feature words;
following the traditional vector space model, each text is formalized as a weighted feature vector in feature space, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words;
2) compute the similarity between the new text and each text in the training set, with the formula:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$;
regarding the value of k: the KNN method can be viewed as estimating the posterior probability $P(\omega_i \mid x)$ from samples, so for a reliable estimate a larger k is better, since it improves the accuracy of the estimate; on the other hand, the k neighbors should be as close to the new text as possible; writing the posterior estimated from the neighbors as $P(\omega_i \mid x_1)$, only when the k neighbors are close to the new text can $P(\omega_i \mid x_1)$ approach $P(\omega_i \mid x)$ as closely as possible; in the past, k was always chosen from personal experience, which often led to inaccurate estimates: if k is too small, the number of neighbors is too small and classification precision drops; if k is too large, noisy data are easily drawn in and classification accuracy drops; extensive experiments have now shown that when k is taken as the number of all texts in the database, the classification result for the new text is the globally optimal solution;
3) among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$;
4) compare the class weights and assign the text to the class with the largest weight;
in summary, the KNN text classification method applies a second-stage filter to the data messages and determines the nature of the data source;
3. the optimized AdaBoost method performs deep content filtering;
this method proposes a minimum-risk Bayes deep-filtering algorithm based on AdaBoost: AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier; the two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm;
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier); the algorithm works by changing the data distribution: the weight of each sample is set according to whether it was classified correctly in each previous training round and the overall accuracy of the last round; the newly revised weights are passed to the next component classifier for training; finally the classifiers obtained from all training rounds are fused and the strongest final classifier is output;
let the training sample set be $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots\}$, $x_i \in X$, $y_i \in Y$, where $X$ and $Y$ correspond respectively to the positive-example and negative-example samples, $M$ is the maximum number of training cycles, the error rate of a classifier is denoted $\varepsilon_m$, and the minimum error rate is denoted $\varepsilon_{\min}$;
in the original AdaBoost algorithm, the individual decisions are integrated into a final decision by a weighted majority vote:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

where $P_m(x)$ is the decision function of the $m$-th classifier; AdaBoost appropriately integrates the mistakes of the learned weak classifiers: every iteration updates the weights, decreasing the weights of data the weak classifier classifies well and increasing the weights of data it classifies poorly; the final classifier is the weighted combination of the weak classifiers;
The Bayesian classification algorithm starts from the prior probability model of an object and uses Bayes' formula to compute its posterior probability, i.e., which topic class the data source belongs to; the class with the maximum posterior probability is selected as the topic of the data source. From the set of training source data, Bayesian theory yields the probability of each data item belonging to each class, from which the Bayesian model is constructed. Naive Bayes has the minimum error rate among Bayesian classification models, requires few estimated parameters, and is simple to implement. The minimum-risk Bayes classification algorithm addresses the error-rate problem on the basis of Bayes and naive Bayes, and is an optimization in the minimum-error-rate sense. In this method, if the system judges data to be "sensitive data" and filters it out as junk data, but that data is exactly the content the user needs, a large loss is caused to the user. By determining the topic of the data source with minimum-risk Bayes classification and filtering according to different topic-filtering strategies, all classification errors are taken into account and the risk of misjudgment can be greatly reduced;
Given P(ω_i) and P(X|ω_i), i = 1, 2, …, c, and an X to be identified (a network packet to be filtered), the posterior probability is computed by Bayes' formula:
P(\omega_j \mid X) = \frac{P(X \mid \omega_j)\, P(\omega_j)}{\sum_{i=1}^{c} P(X \mid \omega_i)\, P(\omega_i)}, \quad j = 1, 2, \ldots, c
where P(ω_i) is the prior probability, obtained from analysis of the user's past demand for network data; P(ω_j|X) is the posterior probability, i.e., the probability corrected after the information X is observed; and P(X|ω_i) is the class-conditional probability, used to judge, from past experience of the user's demand for network data, whether the received X to be identified is junk network data;
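A minimal sketch of this posterior computation, directly mirroring the formula above (the function name, argument names, and example numbers are illustrative):

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: P(w_j | X) = P(X|w_j) P(w_j) / sum_i P(X|w_i) P(w_i).

    priors[i]      -- P(w_i), prior for class i (from past user-demand analysis)
    likelihoods[i] -- P(X | w_i), class-conditional probability of packet X
    Returns the list of posteriors P(w_j | X), which sums to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # the denominator: total probability of observing X
    return [j / evidence for j in joint]
```

For example, with equal priors [0.5, 0.5] and likelihoods [0.8, 0.2], the posteriors come out as [0.8, 0.2]: with no prior preference, the class-conditional evidence alone decides.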
Denote the loss caused by a decision on the data as α; the decision rules are defined as:
1) when junk network data is judged as junk data, no loss is caused, α = 0;
2) when junk network data is judged as valid data, the loss is negligible, α → 0;
3) when network data needed by the user is judged as junk data, the loss caused is immeasurable, 0 < α < ∞;
According to the posterior probabilities computed above and the decision rules just set, the conditional risk of taking decision d_i, i = 1, 2, …, a, is calculated as follows:
R(d_i \mid X) = \sum_{j=1}^{c} \alpha(d_i, \omega_j)\, P(\omega_j \mid X), \quad i = 1, 2, \ldots, a
Since the loss α after a misjudgment should be driven as close to 0 as possible, the a conditional risk values R(d_i|X), i = 1, 2, …, a, obtained above are compared, and the decision that minimizes the conditional risk is found and denoted d_k; d_k is exactly the minimum-risk Bayes classification decision;
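The conditional-risk minimization can be illustrated as follows. The loss-matrix values below are invented for illustration only, following the decision rules above: zero cost for correctly filtering junk, a small cost for letting junk through, and a large cost for discarding data the user needs:

```python
def min_risk_decision(loss, post):
    """Pick the decision d_k minimizing R(d_i|X) = sum_j loss[i][j] * P(w_j|X).

    loss[i][j] -- alpha(d_i, w_j): cost of decision d_i when the true class is w_j
    post[j]    -- posterior P(w_j | X)
    Returns (index of the minimum-risk decision, list of all conditional risks).
    """
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    k = min(range(len(risks)), key=risks.__getitem__)
    return k, risks

# Classes: w_0 = junk, w_1 = user data. Decisions: d_0 = filter out, d_1 = keep.
loss = [[0, 5],   # filtering: free if truly junk, costly if it was user data
        [1, 0]]   # keeping:  small cost if junk slips through, free otherwise
k, risks = min_risk_decision(loss, [0.7, 0.3])
```

With posteriors [0.7, 0.3] the packet would be labelled junk under maximum-posterior classification, yet the minimum-risk decision is to keep it (risk 0.7 versus 1.5), because discarding user data is far more expensive: precisely the risk-averse behavior the method aims for.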
The AdaBoost method optimized by this approach is as follows:
Input the network data in matrix form and initialize the weights ω_i = 1/n, i = 1, 2, …, n; execute the loop m = 1, 2, …, M: substitute the values ω_i into the AdaBoost framework and train with the minimum-risk Bayes classifier, obtaining the hypothesis P: X → Y; traverse the whole data set with the classifier and mark the samples P classifies correctly and those it misclassifies; from the number of misjudged samples out of the total number of samples, calculate the classification error rate ε_m of P and from it the classifier weight α_m; update the training-sample weights with the newly obtained values and start the next round of the loop, until the M-th round ends. Through repeated looping, the minimum-risk Bayes classification algorithm based on AdaBoost accumulates M classifiers P_m, and the algorithm yields:
P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]
The final P(x) is exactly the final classifier obtained after M rounds of training in the content-based deep-filtering algorithm.
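The boosting loop described above can be sketched as follows. This is a minimal illustration of the AdaBoost mechanics only: for brevity the weak learner is a one-feature decision stump rather than the patent's minimum-risk Bayes classifier, and the weight formulas (α_m = ½ ln((1−ε_m)/ε_m) and the exponential re-weighting) are the standard AdaBoost ones, not quoted from the patent:

```python
import numpy as np

def adaboost_train(X, y, M=10):
    """Boosting loop sketch: labels y in {-1, +1}, X an (n, d) feature matrix.

    Each round fits the best weighted decision stump, computes its error rate
    eps_m and vote weight alpha_m, and re-weights the samples so misclassified
    ones count more in the next round.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)              # initial sample weights w_i = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        best = None                      # (error, feature, threshold, polarity)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] < t, -s, s)
                    err = w[pred != y].sum()     # weighted error rate eps_m
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight alpha_m
        pred = np.where(X[:, j] < t, -s, s)
        w *= np.exp(-alpha * y * pred)           # boost misclassified samples
        w /= w.sum()                             # renormalize the distribution
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Final strong classifier: sign(sum_m alpha_m * P_m(x))."""
    score = sum(a * np.where(X[:, j] < t, -s, s)
                for a, (j, t, s) in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)
```

Substituting the minimum-risk Bayes classifier for the stump only changes the inner "fit a weak learner on the weighted samples" step; the error-rate computation, α_m weighting, and sample re-weighting stay the same.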
CN201510694999.9A 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings Expired - Fee Related CN105306296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510694999.9A CN105306296B (en) 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings

Publications (2)

Publication Number Publication Date
CN105306296A true CN105306296A (en) 2016-02-03
CN105306296B CN105306296B (en) 2018-10-12

Family

ID=55203076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510694999.9A Expired - Fee Related CN105306296B (en) 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings

Country Status (1)

Country Link
CN (1) CN105306296B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN104598586A (en) * 2015-01-18 2015-05-06 北京工业大学 Large-scale text classifying method
CN104750850A (en) * 2015-04-14 2015-07-01 中国地质大学(武汉) Feature selection method based on information gain ratio


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Mingfeng: "A Survey of Bayesian Methods for Spam Filtering", Computer Applications and Research (《计算机应用与研究》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107332704A (en) * 2017-07-03 2017-11-07 南京华苏科技有限公司 Assess the method and system that high-speed railway mobile subscriber uses LTE service quality
CN107908720A (en) * 2017-11-14 2018-04-13 河北工程大学 A kind of patent data cleaning method and system based on AdaBoost algorithms
CN108009249A (en) * 2017-12-01 2018-05-08 北京中视广信科技有限公司 For the comment spam filter method of the fusion user behavior rule of unbalanced data
CN108009249B (en) * 2017-12-01 2020-08-18 北京中视广信科技有限公司 Spam comment filtering method for unbalanced data and fusing user behavior rules
CN108091134A (en) * 2017-12-08 2018-05-29 北京工业大学 A kind of conventional data set creation method based on mobile phone signaling position track data
CN108091134B (en) * 2017-12-08 2020-09-25 北京市交通运行监测调度中心 Universal data set generation method based on mobile phone signaling position track data
WO2022087806A1 (en) * 2020-10-27 2022-05-05 Paypal, Inc. Multi-phase training techniques for machine learning models using weighted training data
AU2020474630B2 (en) * 2020-10-27 2024-01-25 Paypal, Inc. Multi-phase training techniques for machine learning models using weighted training data
CN112784910A (en) * 2021-01-28 2021-05-11 武汉市博畅软件开发有限公司 Deep filtering method and system for junk data
CN116192997A (en) * 2023-02-21 2023-05-30 上海兴容信息技术有限公司 Event detection method and system based on network flow
CN116192997B (en) * 2023-02-21 2023-12-01 兴容(上海)信息技术股份有限公司 Event detection method and system based on network flow

Also Published As

Publication number Publication date
CN105306296B (en) 2018-10-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181012

Termination date: 20211021