CN105306296A - Data filter processing method based on LTE (Long Term Evolution) signaling - Google Patents


Info

Publication number: CN105306296A (application CN201510694999.9A; granted publication CN105306296B)
Authority: CN (China)
Prior art keywords: data, text, classification, algorithm, training
Legal status: Granted; currently Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 窦慧晶, 卞婷婷
Original and current assignee: Beijing University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Beijing University of Technology; priority to CN201510694999.9A


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/02: Capturing of monitoring data
    • H04L 43/028: Capturing of monitoring data by filtering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L 63/0227: Filtering policies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data filtering method based on LTE (Long Term Evolution) signaling. Specifically, big-data filtering is performed in an LTE mobile core high-speed network system, adopting a four-stage hybrid filtering mode that combines simple filtering with deep content filtering. The method comprises the following steps: first, simple data preprocessing is performed through the five-tuple; second, the nature of the data source is determined with a KNN (K-Nearest Neighbor) text classification method; third, content-based third-stage information filtering is implemented through an optimized AdaBoost algorithm; finally, complete data filtering is achieved. Compared with conventional filtering methods, the proposed method avoids the omissions and screening errors that conventional LTE systems suffer in data filtering, achieves high stability and accuracy, and is highly robust. It can be applied directly in fields such as network security, network information data processing, and big-data analysis.

Description

A data filtering processing method based on LTE signaling
Technical field
The present invention relates to a data filtering processing method based on LTE signaling, and belongs to the technical field of data filtering.
Background technology
Five-tuple simple data filtering: variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter; together they form the basic elements of the five-tuple. Within a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List), according to which a single pass of information filtering is carried out on demand.
The KNN (K-Nearest Neighbor) algorithm is a statistics-based pattern recognition algorithm used mainly for text classification. Its basic idea is: given a new text, consider the texts in the training set that are nearest to it (i.e., most similar), and judge the category of the new text from the categories of those texts. That is, each text is regarded as an N-dimensional vector; the distances between the new text and the texts in the training set are computed, and the category of the new text is determined by those distances.
The optimized AdaBoost algorithm is a minimum-risk Bayes deep-filtering algorithm built on AdaBoost. AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by a minimum-risk Bayes classifier acting as AdaBoost's component classifier, so that the two algorithms are combined. The minimum-risk Bayes classifier addresses the error-rate problem on top of Bayes and naive Bayes classification; it is optimal in the minimum-error-rate sense. Bayesian classification takes a prior probability model of an object and uses Bayes' formula to compute its posterior probability, thereby obtaining the topic of the object source (the class with the maximum posterior probability is selected as the topic the source belongs to). From a set of training source data, the probability of each data item belonging to each class is obtained by Bayesian classification and a Bayesian classification model is constructed; naive Bayes minimizes the error rate within this model, needs few estimated parameters, and is very simple to implement. AdaBoost itself is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier). The main features of the algorithm are:
1. The five-tuple simple filtering algorithm, the KNN text classification method and the improved AdaBoost method together carry out three-stage deep content filtering, effectively guaranteeing the filtering performance and robustness of the system;
2. Speed and accuracy. The KNN text classification algorithm can select suitable documents according to the user's own needs, screen out useless documents, and classify large volumes of network data quickly and efficiently, making it suitable for information screening over massive data;
3. The optimized AdaBoost algorithm can discard unnecessary training-data features and concentrate on the key training data; it filters data according to different topic-screening strategies and takes the possibility of every classification error into account, greatly reducing the risk of misjudgment;
4. Reduced system load and improved operating efficiency. A clustering method is used to organize and classify the sample library automatically;
5. Stability. Through three-stage filtering, the processing capability of the filtering function is significantly increased.
The KNN text classification algorithm does have a defect, however. When the sample sizes are unbalanced, that is, when one class has very many samples while the other classes have very few, the K neighbors of a newly input sample may be dominated by samples of the large class. Because the algorithm considers only the "nearest" neighbor samples, a very large class can prevent the new sample from being matched to the correct class. This can be improved by changing the weights (increasing the weights of the neighbors at small distance from the sample), at the cost of added algorithmic complexity.
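The distance-weighting remedy just mentioned can be sketched as follows. This is an illustrative fragment rather than code from the patent, and the inverse-distance weight 1/(d + eps) is one common choice among several:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """Vote among k neighbors, weighting each vote by 1/(distance + eps).

    `neighbors` is a list of (distance, label) pairs.  The label with the
    largest accumulated inverse-distance weight wins, so a few very close
    minority-class samples can outvote many distant majority-class ones.
    """
    eps = 1e-9  # avoid division by zero for exact matches
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Majority class "A" contributes 3 distant neighbors, class "B" 2 close ones;
# a plain majority vote would pick "A", the weighted vote picks "B".
print(weighted_knn_vote([(5.0, "A"), (6.0, "A"), (7.0, "A"),
                         (0.5, "B"), (0.6, "B")]))  # B
```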
Summary of the invention
In view of the above problems, the object of the present invention is to provide an efficient and stable data filtering processing method based on LTE signaling. It uses five-tuple simple filtering for data preprocessing, then determines the nature of the source information with the KNN text classification method, and finally matches the data features against samples obtained by cluster analysis using the optimized AdaBoost method, realizing deep content filtering.
Its concrete steps comprise:
1. Five-tuple simple filtering.
First, a simple single-pass filter, five-tuple filtering, is applied to the network data. Variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter; together they form the basic elements of the five-tuple. In a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List), through which the information is filtered in a single pass.
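As an illustration of the masked five-tuple match described above. The field names, mask widths and rule values below are assumptions made for the sketch; in the patent this logic runs in the switching chip's PCL engine, not in software:

```python
def masked_match(value, rule_value, mask):
    """A field matches its rule when (value & mask) == (rule_value & mask)."""
    return (value & mask) == (rule_value & mask)

def five_tuple_filter(packet, rule):
    """Return True (keep) when every masked five-tuple field matches the rule.

    `packet` holds integer fields sip, dip, sp, dp, pt; `rule` holds the same
    fields plus a per-field mask (SIP_MASK etc. in the patent's notation).
    """
    return all(masked_match(packet[f], rule[f], rule[f + "_mask"])
               for f in ("sip", "dip", "sp", "dp", "pt"))

# Example rule: keep TCP traffic from 10.0.0.0/24 to any host on port 80.
rule = {"sip": 0x0A000000, "sip_mask": 0xFFFFFF00,   # 10.0.0.0/24
        "dip": 0x00000000, "dip_mask": 0x00000000,   # any destination
        "sp": 0,  "sp_mask": 0x0000,                 # any source port
        "dp": 80, "dp_mask": 0xFFFF,                 # destination port 80
        "pt": 6,  "pt_mask": 0xFF}                   # protocol 6 = TCP
pkt = {"sip": 0x0A00002A, "dip": 0xC0A80101, "sp": 51514, "dp": 80, "pt": 6}
print(five_tuple_filter(pkt, rule))  # True: hit, the packet is kept
```

A packet that misses the rule (wrong subnet, wrong port, or wrong protocol) would return False and be discarded, matching the hit/miss behaviour described in the embodiment.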
2. The KNN text classification method determines the nature of the source.
The data remaining after five-tuple filtering are called the new text. The new text is subjected to a KNN text classification computation against the texts in a given training set: for the newly input example (the new text), the K examples closest to it are found in the training set, and the class to which the majority of those K examples belong is the class of the new text. That is, the new text and each training text are regarded as N-dimensional vectors; the similarity between the new text and every text in the training set is computed, the K most similar samples are found, and the category of the new text is determined from the weighted distances and the categories of those training texts.
The KNN algorithm proceeds as follows:
1) For the new text and the training texts, form the new-text vector and the training-text vectors from the feature words.
Following the traditional vector space model, each text is formalized as a weighted feature vector in feature space, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words.
2) Compute the similarity between the new text and each text in the training set. The formula is:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$.
Regarding the value of k: the KNN method can be viewed as estimating the posterior probability $P(\omega_i \mid x)$ from samples, so for a reliable estimate a larger k is better, since it improves the accuracy of the estimate. On the other hand, the k neighbors should be as close to the new text as possible; writing the posterior estimated from the neighbors as $P(\omega_i \mid x_1)$, only when the k neighbors are close to the new text can $P(\omega_i \mid x_1)$ approach $P(\omega_i \mid x)$ as closely as possible. In the past, k was always chosen from personal experience, which often led to inaccurate estimates: if k is too small, the number of neighbors is too small and classification precision drops; if k is too large, noisy data are easily drawn in and classification accuracy drops. Extensive experiments have now shown that when k is taken as the number of all texts in the database, the classification result for the new text is the globally optimal solution.
3) Among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$.
4) Compare the class weights and assign the text to the class with the largest weight.
In summary, the KNN text classification method applies a second-stage filter to the data messages and determines the nature of the data source.
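Steps 1)-4) above can be sketched in a few lines. Cosine similarity plays the role of Sim and the summed similarity per class plays the role of the weight P(x, C_j); the vectors and class labels below are invented for illustration:

```python
import math
from collections import defaultdict

def cosine_sim(a, b):
    """Sim(d_i, d_j): dot product over the root of the product of squared norms."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def knn_classify(new_vec, training_set, k):
    """training_set: list of (feature_vector, label) pairs.

    Keeps the k most similar training texts, accumulates each class's summed
    similarity (the weight P(x, C_j) above), and returns the class with the
    largest weight.
    """
    nearest = sorted(((cosine_sim(new_vec, v), label)
                      for v, label in training_set), reverse=True)[:k]
    weight = defaultdict(float)
    for sim, label in nearest:
        weight[label] += sim
    return max(weight, key=weight.get)

train = [([1.0, 0.1], "signalling"), ([0.9, 0.0], "signalling"),
         ([0.0, 1.0], "junk")]
print(knn_classify([1.0, 0.0], train, k=3))  # signalling
```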
3. The optimized AdaBoost method performs deep content filtering.
The present invention proposes a minimum-risk Bayes deep-filtering algorithm based on AdaBoost: AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier. The two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm.
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier). The algorithm works by changing the data distribution: the weight of each sample is set according to whether it was classified correctly in each previous training round and the overall accuracy of the last round; the newly revised weights are passed to the next component classifier for training; finally the classifiers obtained from all training rounds are fused and the strongest final classifier is output.
Assume the training sample set is $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots\}$, $x_i \in X$, $y_i \in Y$, where $X$ and $Y$ correspond respectively to the positive-example and negative-example samples, $M$ is the maximum number of training cycles, the error rate of a classifier is denoted $\varepsilon_m$, and the minimum error rate is denoted $\varepsilon_{\min}$.
In the original AdaBoost algorithm, the individual decisions are integrated into a final decision by a weighted majority vote:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

where $P_m(x)$ is the decision function of the $m$-th classifier. AdaBoost appropriately integrates the mistakes of the learned weak classifiers: every iteration updates the weights, decreasing the weights of data the weak classifier classifies well and increasing the weights of data it classifies poorly; the final classifier is the weighted combination of the weak classifiers.
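The weighted majority vote can be written directly from the formula above. The threshold classifiers and weights below are invented for illustration:

```python
def adaboost_vote(x, classifiers, alphas):
    """P(x) = sign(sum_m alpha_m * P_m(x)): the weighted majority vote.

    classifiers: functions returning +1 or -1; alphas: their vote weights.
    """
    total = sum(a * h(x) for a, h in zip(alphas, classifiers))
    return 1 if total >= 0 else -1

# Three invented threshold classifiers with different weights.
h1 = lambda x: 1 if x >= 0 else -1
h2 = lambda x: 1 if x >= 2 else -1
h3 = lambda x: 1 if x >= 4 else -1
print(adaboost_vote(3, [h1, h2, h3], [0.5, 0.8, 0.4]))  # 0.5+0.8-0.4 > 0 -> 1
```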
Bayesian classification takes a prior probability model of an object and uses Bayes' formula to compute its posterior probability; that is, to decide which topic an object source belongs to, the class with the maximum posterior probability is selected as the source's topic. From the set of training source data, the probability of each data item belonging to each class is obtained via Bayesian theory and a Bayesian model is constructed. Naive Bayes minimizes the error rate within the Bayesian classification model, needs few estimated parameters, and is simple to implement. The minimum-risk Bayes classifier addresses the error-rate problem on top of Bayes and naive Bayes; it is optimal in the minimum-error-rate sense. In the present invention, if the system judged some data to be "sensitive data", treated it as junk and filtered it out, while it was in fact exactly the content the user required, the user would suffer a great loss. The minimum-risk Bayes classification determines the topic of the data source and filters according to different topic-screening strategies; by taking every possible classification error into account, it greatly reduces the risk of misjudgment.
Given $P(\omega_i)$ and $P(X \mid \omega_i)$, $i = 1, 2, \dots, c$, and the sample $X$ to be identified (the network packet to be filtered), the posterior probability is computed by Bayes' formula:

$$P(\omega_j \mid X) = \frac{P(X \mid \omega_j)\, P(\omega_j)}{\sum_{i=1}^{c} P(X \mid \omega_i)\, P(\omega_i)}, \quad j = 1, 2, \dots, c$$

where $P(\omega_i)$ is the prior probability, obtained from past analysis of the user's demand for network data; $P(\omega_j \mid X)$ is the posterior probability, i.e., the probability corrected after the information $X$ is obtained; and $P(X \mid \omega_i)$ is the probability, judged from past experience of the user's demand for network data, that the received sample $X$ is junk network data.
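Bayes' formula above simply normalizes the products of priors and likelihoods. A minimal sketch with invented numbers, class 0 standing for data the user needs and class 1 for junk:

```python
def posteriors(priors, likelihoods):
    """Bayes' formula: P(w_j|X) = P(X|w_j)P(w_j) / sum_i P(X|w_i)P(w_i).

    priors[j] = P(w_j); likelihoods[j] = P(X|w_j) for the observed packet X.
    Returns the list of posteriors P(w_j|X), which sums to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # the denominator of Bayes' formula
    return [j / evidence for j in joint]

# Invented numbers: prior 0.7/0.3 for useful/junk, likelihoods of X 0.1/0.6.
post = posteriors([0.7, 0.3], [0.1, 0.6])
print([round(p, 2) for p in post])  # [0.28, 0.72]: the packet is likely junk
```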
Denote the data loss by $\alpha$; the decision rule is defined as:
1) when junk network data is judged to be junk data, no loss whatsoever is caused: $\alpha = 0$;
2) when junk network data is judged to be valid data, the loss is likewise $\alpha = 0$;
3) when network data the user needs is judged to be junk data, the loss caused is immeasurable: $0 < \alpha < \infty$.
According to the computed posterior probabilities and the decision rule set above, the conditional risk of taking decision $d_i$, $i = 1, 2, \dots, a$, is computed as:

$$R(d_i \mid X) = \sum_{j=1}^{c} \alpha(d_i, \omega_j)\, P(\omega_j \mid X), \quad i = 1, 2, \dots, a$$

Since the loss after data are misjudged should be driven down to the minimum ($\alpha \to 0$), the $a$ conditional risk values $R(d_i \mid X)$, $i = 1, 2, \dots, a$, obtained above are compared, and the decision with the minimum conditional risk, denoted $d_k$, is found; $d_k$ is the minimum-risk Bayes classification decision.
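The conditional-risk computation and the minimum-risk choice can be sketched as follows. The loss matrix encodes the decision rules above under invented costs: zero loss for a correct call, a heavy loss (10 here, arbitrarily) for discarding data the user needs:

```python
def min_risk_decision(post, loss):
    """Conditional risk R(d_i|X) = sum_j loss[i][j] * P(w_j|X).

    post[j]: posterior of class w_j; loss[i][j]: cost of taking decision d_i
    when the true class is w_j.  Returns the index of the minimum-risk
    decision d_k together with all the risks.
    """
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    k = min(range(len(risks)), key=risks.__getitem__)
    return k, risks

# Decisions: 0 = keep the packet, 1 = discard as junk.  Classes: 0 = data the
# user needs, 1 = junk.  Discarding needed data is costed heavily (10.0);
# keeping a junk packet costs 1.0; correct calls cost 0.
k, risks = min_risk_decision([0.4, 0.6], [[0.0, 1.0], [10.0, 0.0]])
print(k, risks)  # 0 [0.6, 4.0]: keep, even though "junk" is more probable
```

This is precisely the asymmetry the patent aims at: the maximum-posterior rule would discard the packet, while the minimum-risk rule keeps it because discarding needed data is far more costly.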
The optimized AdaBoost method of the present invention is as follows. The network data are input in matrix form and the weights are initialized; the cycles $m = 1, 2, \dots, M$ are executed, substituting the weight values $\omega_i$ into the AdaBoost framework. The minimum-risk Bayes classifier is trained, yielding a hypothesis $P: X \to y_i$; the classifier traverses the whole data set and marks the samples that $P$ classifies correctly and incorrectly; the misjudged samples are counted against the total number of samples, the classification error rate $\varepsilon_m$ of $P$ is computed and from it the weight $\alpha_m$; $\alpha_m$ is then used to update the training-sample weights to $D_{m+1}(i)$, and the next cycle begins, until the $M$ cycles end. Through the repeated cycles, the AdaBoost-based minimum-risk Bayes classification algorithm accumulates $M$ classifiers $P_m$ and obtains:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

The final $P(x)$ is the classifier obtained after the $M$ rounds of learning in the content-based deep-filtering algorithm.
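The training loop just described can be sketched end to end, with one substitution: a simple threshold stump stands in for the minimum-risk Bayes weak classifier, purely to keep the sketch self-contained. The weight initialization, the error rate, the classifier weight and the distribution update follow the formulas given in the embodiment:

```python
import math

def train_adaboost(xs, ys, M=10):
    """AdaBoost on 1-D data with threshold stumps as weak classifiers.

    xs: floats; ys: labels in {-1, +1}.  Returns the final classifier
    P(x) = sign(sum_m alpha_m * P_m(x)).
    """
    n = len(xs)
    D = [1.0 / n] * n                 # initial sample weights D_1(i) = 1/N
    ensemble = []                     # [(alpha_m, stump_m), ...]
    for _ in range(M):
        # weak learner: pick the (threshold, polarity) stump of least
        # weighted error over the current distribution D
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(d for d, x, y in zip(D, xs, ys)
                          if (pol if x >= t else -pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        eps, t, pol = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)     # keep the log finite
        alpha = 0.5 * math.log((1 - eps) / eps)   # alpha_m = (1/2)log((1-e)/e)
        stump = lambda x, t=t, pol=pol: pol if x >= t else -pol
        ensemble.append((alpha, stump))
        # D_{m+1}(i) = D_m(i) * exp(-alpha_m * y_i * P_m(x_i)) / Z_m
        D = [d * math.exp(-alpha * y * stump(x)) for d, x, y in zip(D, xs, ys)]
        z = sum(D)                                # normalizer Z_m
        D = [d / z for d in D]
    def classify(x):                  # P(x) = sign(sum_m alpha_m * P_m(x))
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return classify

clf = train_adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], M=5)
print(clf(0.5), clf(4.5))  # -1 1
```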
The present invention achieves the following beneficial effects:
Addressing the problems above, the present invention provides an efficient, stable data filtering processing method based on LTE signaling: five-tuple simple filtering preprocesses the data, the KNN text classification method then determines the nature of the source information, and finally the optimized AdaBoost method matches the data features against samples obtained by cluster analysis, realizing a complete data filtering processing method based on LTE signaling. The application scenario of the method is shown in Figure 1. The data processing comprises three stages: five-tuple simple data filtering of the LTE signaling data, determination of the nature of the data source by the KNN text classification method, and deep content filtering of the signaling data by the optimized AdaBoost method, which together complete the processing of the LTE data messages; the flow is shown in Figure 2. The invention attains higher filtering accuracy and system robustness than previous information filtering methods and can be applied directly in fields such as network security, network information data processing and big-data analysis.
Description of the drawings
Fig. 1: application scenario of the data filtering processing method based on LTE signaling.
Fig. 2: flow chart of the data filtering process.
Fig. 3: schematic diagram of the filtering processing unit.
Fig. 4: HASH rule setup.
Fig. 5: explanatory diagram of the KNN text classification method.
Embodiment
For a better understanding of the present invention, it is elaborated below with reference to the drawings and specific embodiments. The present invention proposes a new three-stage data filtering processing method based on LTE signaling. The main steps are described in detail below.
1. Five-tuple simple filtering
First, variables SIP, DIP, SP, DP and PT are defined to denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type in the five-tuple. The values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL (Policy Control List). PCL is an advanced function of the switching chip: a filtering rule table implemented on the chip.
A data message first passes through the ingress PCL (IPCL) engine. An IPCL table is generated from the message type and the PCL-ID, and lookup matching is performed in TCAM with this table. The matching conditions are: first, the PCL-ID must be identical; then a data structure is defined as the function parameter and the PCL specified by the user layer is passed in; as many rules as there are, that many secondary matches must be performed. The five-tuple filtering of the present invention uses masks: source IP address + SIP_MASK, destination IP address + DIP_MASK, source port number + SP_MASK, destination port number + DP_MASK, and transport protocol type + PT_MASK each serve as a filter condition. A message that hits a rule is saved and output; a message that misses the rules is discarded.
2. KNN text classification method
1) Form the new-text vector and the training-text vectors from the feature words, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words.
2) Compute the similarity between the new text and each text in the training set. The formula is:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$.
3) Take k to be the number of all texts in the database, then sort the training texts by their Sim values from high to low; whatever k equals, that many are taken from the front of the ranking.
4) Among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$.
5) Compare the class weights: whichever class is most represented among the k neighbors is the class the new text belongs to. The message is assigned to the class with the largest weight.
The above second-stage filtering of the data by the KNN text classification method determines the nature of the data source.
3. The optimized AdaBoost method performs deep content filtering.
The optimized AdaBoost method of the present invention is the AdaBoost-based minimum-risk Bayes deep-filtering algorithm. AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier; the two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm.
1) The network data that passed the previous two filtering stages are converted into matrix form as input, and the rectangular feature set is computed under the given rectangular feature prototypes.
2) The weights are initialized and the cycles $m = 1, 2, \dots, M$ are executed, substituting the weight values $\omega_i$ into the AdaBoost framework.
3) With the feature set as input, the minimum-risk Bayes classifier is trained, yielding a hypothesis $P: X \to y_i$; the classifier traverses the whole data set and marks the samples that $P$ classifies correctly and incorrectly; the misjudged samples are counted against the total number of samples, the classification error rate $\varepsilon_m$ of $P$ is computed, and from it the classifier weight:

$$\alpha_m = \frac{1}{2} \log\left(\frac{1 - \varepsilon_m}{\varepsilon_m}\right)$$

4) Using $\alpha_m$, the training-sample weights are updated to

$$D_{m+1}(i) = \frac{D_m(i)\, \exp\left(-\alpha_m y_i P_m(x_i)\right)}{Z_m}$$

and the next cycle begins, until the $M$ cycles end.
5) Through the repeated cycles, the AdaBoost-based minimum-risk Bayes classification algorithm accumulates $M$ classifiers $P_m$ and obtains:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

The final $P(x)$ is the classifier obtained after the $M$ rounds of learning in the content-based deep-filtering algorithm.

Claims (1)

1. A data filtering processing method based on LTE signaling, characterized in that:
the method uses five-tuple simple filtering for data preprocessing, then determines the nature of the source information with the KNN text classification method, and finally matches the data features against samples obtained by cluster analysis using the optimized AdaBoost method, realizing deep content filtering;
Its concrete steps comprise:
1. five-tuple simple filtering;
first, a simple single-pass filter, five-tuple filtering, is applied to the network data; variables SIP, DIP, SP, DP and PT denote, respectively, the source IP address, destination IP address, source port number, destination port number and transport protocol type used by the five-tuple filter, and together they form the basic elements of the five-tuple; in a session, the values of the masks SIP_MASK, DIP_MASK, SP_MASK, DP_MASK and PT_MASK are determined by the filtering policy and composed into a PCL, through which the information is filtered in a single pass;
2. the KNN text classification method determines the nature of the source;
the data remaining after five-tuple filtering are called the new text; the new text is subjected to a KNN text classification computation against the texts in a given training set: for the newly input example (the new text), the K examples closest to it are found in the training set, and the class to which the majority of those K examples belong is the class of the new text; that is, the new text and each training text are regarded as N-dimensional vectors, the similarity between the new text and every text in the training set is computed, the K most similar samples are found, and the category of the new text is determined from the weighted distances and the categories of those training texts;
the KNN algorithm proceeds as follows:
1) for the new text and the training texts, form the new-text vector and the training-text vectors from the feature words;
following the traditional vector space model, each text is formalized as a weighted feature vector in feature space, i.e. $D = D(T_1, W_1; T_2, W_2; \dots; T_N, W_N)$; the vector representations of the new text and the training texts are determined by the feature words;
2) compute the similarity between the new text and each text in the training set, with the formula:

$$\mathrm{Sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the new text, $d_j$ is the center vector of the $j$-th class, $M$ is the dimension of the feature vectors, and $W_{ik}$ is the $k$-th component of the vector $d_i$;
regarding the value of k: the KNN method can be viewed as estimating the posterior probability $P(\omega_i \mid x)$ from samples, so for a reliable estimate a larger k is better, since it improves the accuracy of the estimate; on the other hand, the k neighbors should be as close to the new text as possible; writing the posterior estimated from the neighbors as $P(\omega_i \mid x_1)$, only when the k neighbors are close to the new text can $P(\omega_i \mid x_1)$ approach $P(\omega_i \mid x)$ as closely as possible; in the past, k was always chosen from personal experience, which often led to inaccurate estimates: if k is too small, the number of neighbors is too small and classification precision drops; if k is too large, noisy data are easily drawn in and classification accuracy drops; extensive experiments have now shown that when k is taken as the number of all texts in the database, the classification result for the new text is the globally optimal solution;
3) among the k neighbors of the new text, compute the weight of each class in turn:

$$P(\bar{x}, C_j) = \sum_{\bar{d}_i \in \mathrm{KNN}} \mathrm{Sim}(\bar{x}, \bar{d}_i)\, y(\bar{d}_i, C_j)$$

where $\bar{x}$ is the feature vector of the new text, $\mathrm{Sim}(\bar{x}, \bar{d}_i)$ is the similarity formula above, and $y(\bar{d}_i, C_j)$ indicates whether training text $\bar{d}_i$ belongs to the data-source class $C_j$;
4) compare the class weights and assign the text to the class with the largest weight;
in summary, the KNN text classification method applies a second-stage filter to the data messages and determines the nature of the data source;
3. the optimized AdaBoost method performs deep content filtering;
this method proposes a minimum-risk Bayes deep-filtering algorithm based on AdaBoost: AdaBoost serves as the training framework of the classifier, and the weak classifier inside AdaBoost is replaced by the minimum-risk Bayes classifier, which acts as AdaBoost's component classifier; the two algorithms are thus combined into the AdaBoost-based minimum-risk Bayes deep-filtering algorithm;
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then assemble these weak classifiers into a final, strongest classifier (a strong classifier); the algorithm works by changing the data distribution: the weight of each sample is set according to whether it was classified correctly in each previous training round and the overall accuracy of the last round; the newly revised weights are passed to the next component classifier for training; finally the classifiers obtained from all training rounds are fused and the strongest final classifier is output;
let the training sample set be $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_i, y_i), \dots\}$, $x_i \in X$, $y_i \in Y$, where $X$ and $Y$ correspond respectively to the positive-example and negative-example samples, $M$ is the maximum number of training cycles, the error rate of a classifier is denoted $\varepsilon_m$, and the minimum error rate is denoted $\varepsilon_{\min}$;
in the original AdaBoost algorithm, the individual decisions are integrated into a final decision by a weighted majority vote:

$$P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]$$

where $P_m(x)$ is the decision function of the $m$-th classifier; AdaBoost appropriately integrates the mistakes of the learned weak classifiers: every iteration updates the weights, decreasing the weights of data the weak classifier classifies well and increasing the weights of data it classifies poorly; the final classifier is the weighted combination of the weak classifiers;
The Bayesian classification algorithm starts from the prior probability model of an object and uses Bayes' formula to compute its posterior probability, i.e., which topic class the data source belongs to; the class with the maximum posterior probability is selected as the topic of the data source. From the set of training source data, Bayesian theory yields the probability of each data item belonging to each class, from which the Bayesian model is constructed. Naive Bayes has the minimum error rate among Bayesian classification models, requires few estimated parameters, and is simple to implement. The minimum-risk Bayes classification algorithm addresses the error-rate problem on the basis of Bayes and naive Bayes, and is an optimization in the minimum-error-rate sense. In this method, if the system judges data to be "sensitive data" and filters it out as junk data, but that data is exactly the content the user needs, a large loss is caused to the user. By determining the topic of the data source with minimum-risk Bayes classification and filtering according to different topic-filtering strategies, all classification errors are taken into account and the risk of misjudgment can be greatly reduced;
Given P(ω_i) and P(X|ω_i), i = 1, 2, …, c, and an X to be identified (a network packet to be filtered), the posterior probability is computed by Bayes' formula:
P(\omega_j \mid X) = \frac{P(X \mid \omega_j)\, P(\omega_j)}{\sum_{i=1}^{c} P(X \mid \omega_i)\, P(\omega_i)}, \quad j = 1, 2, \ldots, c
where P(ω_i) is the prior probability, obtained from analysis of the user's past demand for network data; P(ω_j|X) is the posterior probability, i.e., the probability corrected after the information X is observed; and P(X|ω_i) is the class-conditional probability, used to judge, from past experience of the user's demand for network data, whether the received X to be identified is junk network data;
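A minimal sketch of this posterior computation, directly mirroring the formula above (the function name, argument names, and example numbers are illustrative):

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: P(w_j | X) = P(X|w_j) P(w_j) / sum_i P(X|w_i) P(w_i).

    priors[i]      -- P(w_i), prior for class i (from past user-demand analysis)
    likelihoods[i] -- P(X | w_i), class-conditional probability of packet X
    Returns the list of posteriors P(w_j | X), which sums to 1.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # the denominator: total probability of observing X
    return [j / evidence for j in joint]
```

For example, with equal priors [0.5, 0.5] and likelihoods [0.8, 0.2], the posteriors come out as [0.8, 0.2]: with no prior preference, the class-conditional evidence alone decides.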
Denote the loss caused by a decision on the data as α; the decision rules are defined as:
1) when junk network data is judged as junk data, no loss is caused, α = 0;
2) when junk network data is judged as valid data, the loss is negligible, α → 0;
3) when network data needed by the user is judged as junk data, the loss caused is immeasurable, 0 < α < ∞;
According to the posterior probabilities computed above and the decision rules just set, the conditional risk of taking decision d_i, i = 1, 2, …, a, is calculated as follows:
R(d_i \mid X) = \sum_{j=1}^{c} \alpha(d_i, \omega_j)\, P(\omega_j \mid X), \quad i = 1, 2, \ldots, a
Since the loss α after a misjudgment should be driven as close to 0 as possible, the a conditional risk values R(d_i|X), i = 1, 2, …, a, obtained above are compared, and the decision that minimizes the conditional risk is found and denoted d_k; d_k is exactly the minimum-risk Bayes classification decision;
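The conditional-risk minimization can be illustrated as follows. The loss-matrix values below are invented for illustration only, following the decision rules above: zero cost for correctly filtering junk, a small cost for letting junk through, and a large cost for discarding data the user needs:

```python
def min_risk_decision(loss, post):
    """Pick the decision d_k minimizing R(d_i|X) = sum_j loss[i][j] * P(w_j|X).

    loss[i][j] -- alpha(d_i, w_j): cost of decision d_i when the true class is w_j
    post[j]    -- posterior P(w_j | X)
    Returns (index of the minimum-risk decision, list of all conditional risks).
    """
    risks = [sum(l * p for l, p in zip(row, post)) for row in loss]
    k = min(range(len(risks)), key=risks.__getitem__)
    return k, risks

# Classes: w_0 = junk, w_1 = user data. Decisions: d_0 = filter out, d_1 = keep.
loss = [[0, 5],   # filtering: free if truly junk, costly if it was user data
        [1, 0]]   # keeping:  small cost if junk slips through, free otherwise
k, risks = min_risk_decision(loss, [0.7, 0.3])
```

With posteriors [0.7, 0.3] the packet would be labelled junk under maximum-posterior classification, yet the minimum-risk decision is to keep it (risk 0.7 versus 1.5), because discarding user data is far more expensive: precisely the risk-averse behavior the method aims for.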
The AdaBoost method optimized by this approach is as follows:
Input the network data in matrix form and initialize the weights ω_i = 1/n, i = 1, 2, …, n; execute the loop m = 1, 2, …, M: substitute the values ω_i into the AdaBoost framework and train with the minimum-risk Bayes classifier, obtaining the hypothesis P: X → Y; traverse the whole data set with the classifier and mark the samples P classifies correctly and those it misclassifies; from the number of misjudged samples out of the total number of samples, calculate the classification error rate ε_m of P and from it the classifier weight α_m; update the training-sample weights with the newly obtained values and start the next round of the loop, until the M-th round ends. Through repeated looping, the minimum-risk Bayes classification algorithm based on AdaBoost accumulates M classifiers P_m, and the algorithm yields:
P(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m P_m(x)\right]
The final P(x) is exactly the final classifier obtained after M rounds of training in the content-based deep-filtering algorithm.
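The boosting loop described above can be sketched as follows. This is a minimal illustration of the AdaBoost mechanics only: for brevity the weak learner is a one-feature decision stump rather than the patent's minimum-risk Bayes classifier, and the weight formulas (α_m = ½ ln((1−ε_m)/ε_m) and the exponential re-weighting) are the standard AdaBoost ones, not quoted from the patent:

```python
import numpy as np

def adaboost_train(X, y, M=10):
    """Boosting loop sketch: labels y in {-1, +1}, X an (n, d) feature matrix.

    Each round fits the best weighted decision stump, computes its error rate
    eps_m and vote weight alpha_m, and re-weights the samples so misclassified
    ones count more in the next round.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)              # initial sample weights w_i = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        best = None                      # (error, feature, threshold, polarity)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = np.where(X[:, j] < t, -s, s)
                    err = w[pred != y].sum()     # weighted error rate eps_m
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # classifier weight alpha_m
        pred = np.where(X[:, j] < t, -s, s)
        w *= np.exp(-alpha * y * pred)           # boost misclassified samples
        w /= w.sum()                             # renormalize the distribution
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Final strong classifier: sign(sum_m alpha_m * P_m(x))."""
    score = sum(a * np.where(X[:, j] < t, -s, s)
                for a, (j, t, s) in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)
```

Substituting the minimum-risk Bayes classifier for the stump only changes the inner "fit a weak learner on the weighted samples" step; the error-rate computation, α_m weighting, and sample re-weighting stay the same.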
CN201510694999.9A 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings Expired - Fee Related CN105306296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510694999.9A CN105306296B (en) 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings

Publications (2)

Publication Number Publication Date
CN105306296A true CN105306296A (en) 2016-02-03
CN105306296B CN105306296B (en) 2018-10-12

Family

ID=55203076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510694999.9A Expired - Fee Related CN105306296B (en) 2015-10-21 2015-10-21 A kind of data filtering processing method based on LTE signalings

Country Status (1)

Country Link
CN (1) CN105306296B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN104598586A (en) * 2015-01-18 2015-05-06 北京工业大学 Large-scale text classifying method
CN104750850A (en) * 2015-04-14 2015-07-01 中国地质大学(武汉) Feature selection method based on information gain ratio


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Mingfeng: "A Survey of Bayesian Methods for Spam Filtering", Computer Applications and Research (《计算机应用与研究》) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107332704A (en) * 2017-07-03 2017-11-07 南京华苏科技有限公司 Assess the method and system that high-speed railway mobile subscriber uses LTE service quality
CN107908720A (en) * 2017-11-14 2018-04-13 河北工程大学 A kind of patent data cleaning method and system based on AdaBoost algorithms
CN108009249A (en) * 2017-12-01 2018-05-08 北京中视广信科技有限公司 For the comment spam filter method of the fusion user behavior rule of unbalanced data
CN108009249B (en) * 2017-12-01 2020-08-18 北京中视广信科技有限公司 Spam comment filtering method for unbalanced data and fusing user behavior rules
CN108091134A (en) * 2017-12-08 2018-05-29 北京工业大学 A kind of conventional data set creation method based on mobile phone signaling position track data
CN108091134B (en) * 2017-12-08 2020-09-25 北京市交通运行监测调度中心 Universal data set generation method based on mobile phone signaling position track data
WO2022087806A1 (en) * 2020-10-27 2022-05-05 Paypal, Inc. Multi-phase training techniques for machine learning models using weighted training data
AU2020474630B2 (en) * 2020-10-27 2024-01-25 Paypal, Inc. Multi-phase training techniques for machine learning models using weighted training data
CN112784910A (en) * 2021-01-28 2021-05-11 武汉市博畅软件开发有限公司 Deep filtering method and system for junk data
CN116192997A (en) * 2023-02-21 2023-05-30 上海兴容信息技术有限公司 Event detection method and system based on network flow
CN116192997B (en) * 2023-02-21 2023-12-01 兴容(上海)信息技术股份有限公司 Event detection method and system based on network flow

Also Published As

Publication number Publication date
CN105306296B (en) 2018-10-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181012

Termination date: 20211021