CN104486308A

CN104486308A - Design method for high-speed multi-dimension message classification

Info

Publication number: CN104486308A
Application number: CN201410730111.8A
Authority: CN
Inventors: 宁卓; 孙知信; 石伟; 胡婷
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2015-04-01

Abstract

The invention discloses a design method for high-speed multi-dimensional message classification. The method uses backbone network traffic characteristics to optimize high-speed multi-dimensional message classification, and reduces the optimization problem to finding the optimal flow classification tree that minimizes the dynamic traffic classification cost within the current time slice. The solution for the optimal flow classification tree comprises four parts: a hierarchical division method that optimizes the message classification tree using the traffic characteristics of a large-scale backbone network; a classification field entropy computation method that measures the classification capability of each classification field on the current traffic; a lookup method that uses flow length characteristics to reduce the cost of classification node lookup and copying and thereby speeds up message classification; and an update method that dynamically updates the optimal flow classification tree.

Description

A design method for high-speed multi-dimensional message classification

Technical field

The present invention relates to a design method for high-speed multi-dimensional message classification and belongs to the technical field of network security.

Background

The message classification algorithm of a misuse-based Intrusion Detection System (IDS) is a high-dimensional classification algorithm that uses multiple classification fields. For convenience of discussion, let d be the number of classification dimensions and n the number of rules.

Existing classification algorithms fall into two categories, hardware and software. Hardware algorithms exploit hardware parallelism to raise processing speed and achieve a good trade-off between time and space consumption. Although they are fast, they are expensive, especially when the IDS rule base is updated frequently and the number of classification dimensions is large (more than 10), and poor scalability is another drawback that cannot be ignored. The most representative software algorithm is the P_Hicuts algorithm proposed by Gong Jian et al., which makes two improvements that address the space explosion and decision tree imbalance of the classical Hicuts algorithm: 1. it adopts non-uniform cutting to reduce the number of classification nodes; 2. it lifts covering rules up the tree so that they no longer take part in grouping, which suppresses the exponential space growth caused by such rules, lowers the height of the classification tree, and reduces the average classification complexity. Message classification algorithms are now essentially mature: their average time complexity is theoretically O(d), and the various improvements that have been proposed keep the space used far below the worst-case space complexity O(n^d).

However, the software algorithms above are all traditional message classification algorithms. Their optimizations consider only the static characteristics of the IDS rule base and completely ignore the characteristics of the entity that actually accesses the classification tree, namely the network traffic, so they cannot use dynamic traffic characteristics to optimize the classification tree structure. The WIND algorithm proposed by Sinha, S. et al. was the first to use the dynamic characteristics of traffic to guide classification tree construction. In essence it improves the splitting strategy for classification nodes in the tree and is a heuristic for classification node cutting. To this end, WIND uses the number of rules M that a particular attribute value of a classification field in the traffic can exclude to decide whether to create a separate classification node for that attribute value. The method is still far from mature, in three respects:

1. WIND shows only experimentally, on small sample data sets, that a classification tree built from current traffic characteristics raises message classification speed to 1.3 to 1.7 times that of the unoptimized Snort classification tree and uses nearly 15% less memory, and it concludes only intuitively that combining dynamic traffic features improves classification tree performance. It does not point out that it is the long-range dependence and self-similarity of network traffic that guarantee that a classification tree generated from current traffic features is the "best" classification tree for the traffic over a period of time.

2. The node cutting heuristic proposed by WIND is suitable only for small sample data; for real traffic on a large-scale backbone network, the space occupied by the classification tree is unacceptably large.

3. WIND states that traffic characteristics change dynamically and that the optimal classification tree must adapt to these changes to remain "optimal", but it proposes no adaptive dynamic update strategy of its own. Key questions, such as how long a traffic sample to use, which dynamic traffic features can be used to improve the lookup speed of the classification tree, and how to update the classification tree structure, remain unresolved.

The present invention solves the problems above.

Summary of the invention

The object of the invention is to provide a design method for high-speed multi-dimensional message classification. The method uses backbone traffic characteristics to optimize high-speed multi-dimensional message classification and reduces the problem to finding the optimal flow classification tree that minimizes the dynamic traffic classification cost within the current time slice. The solution for the optimal flow classification tree consists of: a hierarchical division method that optimizes the message classification tree using the traffic characteristics of a large-scale backbone network; a classification field entropy computation method that measures the classification capability of each classification field on the current traffic; a lookup method that uses flow length characteristics to reduce the cost of classification tree node lookup and copying and so raises message classification speed; and an update method that dynamically updates the optimal flow classification tree.

The technical solution adopted by the invention to solve its technical problem is a design method for high-speed multi-dimensional message classification, comprising the following steps:

Step 1: Use a MultiBloom Filter to compute, in real time, the flow length of the flow F to which message Pi belongs.

Step 2: If message Pi belongs to a short flow, Pi has no corresponding rule cache space and must still search the classification tree with the conventional P_Hicuts lookup method. The rule set of each classification tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node of the classification tree, Pi together with its label set Ri is copied into the common message buffer.

Step 3: If the flow length of the flow F to which message Pi belongs has just reached the long-flow threshold, create a new item for this new long flow in the long-flow hash table. The item holds four pieces of information, a flow identifier and three pointers: the flow identifier FlowID, identified by four-tuple information; the classification tree intermediate node pointer p_midNodePointer of Pi, which points to the intermediate node reached after Pi has searched the classification tree using its 5-tuple information; the classification tree leaf node pointer p_finalNodePointer of Pi, which points to the leaf node last visited when Pi searched the classification tree; and the rule label pointer p_rule, which points to the rule label of Pi in common memory.

Step 4: If message Pi is a subsequent message of long flow F, Pi searches downward for its classification tree leaf node directly from the intermediate node pointer p_midNodePointer saved in the long-flow hash table. If this leaf node is the same as the leaf node of the other messages in the flow, the rule label need not be copied again, and Pi is added directly to the flow's shared memory buffer pool through p_rule; otherwise, go to Step 5.

Step 5: The leaf node reached by message Pi differs from the one visited by the other messages in the flow, so backtracking occurs. Starting from the intermediate node pointer p_midNodePointer, the classification tree is searched with the conventional P_Hicuts lookup method. The rule set of each classification tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node, Pi together with its label set Ri is copied into the common message buffer.
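Steps 1 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain `Counter` stands in for the MultiBloom Filter, the tree is a toy decision tree rather than a real P_Hicuts tree, `LONG_FLOW_THRESHOLD` is an arbitrary value, and the `mid_label` field (the rules gathered above the intermediate node) is a hypothetical addition so that Step 5 can rebuild a label without restarting at the root.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

LONG_FLOW_THRESHOLD = 3   # assumed cut-off (in messages); the text does not give one
FIVE_TUPLE = ("src_ip", "dst_ip", "src_port", "dst_port", "proto")


@dataclass
class Node:
    test_field: str = ""                           # field tested here; "" marks a leaf
    rules: frozenset = frozenset()                 # rules stored at this node
    children: dict = field(default_factory=dict)   # field value -> child node
    default: Optional["Node"] = None               # child taken when no value matches


@dataclass
class FlowEntry:                                   # one item of the long-flow hash table
    flow_id: tuple
    p_midNodePointer: Node
    p_finalNodePointer: Node
    p_rule: frozenset
    mid_label: frozenset                           # hypothetical extra field (see lead-in)


def walk(node, msg, stop_after_five_tuple=False):
    """Top-down search; returns (node reached, rules collected on the way)."""
    label = set(node.rules)
    while node.test_field:
        if stop_after_five_tuple and node.test_field not in FIVE_TUPLE:
            break                                  # first non-5-tuple test: intermediate node
        child = node.children.get(msg[node.test_field], node.default)
        if child is None:
            break
        node = child
        label |= node.rules
    return node, label


class Classifier:
    def __init__(self, root):
        self.root = root
        self.flow_len = Counter()    # stand-in for the MultiBloom Filter (Step 1)
        self.long_flows = {}         # long-flow hash table keyed by FlowID
        self.copies = 0              # number of rule-label copies performed

    def classify(self, msg):
        fid = (msg["src_ip"], msg["dst_ip"], msg["src_port"], msg["dst_port"])
        self.flow_len[fid] += 1                                   # Step 1
        entry = self.long_flows.get(fid)
        if entry is None:                                         # Step 2: full search
            mid, mid_label = walk(self.root, msg, stop_after_five_tuple=True)
            leaf, low_label = walk(mid, msg)
            label = frozenset(mid_label | low_label)
            self.copies += 1
            if self.flow_len[fid] >= LONG_FLOW_THRESHOLD:         # Step 3: new long flow
                self.long_flows[fid] = FlowEntry(
                    fid, mid, leaf, label, frozenset(mid_label))
            return label
        leaf, low_label = walk(entry.p_midNodePointer, msg)       # Step 4: resume at mid node
        if leaf is entry.p_finalNodePointer:
            return entry.p_rule                                   # shared label, no new copy
        self.copies += 1                                          # Step 5: backtrack
        return frozenset(entry.mid_label | low_label)


# Toy tree: DstPort (5-tuple field) is tested first, then TcpFlags.
leaf_syn = Node(rules=frozenset({"r_syn"}))
leaf_other = Node(rules=frozenset({"r_any"}))
flags = Node("tcp_flags", frozenset({"r_web"}), {"SYN": leaf_syn}, leaf_other)
root = Node("dst_port", frozenset(), {80: flags}, Node())

clf = Classifier(root)
ack = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 1234,
       "dst_port": 80, "proto": "tcp", "tcp_flags": "ACK"}
labels = [clf.classify(ack) for _ in range(5)]
```

In this toy run the first three messages each pay for a full search and label copy, while the fourth and fifth reuse the label cached for the long flow, which is the saving the steps describe.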

Beneficial effects:

1. The invention uses the characteristics of dynamic traffic to improve the IDS classification tree structure, reduces the complexity of message accesses to the classification tree, and lowers the memory overhead of the algorithm while raising classification speed.

2. The invention remedies the defects and shortcomings of the WIND algorithm and improves the dynamic update strategy of the classification tree.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention.

Fig. 2 compares the entropy distributions of the five classification fields.

Fig. 3 compares the per-minute entropy distributions of the other classification fields.

Fig. 4 compares the classification speed of the three message classification algorithms.

Fig. 5 compares the memory usage of the three message classification algorithms.

Fig. 6 shows the distribution of the number of long flows per minute on the backbone network.

Fig. 7 shows, per hour, the percentage of flows that are long flows and the percentage of messages they carry.

Embodiment

The invention is further described below with reference to the accompanying drawings.

Embodiment 1

1. Classification field entropy computation method

Definition of classification field entropy:

Let X = {x_1, x_2, ..., x_i, ..., x_n} denote the current traffic, where each x_i is a message; X is also called the message set. Let R = {r_1, r_2, ..., r_i, ..., r_m} be the attack rule set of the IDS, and let A = {a_1, a_2, ..., a_j, ..., a_u} be the set of classification fields. Non-uniform cutting of R on a_j forms a classification whose classes (classification nodes) are denoted C_{a_j}^{i}, i = 1, ..., t+1. The classification field entropy of field a_j, for classifying the current traffic X against the rule set, is then defined as:

H_{X}(a_{j}) = - \sum_{i = 1}^{t + 1} p(C_{a_{j}}^{i}) \log_{2} p(C_{a_{j}}^{i}),

where

p(C_{a_{j}}^{i}) = \frac{Occr(C_{a_{j}}^{i})}{n},

Occr(C_{a_{j}}^{i}) is the number of times classification node C_{a_{j}}^{i} is accessed by the messages in X, and n is the total number of messages.

The classification field entropy H_X(a_j) takes values in [0, log_2(t+1)]. When H_X(a_j) is 0, classifying by a_j makes the access distribution of the current traffic X over the classification nodes maximally concentrated, that is, all messages access the same classification node. When H_X(a_j) is log_2(t+1), the access distribution of the current traffic X is maximally dispersed, and the messages in X access every classification node the same number of times:

Occr(C_{a_{j}}^{1}) = Occr(C_{a_{j}}^{2}) = \dots = Occr(C_{a_{j}}^{t+1}).
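The definition can be checked with a short Python sketch. It computes the entropy from per-node access counts, writing the sum in the equivalent form of p times log2(1/p). The cut points in `CUTS` and the toy port list are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
import bisect
import math
from collections import Counter


def field_entropy(node_hits):
    """Entropy, in bits, of the access distribution over classification nodes.

    node_hits holds Occr(C^i) for each node; it must contain at least one hit.
    """
    hits = list(node_hits)
    n = sum(hits)                                  # total number of messages
    probs = [c / n for c in hits if c > 0]         # nodes never accessed contribute 0
    # sum of p * log2(1/p) equals -sum of p * log2(p)
    return sum(p * math.log2(1 / p) for p in probs)


CUTS = [80, 81, 1024]   # hypothetical non-uniform cut points on DstPort


def node_of(dst_port):
    """Index of the classification node (port interval) a message falls into."""
    return bisect.bisect_right(CUTS, dst_port)


ports = [80, 80, 443, 22]                          # toy traffic, not measured data
hits = Counter(node_of(p) for p in ports)          # Occr per classification node
h = field_entropy(hits.values())
```

With four nodes (t + 1 = 4), counts of [8, 0, 0, 0] give the lower bound 0 and counts of [2, 2, 2, 2] give the upper bound log_2 4 = 2, matching the range stated above.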

Using this definition, the invention examines the classification capability of the 14 classification fields available in current Snort; their value ranges in the experiment and their ranking by classification capability are listed in Table 1. To measure how the classification capability of these fields changes continuously, NetFlow data from 8 consecutive hours on February 24, 2009, collected on a 10 Gbps link at the border of the CERNET Jiangsu provincial network with a sampling rate of 1/256, are used as experimental data. The per-minute values and trends of 5 classification field entropies (DstPort, SrcPort, DstIp, TcpFlags and SrcIp) under real traffic are shown in Fig. 2. There are three reasons for this choice: 1. NetFlow data cover a large traffic volume over a long duration and so better reflect the macroscopic trend of the entropy changes; 2. research shows that random sampling has little effect on entropy measurement, so sampled data preserve the accuracy of the entropy measurement while reducing processing complexity; 3. the time slice is set to one minute because the classification tree is constructed on a time scale of minutes, so entropy changes on scales shorter than a minute are meaningless for optimizing the tree structure. NetFlow data, however, can measure the entropy changes of only these five classification fields (the five-tuple), because they contain no information on the other fields; the entropy of the other classification fields has to be computed from a trace of the corresponding period. Because the backbone trace is very large, only the per-minute entropy changes of the other classification fields during the first 5 hours on the same 10 Gbps link are examined, as shown in Fig. 3.
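The per-minute measurement on sampled records can be sketched as follows. This is an illustrative outline only: the record layout `(timestamp_seconds, field_value)` is an assumption, each distinct field value is treated as its own classification node for simplicity, and a seeded pseudo-random sampler stands in for whatever sampling the NetFlow exporter applied.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(counts):
    """Entropy in bits of a non-empty collection of positive counts."""
    counts = list(counts)
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts)


def per_minute_entropy(records, sample_rate=1 / 256, seed=0):
    """Map minute index -> entropy of the sampled field values in that minute."""
    rng = random.Random(seed)
    slices = defaultdict(Counter)                  # minute index -> value counts
    for ts, value in records:
        if rng.random() < sample_rate:             # keep roughly sample_rate of records
            slices[int(ts // 60)][value] += 1      # one-minute time slices
    return {minute: entropy(c.values()) for minute, c in sorted(slices.items())}


# Toy records (seconds, DstPort); sample_rate=1.0 keeps every record.
toy = [(0, 80), (30, 443), (61, 80), (62, 80)]
result = per_minute_entropy(toy, sample_rate=1.0)
```

For the toy input, minute 0 sees two equally frequent values (1 bit) and minute 1 sees a single value (0 bits); minutes in which no record is sampled are simply absent from the result.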

Table 1: Ranking of the classification fields by classification capability

Fig. 2 and Fig. 3 show the continuous value distributions of the 14 classification fields above. Fig. 2 shows the per-minute entropy of the five-tuple information (source/destination address, source/destination port and protocol) over 8 consecutive hours. Fig. 3 shows the per-minute entropy of the other classification fields over 5 hours. Analyzing the results in Fig. 2 and Fig. 3 yields the value ranges of the classification field entropies listed in Table 1.

The experimental results show the following. 1. Only the 5 classification fields whose entropy exceeds 0.5 really classify the traffic, and the ordering of their entropy values hardly changes over time, which shows that their classification capabilities differ clearly. The ordering is H(DstPort) > H(SrcPort) > H(DstIp) > H(TcpFlags) > H(SrcIp). A special case is that H(DstIp) and H(TcpFlags) both lie between 1.1 and 1.2 and almost coincide; because TcpFlags has more abrupt change points, H(DstIp) > H(TcpFlags) is taken. 2. The other classification fields have weak classification capability. In particular, the entropy of TCPAck and ICMPSeq is almost always 0 for hours at a time, which means that for most of the experiment all traffic accessed a single classification node, so TCPAck and ICMPSeq have almost no classification capability.

Embodiment 2

As shown in Fig. 1, the invention provides a design method for high-speed multi-dimensional message classification comprising Steps 1 to 5 described above.

The invention reduces the number and time of memory copies for long-flow messages, so its effect depends on the number of long flows in the traffic and on the share of long-flow messages in the total message count. Large-scale backbone networks have a typical heavy-tailed flow length distribution, which guarantees the effectiveness of the algorithm. The experimental data of the invention show that, on a 10 Gbps backbone network, processing long-flow information on the order of 10^3 flows per minute on average is enough to speed up the classification of messages on the order of 10^6 on average.

2. Implementation of the FCS algorithm

The invention places no special requirements on the system; the implemented algorithm is called FCS. The experiments use two servers, each with dual Intel Xeon 3.06 GHz processors and 2 GB of memory, one as the traffic generator and one as the message classifier. Two message data sources are used as simulated traffic input: the public DARPA 1999 week-4 test data set, and a 1/4 load-balanced trace from a 10 Gbps backbone link in the East China (North) region, which is equivalent to a 2.5 Gbps link with an actual average traffic of about 1.2 Gbps. Although DARPA and the trace have different rates, the traffic generator replays both at a fixed rate of 150 Kpps, simulating an average of 600 Mbps, to compare the message classification speed of the three algorithms FCS, P_Hicuts and WIND. Classification speed is measured as the average time spent classifying every 10K messages, and every result is the mean of 5 experiments. The experiments use the official Snort rule base 3.1.5, which contains 2579 rules, and the 14 classification fields described above.
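As a quick sanity check on these figures (a worked calculation assuming decimal units, not a statement from the original text), the replay rate and the simulated bandwidth together imply the average message size, and the replay rate fixes the arrival time of each batch of 10K messages:

\bar{s} = \frac{600 \times 10^{6} \text{ bit/s}}{150 \times 10^{3} \text{ messages/s}} = 4000 \text{ bits} = 500 \text{ bytes},

T_{10K} = \frac{10^{4} \text{ messages}}{150 \times 10^{3} \text{ messages/s}} \approx 66.7 \text{ ms}.

Under these assumptions, a classifier keeps up with the generator only if its average classification time per 10K messages stays below about 66.7 ms.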

Fig. 4 compares the processing speed of the three classification algorithms. As the figure shows, FCS classifies messages faster than the other two algorithms on both the DARPA data set and the backbone trace. The speed-up FCS achieves on the trace is clearly larger than the one it achieves on the small synthetic DARPA99 data set.

Fig. 5 compares the run-time memory usage of the three message classification algorithms on the two test data sets, the trace and DARPA.

The main improvement of the invention is to reduce the number and time of memory copies for long-flow messages, so the size of the improvement depends on the number of long flows in the traffic and on the share of long-flow messages in the total message count. Fig. 6 shows the number of long flows per minute in one hour of traffic on a backbone network; the average is on the order of 10^3. Fig. 7 shows, for the same period, the percentage of all flows that are long flows and the percentage of all messages that belong to long flows. The results show that the long flows, which make up 10% to 20% of all flows, carry 50% to 99% of the message load; this heavy-tailed flow length distribution guarantees the effectiveness of the invention.
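The two percentages plotted in Fig. 7 can be computed from a list of per-flow message counts, as in the following sketch. The long-flow threshold and the toy flow lengths are assumptions made for illustration; they are not the values or data used in the experiments.

```python
def long_flow_share(flow_lengths, threshold):
    """Return (fraction of flows that are long, fraction of messages they carry)."""
    long_lengths = [n for n in flow_lengths if n >= threshold]
    flow_share = len(long_lengths) / len(flow_lengths)
    message_share = sum(long_lengths) / sum(flow_lengths)
    return flow_share, message_share


# Toy heavy-tailed sample: eight single-message flows and two 46-message flows.
toy_lengths = [1] * 8 + [46, 46]
shares = long_flow_share(toy_lengths, threshold=10)   # threshold is an assumed value
```

In this toy sample 20% of the flows are long and they carry 92% of the messages, the kind of imbalance that lets a small long-flow hash table cover most of the message load.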

Claims

1. A design method for high-speed multi-dimensional message classification, characterized in that the method comprises the following steps:

Step 1: use a MultiBloom Filter to compute, in real time, the flow length of the flow F to which message Pi belongs;

Step 2: if message Pi belongs to a short flow, Pi has no corresponding rule cache space and must still search the classification tree with the conventional P_Hicuts lookup method; the rule set of each classification tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node of the classification tree, Pi together with its label set Ri is copied into the common message buffer;

Step 3: if the flow length of the flow F to which message Pi belongs has just reached the long-flow threshold, create a new item for this new long flow in the long-flow hash table, the item holding four pieces of information, a flow identifier and three pointers: the flow identifier FlowID, identified by four-tuple information; the classification tree intermediate node pointer p_midNodePointer of Pi, which points to the intermediate node reached after Pi has searched the classification tree using its 5-tuple information; the classification tree leaf node pointer p_finalNodePointer of Pi, which points to the leaf node last visited when Pi searched the classification tree; and the rule label pointer p_rule, which points to the rule label of Pi in common memory;

Step 4: if message Pi is a subsequent message of long flow F, Pi searches downward for its classification tree leaf node directly from the intermediate node pointer p_midNodePointer saved in the long-flow hash table; if this leaf node is the same as the leaf node of the other messages in the flow, the rule label need not be copied again, and Pi is added directly to the flow's shared memory buffer pool through p_rule; otherwise, go to Step 5;

2. The design method for high-speed multi-dimensional message classification according to claim 1, characterized in that the method uses backbone traffic characteristics to optimize high-speed multi-dimensional message classification and reduces the problem to finding the optimal flow classification tree that minimizes the dynamic traffic classification cost within the current time slice; the solution for the optimal flow classification tree comprises a hierarchical division method that optimizes the message classification tree using the traffic characteristics of a large-scale backbone network, a classification field entropy computation method that measures the classification capability of each classification field on the current traffic, and the use of flow length characteristics to reduce the cost of classification tree node lookup and copying.

3. The design method for high-speed multi-dimensional message classification according to claim 2, characterized in that the classification fields that play a classification role for the traffic are the 5 classification fields whose entropy exceeds 0.5, and the ordering of their entropy values does not change over time.

4. The design method for high-speed multi-dimensional message classification according to claim 1, characterized in that the method exploits the heavy-tailed nature of large-scale backbone traffic, namely that the long flows, which are a minority of all flows, carry most of the message load; the long flows, which make up 10% to 20% of all flows, carry 50% to 99% of the message load.