CN104486308A - Design method for high-speed multi-dimension message classification - Google Patents

Info

Publication number
CN104486308A
Authority
CN
China
Prior art keywords
message
classification
stream
classification tree
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410730111.8A
Other languages
Chinese (zh)
Inventor
宁卓
孙知信
石伟
胡婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201410730111.8A
Publication of CN104486308A

Abstract

The invention discloses a design method for high-speed multi-dimensional message classification. The method uses backbone network traffic characteristics to optimize high-speed multi-dimensional message classification, and reduces the optimization problem to computing the optimum flow classification tree that minimizes the dynamic traffic classification cost in the current time slice. The method for solving the optimum flow classification tree has four parts: a layer division method that optimizes the message classification tree using the traffic characteristics of a large-scale backbone network; a classification-field (domain) entropy calculation method that measures the classification capability of each classification field on the current traffic; a lookup method that uses flow-length characteristics to reduce the cost of classification-node lookup and copying and thereby speeds up message classification; and an update method that dynamically updates the optimum flow classification tree.

Description

A design method for high-speed multi-dimensional message classification
Technical field
The present invention relates to a design method for high-speed multi-dimensional message classification and belongs to the technical field of network security.
Background art
The message classification algorithm of a misuse-based intrusion detection system (IDS) is a high-dimensional classification algorithm that uses multiple classification fields. For convenience of discussion, let d denote the number of classification dimensions and n the number of rules.
Existing classification algorithms fall into two categories, hardware and software. Hardware algorithms exploit hardware parallelism to raise processing speed and achieve a good trade-off between time and space consumption. However, although hardware algorithms are fast, they are expensive, and their poor scalability is a further drawback that cannot be ignored, especially when the IDS rule base is updated frequently and the number of classification dimensions is large (more than 10). The most representative software algorithm is the P_Hicuts algorithm proposed by Gong Jian et al. It addresses the space explosion and decision-tree imbalance of the classic Hicuts algorithm with two improvements: (1) non-uniform cutting, which reduces the number of classification nodes; (2) lifting covering rules so that they no longer take part in partitioning, which suppresses the exponential space growth caused by such rules, lowers the height of the classification tree, and reduces the average classification complexity. Message classification algorithms are now largely mature: their theoretical average time complexity is O(d), and the various proposed improvements keep space usage far below the worst-case space complexity O(n^d).
However, all of the above software algorithms are traditional message classification algorithms. Their optimizations consider only the static characteristics of the IDS rule base and completely ignore the characteristics of the network traffic, which is the entity that actually accesses the classification tree; they therefore cannot use dynamic traffic characteristics to optimize the tree structure. The WIND algorithm proposed by Sinha, S. et al. innovatively uses the dynamic characteristics of traffic to guide classification-tree construction. In essence it improves the splitting strategy for classification nodes and is a node-cutting heuristic: WIND uses the number of rules M that a particular value of a classification field in the traffic can exclude to decide whether that value deserves its own classification node. The method is still far from mature, in three respects. (1) WIND shows by experiment on small-sample data that a classification tree built from current traffic features raises classification speed to 1.3 to 1.7 times that of the unoptimized Snort classification tree and uses nearly 15% less memory, and from this it concludes intuitively that dynamic traffic features can improve tree performance. It does not point out that it is the long-range dependence and self-similarity of network traffic that guarantee that a tree generated from current traffic features remains the "best" tree for the traffic over a period of time. (2) The node-cutting heuristic proposed by WIND suits only small-sample data; for real traffic on a large-scale backbone network the space occupied by the classification tree becomes unacceptably large. (3) WIND notes that traffic features change dynamically and that an optimal classification tree must adapt to these changes to stay "optimal", but it proposes no adaptive dynamic update strategy of its own. Key questions remain unresolved, such as how long a traffic sample to use, which dynamic traffic features can be used to improve lookup speed, and how to update the tree structure. The present invention addresses these problems.
Summary of the invention
The object of the invention is to provide a design method for high-speed multi-dimensional message classification. The method uses backbone traffic characteristics to optimize high-speed multi-dimensional message classification and reduces the problem to finding the optimum flow classification tree that minimizes the dynamic traffic classification cost in the current time slice. The solution of the optimum flow classification tree comprises: a layer division method that uses the traffic characteristics of a large-scale backbone network to optimize the message classification tree; a classification-field entropy calculation method that measures the classification capability of each classification field on the current traffic; a lookup method that uses flow-length characteristics to reduce the cost of classification-tree node lookup and copying and so raises classification speed; and an update method that dynamically updates the optimum flow classification tree.
The technical solution adopted by the invention is a design method for high-speed multi-dimensional message classification comprising the following steps:
Step 1: Use a MultiBloom Filter to compute, in real time, the flow length of the flow F to which message Pi belongs.
Step 2: If message Pi belongs to a short flow, Pi has no corresponding rule cache space and must still search the classification tree with the conventional P_Hicuts lookup method. The rule set of every classification-tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node of the classification tree, Pi together with its label set Ri is copied into the common message buffer.
Step 3: If the flow length of F has just reached the long-flow criterion, create a new entry for this new long flow in the long-flow hash table. The entry holds four items, a flow identifier and three pointers: the flow identifier FlowID, which is formed from 4-tuple information; the classification-tree intermediate node pointer p_midNodePointer, which points to the intermediate node that Pi reaches after searching the classification tree with its 5-tuple information; the classification-tree leaf node pointer p_finalNodePointer, which points to the leaf node that Pi last reached when searching the classification tree; and the rule label pointer p_rule, which points to the rule label of Pi in common memory.
Step 4: If Pi is a subsequent message of long flow F, Pi searches downward for its leaf node starting directly from the intermediate node pointer p_midNodePointer stored in the long-flow hash table. If this leaf node is the same as the leaf node of the other messages in the flow, the rule label need not be copied again and Pi is added directly to the flow's shared memory buffer pool through p_rule; otherwise proceed to Step 5.
Step 5: The leaf node of Pi differs from the one accessed by the other messages in the flow, so backtracking occurs. Starting from p_midNodePointer, the classification tree is searched with the conventional P_Hicuts lookup method; the rule set of every node traversed is copied into the rule label of Pi, and when the search reaches a leaf node, Pi together with its label set Ri is copied into the common message buffer.
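The five steps above can be illustrated with a minimal sketch. It is not the patented implementation: the tree is reduced to three levels (root, intermediate node, leaf), an exact dictionary counter stands in for the MultiBloom Filter, and the names `Node`, `LongFlowEntry`, `Classifier`, `LONG_FLOW_THRESHOLD`, `tuple_key` and `tail_key` are hypothetical. Only `FlowID`, `p_midNodePointer`, `p_finalNodePointer` and `p_rule` come from the text.

```python
from dataclasses import dataclass, field

LONG_FLOW_THRESHOLD = 3  # hypothetical long-flow criterion (messages per flow)

@dataclass
class Node:
    rules: list                                   # rule set stored at this node
    children: dict = field(default_factory=dict)  # edge key -> child Node

@dataclass
class LongFlowEntry:                              # one entry of the long-flow hash table
    flow_id: tuple                                # FlowID (4-tuple)
    p_midNodePointer: Node
    p_finalNodePointer: Node
    p_rule: list                                  # shared rule label of the flow

class Classifier:
    def __init__(self, root):
        self.root = root
        self.flow_len = {}    # exact counter, stand-in for the MultiBloom Filter
        self.long_flows = {}  # long-flow hash table keyed by FlowID
        self.copies = 0       # number of full rule-label copies performed

    def _walk(self, node, keys):
        """Conventional lookup: copy the rule set of every node visited."""
        labels = list(node.rules)
        for key in keys:
            node = node.children[key]
            labels.extend(node.rules)
        self.copies += 1
        return node, labels

    def classify(self, flow_id, tuple_key, tail_key):
        length = self.flow_len[flow_id] = self.flow_len.get(flow_id, 0) + 1  # Step 1
        entry = self.long_flows.get(flow_id)
        if entry is None:
            leaf, labels = self._walk(self.root, [tuple_key, tail_key])      # Step 2
            if length >= LONG_FLOW_THRESHOLD:                                # Step 3
                mid = self.root.children[tuple_key]
                self.long_flows[flow_id] = LongFlowEntry(flow_id, mid, leaf, labels)
            return labels
        leaf = entry.p_midNodePointer.children[tail_key]                     # Step 4
        if leaf is entry.p_finalNodePointer:
            return entry.p_rule              # same leaf: reuse the label, no copy
        _, labels = self._walk(entry.p_midNodePointer, [tail_key])           # Step 5
        return labels

leaf_a, leaf_b = Node(["r3"]), Node(["r4"])
mid = Node(["r2"], {"a": leaf_a, "b": leaf_b})
root = Node(["r1"], {"tcp": mid})
clf = Classifier(root)
flow = ("10.0.0.1", "10.0.0.2", 1234, 80)
for _ in range(4):
    labels = clf.classify(flow, "tcp", "a")
print(labels, clf.copies)  # ['r1', 'r2', 'r3'] 3
```

The fourth message of the flow is served from the long-flow entry, so only three label copies occur for four messages. In this sketch the backtracking path of Step 5 copies rule sets only from the intermediate node downward, which follows the wording of the step literally.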
Beneficial effects:
1. The invention uses dynamic traffic characteristics to improve the IDS classification-tree structure, reducing the complexity of message access to the classification tree and lowering the memory overhead of the algorithm while raising classification speed.
2. The invention overcomes the defects and deficiencies of the WIND algorithm and improves the dynamic update strategy of the classification tree.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Fig. 2 compares the entropy distributions of five classification fields.
Fig. 3 compares the per-hour entropy distributions of the other classification fields.
Fig. 4 compares the classification speed of three message classification algorithms.
Fig. 5 compares the memory usage of three message classification algorithms.
Fig. 6 shows the distribution of the number of long flows per minute on the backbone network.
Fig. 7 shows the per-hour share of long flows among all flows and the share of messages they carry.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings.
Embodiment one
1. Classification-field entropy calculation method
Definition of classification-field entropy:
Let $X = \{x_1, x_2, \dots, x_i, \dots, x_n\}$ denote the current traffic, where each $x_i$ is a message; X is also called the message set. Let $R = \{r_1, r_2, \dots, r_i, \dots, r_m\}$ be the attack rule set of the IDS and $A = \{a_1, a_2, \dots, a_j, \dots, a_u\}$ the set of classification fields. Non-uniform cutting of R on $a_j$ yields the class nodes $C_{a_j}^{i}$, $i = 1, \dots, t+1$. The classification-field entropy of $a_j$ with respect to the current traffic X and the rule set R is then defined as

$$H_X(a_j) = -\sum_{i=1}^{t+1} p\left(C_{a_j}^{i}\right) \log_2 p\left(C_{a_j}^{i}\right), \qquad p\left(C_{a_j}^{i}\right) = \frac{\mathrm{Occr}\left(C_{a_j}^{i}\right)}{n},$$

where $\mathrm{Occr}(C_{a_j}^{i})$ is the number of times class node $C_{a_j}^{i}$ is accessed by the messages in X and n is the total number of messages.
The classification-field entropy $H_X(a_j)$ takes values in $[0, \log_2(t+1)]$. When $H_X(a_j) = 0$, the accesses of the current traffic X under classification by $a_j$ are maximally concentrated: all messages access the same class node. When $H_X(a_j) = \log_2(t+1)$, the accesses of X are maximally dispersed: the messages in X access every class node the same number of times, $\mathrm{Occr}(C_{a_j}^{1}) = \mathrm{Occr}(C_{a_j}^{2}) = \dots = \mathrm{Occr}(C_{a_j}^{t+1})$.
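A minimal sketch of the entropy computation may make the two boundary cases concrete. It assumes the Shannon form of the entropy (the sum of p·log2(1/p) over the class nodes), which is the form consistent with the stated range [0, log2(t+1)]; the function name `field_entropy` and the node labels are hypothetical.

```python
import math
from collections import Counter

def field_entropy(class_hits):
    """Entropy in bits of the distribution of messages over class nodes.

    class_hits holds one class-node label per message, i.e. the node that
    each message of X accesses when classified on one field a_j.
    """
    counts = Counter(class_hits)          # Occr(C^i) for every accessed node
    n = sum(counts.values())              # total number of messages
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in counts.values())

concentrated = ["C1"] * 8                 # every message hits the same node
uniform = ["C1", "C2", "C3", "C4"] * 2    # four nodes hit equally often

print(field_entropy(concentrated))  # 0.0
print(field_entropy(uniform))       # 2.0
```

With four equally accessed class nodes the entropy reaches its maximum log2(4) = 2, and with a single accessed node it is 0, matching the two boundary cases described above.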
Using this definition, the invention examined the classification capability of the 14 classification fields available in current Snort; their value ranges in the experiments and their ranking by classification capability are shown in Table 1. To measure the continuous trend of these capabilities, NetFlow data from a 10 Gbps link at the CERNET Jiangsu provincial border, covering 8 consecutive hours on February 24, 2009 with a sampling rate of 1/256, were used as experimental data, and the per-minute value distribution and trend of five classification-field entropies (DstPort, SrcPort, DstIp, TcpFlags and SrcIp) under real traffic were examined, as shown in Fig. 2. There are three reasons for this choice: (1) the NetFlow data cover a large volume of traffic over a long period and so better reflect the macroscopic trend of entropy change; (2) research shows that random sampling has little effect on entropy measurement, so sampled data preserve the accuracy of the entropy measure while reducing processing complexity; (3) the time slice is set to one minute because the classification tree is constructed at minute granularity, so entropy changes at finer than minute granularity are irrelevant to optimizing the tree structure. However, NetFlow data can measure only the entropy changes of the five classification fields above (the five-tuple), because NetFlow records carry no information on the other classification fields; the entropies of the other fields can be computed only from the packet trace of the corresponding period. Because backbone traces are very large, only the per-minute entropy of the other classification fields during the first 5 hours on the same 10 Gbps link was examined, as shown in Fig. 3.
Table 1: Ranking of classification fields by classification capability
Fig. 2 and Fig. 3 show the continuous value distributions of the 14 classification fields. Fig. 2 shows the per-minute entropy of the five-tuple information (source/destination address, source/destination port and protocol) over 8 consecutive hours. Fig. 3 shows the per-minute entropy of the other classification fields over 5 hours. Analysis of Fig. 2 and Fig. 3 yields the entropy value ranges listed in Table 1.
The experimental results show: (1) Only the 5 classification fields whose entropy exceeds 0.5 actually discriminate the traffic. The order of their entropy values hardly changes over time, which shows that their classification capabilities differ clearly. The order is H(DstPort) > H(SrcPort) > H(DstIp) > H(TcpFlags) > H(SrcIp). A special case is that H(DstIp) and H(TcpFlags) both lie between 1.1 and 1.2 and almost coincide; since TcpFlags has more abrupt change points, H(DstIp) > H(TcpFlags) is taken. (2) The other classification fields have weak classification capability. In particular, the entropies of TCPAck and ICMPSeq are almost always 0 over several hours, which means that for most of the experiment all traffic accessed a single class node, so TCPAck and ICMPSeq have almost no classification capability.
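The selection described above (keep the fields whose entropy exceeds 0.5 and order them by entropy) can be sketched as follows. The packets are synthetic toy values, not measurements, and the sketch assumes that every distinct field value is its own class node, whereas the patent derives class nodes from non-uniform cutting of the rule set. The names `entropy` and `rank_fields` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a sequence of class-node labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values()) if n else 0.0

def rank_fields(packets, fields, threshold=0.5):
    """Return (field, entropy) pairs above the threshold, highest entropy first."""
    scored = [(f, entropy(p[f] for p in packets)) for f in fields]
    useful = [pair for pair in scored if pair[1] > threshold]
    return sorted(useful, key=lambda pair: pair[1], reverse=True)

# Synthetic packets for illustration only.
packets = [
    {"DstPort": 80,  "TcpFlags": "ACK", "ICMPSeq": 0},
    {"DstPort": 443, "TcpFlags": "ACK", "ICMPSeq": 0},
    {"DstPort": 22,  "TcpFlags": "SYN", "ICMPSeq": 0},
    {"DstPort": 53,  "TcpFlags": "ACK", "ICMPSeq": 0},
]
ranking = rank_fields(packets, ["DstPort", "TcpFlags", "ICMPSeq"])
print([name for name, _ in ranking])  # ['DstPort', 'TcpFlags']
```

In the toy data ICMPSeq always takes the same value, so its entropy is 0 and it is dropped, which mirrors the observation that a field accessed through a single class node contributes no classification capability.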
Embodiment two
As shown in Fig. 1, the invention provides a design method for high-speed multi-dimensional message classification comprising the following steps:
Step 1: Use a MultiBloom Filter to compute, in real time, the flow length of the flow F to which message Pi belongs.
Step 2: If message Pi belongs to a short flow, Pi has no corresponding rule cache space and must still search the classification tree with the conventional P_Hicuts lookup method. The rule set of every classification-tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node of the classification tree, Pi together with its label set Ri is copied into the common message buffer.
Step 3: If the flow length of F has just reached the long-flow criterion, create a new entry for this new long flow in the long-flow hash table. The entry holds four items, a flow identifier and three pointers: the flow identifier FlowID, which is formed from 4-tuple information; the classification-tree intermediate node pointer p_midNodePointer, which points to the intermediate node that Pi reaches after searching the classification tree with its 5-tuple information; the classification-tree leaf node pointer p_finalNodePointer, which points to the leaf node that Pi last reached when searching the classification tree; and the rule label pointer p_rule, which points to the rule label of Pi in common memory.
Step 4: If Pi is a subsequent message of long flow F, Pi searches downward for its leaf node starting directly from the intermediate node pointer p_midNodePointer stored in the long-flow hash table. If this leaf node is the same as the leaf node of the other messages in the flow, the rule label need not be copied again and Pi is added directly to the flow's shared memory buffer pool through p_rule; otherwise proceed to Step 5.
Step 5: The leaf node of Pi differs from the one accessed by the other messages in the flow, so backtracking occurs. Starting from p_midNodePointer, the classification tree is searched with the conventional P_Hicuts lookup method; the rule set of every node traversed is copied into the rule label of Pi, and when the search reaches a leaf node, Pi together with its label set Ri is copied into the common message buffer.
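Step 1 relies on a MultiBloom Filter, whose internal structure is not given in the text. As a stand-in, the sketch below uses a count-min style structure: several hashed counter arrays are incremented per message and the minimum counter is taken as the flow-length estimate. This is an assumption about one plausible design, not the patent's filter; the class name `FlowLengthSketch` and the parameters `stages` and `width` are hypothetical.

```python
import hashlib

class FlowLengthSketch:
    """Approximate per-flow message counter built from hashed counter arrays."""

    def __init__(self, stages=4, width=1024):
        self.stages = stages
        self.width = width
        self.counters = [[0] * width for _ in range(stages)]

    def _slots(self, flow_id):
        # One independent hash per stage, derived by salting BLAKE2b.
        key = repr(flow_id).encode()
        for stage in range(self.stages):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([stage])).digest()
            yield stage, int.from_bytes(digest, "big") % self.width

    def add(self, flow_id):
        """Count one message of flow_id and return the estimated flow length."""
        estimate = None
        for stage, slot in self._slots(flow_id):
            self.counters[stage][slot] += 1
            value = self.counters[stage][slot]
            estimate = value if estimate is None else min(estimate, value)
        return estimate

sketch = FlowLengthSketch()
flow = ("10.0.0.1", "10.0.0.2", 1234, 80)
for _ in range(5):
    length = sketch.add(flow)
print(length)  # 5
```

Hash collisions can only inflate a counter, so the estimate never falls below the true flow length; a flow may be promoted to a long flow slightly early but never late, and memory stays fixed regardless of the number of flows.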
The invention saves memory copy operations and time for long-flow messages, so its effect depends on the number of long flows in the traffic and on the share of long-flow messages in the total message count. Large-scale backbone traffic exhibits a typical heavy-tailed flow-length distribution, which guarantees the effectiveness of the algorithm. The experimental data show that on a 10 Gbps backbone network, handling state for on the order of 10^3 long flows per minute on average speeds up classification for on the order of 10^6 messages on average.
2. FCS algorithm implementation
The implementation places no special requirements on the system; the implemented algorithm is called FCS. The experiments used two servers with dual Intel Xeon 3.06 GHz processors and 2 GB of memory, one as traffic generator and one as message classifier. Two message data sources were used as simulated traffic input: the public DARPA 1999 week-4 test data set, and a 1/4 load-balanced trace from a 10 Gbps backbone link of the East China (North) region, which is equivalent to a 2.5 Gbps link with an actual average traffic of about 1.2 Gbps. Although the DARPA data and the trace have different native rates, the traffic generator replays both at a fixed rate of 150 Kpps, simulating an average of 600 Mbps, and the classification speeds of the three algorithms FCS, P-Hicuts and WIND are compared at this rate. Classification speed is measured as the average time needed to classify every 10K messages, and every result is the mean of 5 experiments. The experiments use the official Snort rule base 3.1.5, which contains 2579 rules, with the 14 classification fields described above.
Fig. 4 compares the processing speed of the three classification algorithms. As the figure shows, FCS classifies messages faster than the other two algorithms on both the DARPA data set and the backbone trace. The optimization achieved by FCS on the trace is clearly greater than on the small artificial DARPA99 data set.
Fig. 5 compares the run-time memory usage of the three algorithms on the two test data sets, the trace and the DARPA data set.
The main improvement of the invention is that it saves memory copy operations and time for long-flow messages, so the gain depends on the number of long flows in the traffic and on the share of long-flow messages in the total message count. Fig. 6 shows the number of long flows per minute in one hour of traffic on a backbone network; the average is on the order of 10^3. Fig. 7 shows, for the same period, the share of long flows in the total number of flows and the share of long-flow messages in the total number of messages. The results show that long flows, which account for 10% to 20% of all flows, carry 50% to 99% of the message load; this heavy-tailed flow-length distribution guarantees the effectiveness of the invention.
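The two quantities plotted in Fig. 7 can be computed from per-flow message counts as in the sketch below. The flow sizes are synthetic and the long-flow threshold of 100 messages is a hypothetical choice, since the text does not state the long-flow criterion; the function name `long_flow_shares` is also hypothetical.

```python
def long_flow_shares(flow_sizes, threshold):
    """Return (share of flows that are long, share of messages they carry)."""
    long_sizes = [size for size in flow_sizes if size >= threshold]
    flow_share = len(long_sizes) / len(flow_sizes)
    message_share = sum(long_sizes) / sum(flow_sizes)
    return flow_share, message_share

# Synthetic heavy-tailed sample: many tiny flows and two very long ones.
sizes = [1, 2, 1, 3, 2, 1, 2, 1, 500, 300]
flow_share, message_share = long_flow_shares(sizes, threshold=100)
print(round(flow_share, 2), round(message_share, 2))  # 0.2 0.98
```

In this toy sample 20% of the flows carry about 98% of the messages, the same shape as the 10% to 20% of flows carrying 50% to 99% of the load reported above, which is why caching state for a small number of long flows can cover most messages.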

Claims (4)

1. A design method for high-speed multi-dimensional message classification, characterized in that the method comprises the following steps:
Step 1: use a MultiBloom Filter to compute, in real time, the flow length of the flow F to which message Pi belongs;
Step 2: if message Pi belongs to a short flow, Pi has no corresponding rule cache space and must still search the classification tree with the conventional P_Hicuts lookup method; the rule set of every classification-tree node traversed is copied into the rule label of Pi, and when the search reaches a leaf node of the classification tree, Pi together with its label set Ri is copied into the common message buffer;
Step 3: if the flow length of F has just reached the long-flow criterion, a new entry is created for this new long flow in the long-flow hash table, the entry holding four items, a flow identifier and three pointers: the flow identifier FlowID, which is formed from 4-tuple information; the classification-tree intermediate node pointer p_midNodePointer, which points to the intermediate node that Pi reaches after searching the classification tree with its 5-tuple information; the classification-tree leaf node pointer p_finalNodePointer, which points to the leaf node that Pi last reached when searching the classification tree; and the rule label pointer p_rule, which points to the rule label of Pi in common memory;
Step 4: if Pi is a subsequent message of long flow F, Pi searches downward for its leaf node starting directly from the intermediate node pointer p_midNodePointer stored in the long-flow hash table; if this leaf node is the same as the leaf node of the other messages in the flow, the rule label need not be copied again and Pi is added directly to the flow's shared memory buffer pool through p_rule; otherwise Step 5 is performed;
Step 5: the leaf node of Pi differs from the one accessed by the other messages in the flow, so backtracking occurs; starting from p_midNodePointer, the classification tree is searched with the conventional P_Hicuts lookup method, the rule set of every node traversed is copied into the rule label of Pi, and when the search reaches a leaf node, Pi together with its label set Ri is copied into the common message buffer.
2. The design method for high-speed multi-dimensional message classification according to claim 1, characterized in that the method uses backbone traffic characteristics to optimize high-speed multi-dimensional message classification and reduces the problem to finding the optimum flow classification tree that minimizes the dynamic traffic classification cost in the current time slice; the solution of the optimum flow classification tree comprises a layer division method that uses the traffic characteristics of a large-scale backbone network to optimize the message classification tree, a classification-field entropy calculation method that measures the classification capability of each classification field on the current traffic, and the use of flow-length characteristics to reduce the cost of classification-tree node lookup and copying.
3. The design method for high-speed multi-dimensional message classification according to claim 2, characterized in that the classification fields that actually discriminate the traffic are the 5 classification fields whose entropy exceeds 0.5, and the order of their entropy values does not change over time.
4. The design method for high-speed multi-dimensional message classification according to claim 1, characterized in that the method exploits the heavy-tailed nature of large-scale backbone traffic, namely that long flows, a minority of all flows, carry the majority of the message load; long flows accounting for 10% to 20% of all flows carry 50% to 99% of the message load.
CN201410730111.8A 2014-12-04 2014-12-04 Design method for high-speed multi-dimension message classification Pending CN104486308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410730111.8A CN104486308A (en) 2014-12-04 2014-12-04 Design method for high-speed multi-dimension message classification

Publications (1)

Publication Number Publication Date
CN104486308A true CN104486308A (en) 2015-04-01

Family

ID=52760812

Country Status (1)

Country Link
CN (1) CN104486308A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1688140A (en) * 2005-06-03 2005-10-26 清华大学 High-speed multi-dimension message classifying algorithm design and realizing based on network processor
US20060129689A1 (en) * 2004-12-10 2006-06-15 Ricky Ho Reducing the sizes of application layer messages in a network element
US20080259956A1 (en) * 2004-03-31 2008-10-23 Lg Electronics Inc. Data Processing Method for Network Layer
CN102255788A (en) * 2010-05-19 2011-11-23 北京启明星辰信息技术股份有限公司 Message classification decision establishing system and method and message classification system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁卓 等: "利用流量特征的GIDS报文分类优化算法", 《电子学报》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150401
