CN1822567A

CN1822567A - Multi-domain Network Packet Classification Method Based on Network Flow

Info

Publication number: CN1822567A
Application number: CNA200510130708XA
Authority: CN
Inventors: 亓亚烜; 李军
Original assignee: Tsinghua University
Current assignee: CERTUSNET CORP
Priority date: 2005-12-23
Filing date: 2005-12-23
Publication date: 2006-08-23
Anticipated expiration: 2025-12-23
Also published as: CN100387029C

Abstract

The present invention relates to the technical field of network filtering and monitoring. The method comprises the following steps: a receiving unit receives arriving network packets and collects their header information; a sampling unit gathers and normalizes network traffic statistics; a calculation unit builds the packet classification data structure from the configured rule set and the traffic statistics; a classification unit obtains the packet header information and classifies each packet using the classifier data structure produced by the calculation unit; and a forwarding unit forwards the packets in the output queue according to the classification result. The invention is implemented on a general-purpose microprocessor platform or a dedicated network processor platform. By combining the dynamic statistical properties of network traffic with the static structural properties of the rule set, it optimizes packet classification, raising the average classification rate by 80 to 400 percent over comparable methods at home and abroad and reducing memory requirements by 30 to 600 percent.

Description

Multi-domain Network Packet Classification Method Based on Network Flow

Technical field

The invention relates to the technical field of network filtering and monitoring.

Background

In many policy-based network filtering and monitoring applications, such as layer-4 switches, stateful inspection firewalls, QoS routers and load balancers, multi-domain (multi-field) network packet classification is the core technology and a key determinant of system performance. Compared with traditional layer-2 switching and layer-3 routing, multi-domain packet classification inspects and processes not only Ethernet and IP headers but also the headers of higher-layer protocols such as TCP and UDP. The multi-domain packet classification problem is an instance of the best-matching filter rule problem. For common packet classification at layer 4 and below, seven header fields can serve as filter-rule fields: source and destination network-layer addresses (32 bits each), source and destination transport-layer ports (16 bits each), type of service (8 bits), protocol (8 bits) and transport-layer protocol flags (8 bits), 120 bits in total. In practice, filter rules do not involve all seven fields; the vast majority of network applications are limited to five of them (source and destination network-layer addresses, source and destination transport-layer ports, and protocol). The method proposed by the invention applies to multi-domain packet classification problems of any dimension.

Multi-domain network packet classification has attracted much attention from academia and industry worldwide in recent years. At present, high-end policy routers at gigabit rates and above all use dedicated-chip solutions such as ASICs and FPGAs. Because equipment based on dedicated chips has a long development cycle, large size and power consumption, and is difficult to update, the cost-effectiveness of such products is very low, which limits the widespread use of high-performance packet classification equipment. Researching and developing new high-performance packet classification methods, and combining them with the latest software and hardware platforms, has therefore become the necessary route to wider adoption of such equipment. Researchers at Stanford University, the University of California, San Diego, Cisco, Lucent, ServGate and elsewhere have carried out many studies and experiments in this area and proposed a series of solutions, which fall mainly into two classes: the decision-tree search methods HiCuts [P. Gupta and N. McKeown, "Packet classification using hierarchical intelligent cuttings," Proc. Hot Interconnects, 1999] and HyperCuts [F. Baboescu and G. Varghese, "Packet classification Using Multidimensional Cutting," Proc. ACM SIGCOMM, 2003], and the multi-field table lookup methods RFC [P. Gupta and N. McKeown, "Packet classification on multiple fields," Proc. ACM SIGCOMM 99, 1999] and HSM [Bo Xu, Dongyi Jiang, Jun Li, "HSM: A Fast Packet Classification Algorithm", The IEEE 19th International Conference on Advanced Information Networking and Applications (AINA), Taiwan, 2005]. Both classes exploit the structural characteristics of the classification rule set from different angles, using a variety of heuristics to eliminate redundancy in the search space and speed up packet classification.

However, existing solutions exploit only the characteristics of the classification rules themselves and do not take the statistical characteristics of network traffic into account. In practical packet classification the following situation is common: the vast majority of packets match rules in only a certain subset of the rule set, and a considerable number of rules are matched by very few packets. The cause is the strictness required of classification rules: a corresponding classification policy must be set for every business or security requirement. For example, to prevent a certain type of worm on the external network from invading the internal network, the firewall policy will contain a rule targeting that worm's port, which filters and blocks external packets carrying the worm's characteristics. But worm activity is likely to be an isolated phenomenon confined to a particular period, so for most of the network's normal operation very few packets match this rule. Because rules are often not optimized for the specific network and are hard to update in a timely way, a considerable part of the rule base in a packet classification device is rarely matched; that is, these rules seldom take part in actual classification during normal operation, yet their mere existence greatly increases the complexity of the classification problem. Because existing packet classification methods do not take this traffic-related reality into account, they cannot remove the complexity that useless (rarely matched) rules add to classification under particular traffic, and so cannot further raise the average transmission rate of the network.

Summary of the invention

The purpose of the present invention is to provide a multi-domain network packet classification method, based on the statistical characteristics of network traffic, that has a high average classification rate and a small memory footprint.

The invention is characterized in that the method is implemented on a general-purpose microprocessor platform or a dedicated network processor platform and comprises the following steps in order:

Step 1. The packet receiving unit receives input network packets:

The network card of this unit receives each arriving packet and parses its header information, then outputs the header information to the buffer queue in the unit's dynamic buffer memory;

Step 2. The sampling unit gathers network traffic statistics from the packet header information output by the packet receiving unit at a set sampling interval, and at each update time normalizes the gathered statistics to obtain the prior distribution of the traffic. Step 2 comprises the following steps in order:

Step 2.1. Sampling:

The processor in the sampling unit, according to the set sampling interval Ns, extracts the five-tuple header information of one packet out of every N consecutively arriving packets in the input header information queue as one sample. Each sample is a data group consisting of the following five fields:

32-bit source IP address, represented by int32 sIP, where int denotes an integer type (likewise below);

32-bit destination IP address, represented by int32 dIP;

16-bit source port, represented by int16 sPort;

16-bit destination port, represented by int16 dPort;

8-bit protocol, represented by int8 protocol;
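The sampling in Step 2.1 can be illustrated with a minimal Python sketch. The names `FiveTuple` and `sample_headers` are hypothetical, and keeping the first header of each group of `ns` packets is an assumption: the text only says that one of the N packets is recorded.

```python
from typing import Iterable, Iterator, NamedTuple


class FiveTuple(NamedTuple):
    """One sampled packet header (field widths follow Step 2.1)."""
    sIP: int       # 32-bit source IP address
    dIP: int       # 32-bit destination IP address
    sPort: int     # 16-bit source port
    dPort: int     # 16-bit destination port
    protocol: int  # 8-bit protocol


def sample_headers(headers: Iterable[FiveTuple], ns: int) -> Iterator[FiveTuple]:
    """Yield one header out of every `ns` consecutive headers.

    Assumption: the first header of each group is the one kept.
    """
    if ns < 1:
        raise ValueError("sampling interval must be at least 1")
    for index, header in enumerate(headers):
        if index % ns == 0:
            yield header
```

With `ns = 1` every header is sampled; larger values trade statistical accuracy for lower sampling overhead.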

Step 2.2. Recording statistics:

The statistics comprise:

The number of times each class-B source network segment appears in the samples, represented by int32 sIP[65536]; the class-B network segment refers to the first 16 bits of the 32-bit IP address (likewise below);

The number of times each class-B destination network segment appears in the samples, represented by int32 dIP[65536];

The number of times each source port appears in the samples, represented by int32 sPort[65536];

The number of times each destination port appears in the samples, represented by int32 dPort[65536];

The number of times each protocol appears in the samples, represented by int32 protocol[256];
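As a rough illustration of Step 2.2, the counters can be kept in five arrays that mirror the declarations above. The class name `Stats` follows the symbol used in the normalization formulas; the `record` method and the helper `_counters` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def _counters(size: int):
    """Dataclass field holding `size` zero-initialised counters."""
    return field(default_factory=lambda: [0] * size)


@dataclass
class Stats:
    sIP: List[int] = _counters(65536)      # indexed by upper 16 bits of source IP
    dIP: List[int] = _counters(65536)      # indexed by upper 16 bits of destination IP
    sPort: List[int] = _counters(65536)    # indexed by source port
    dPort: List[int] = _counters(65536)    # indexed by destination port
    protocol: List[int] = _counters(256)   # indexed by protocol number

    def record(self, sample: Tuple[int, int, int, int, int]) -> None:
        """Count one sampled five-tuple (sIP, dIP, sPort, dPort, protocol)."""
        s_ip, d_ip, s_port, d_port, proto = sample
        self.sIP[s_ip >> 16] += 1   # class-B segment = first 16 bits
        self.dIP[d_ip >> 16] += 1
        self.sPort[s_port] += 1
        self.dPort[d_port] += 1
        self.protocol[proto] += 1
```

Indexing IP addresses by their upper 16 bits keeps each table at 65536 entries instead of 2³² entries.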

Step 2.3. Normalization:

When the classifier data structure is updated, the statistics data structure is normalized to obtain the prior distribution, whose data structure is the following two-dimensional array:

float Prior[i][j],

Prior[i][j] is the normalized statistic of each field, where i = 0, 1, 2, 3, 4 corresponds in order to the source IP address, destination IP address, source port, destination port and protocol fields, and j = 0, …, 65535;

For the source IP address:

$Prior[0][j] = \frac{Stats.sIP[j]}{\sum_{k=0}^{65535} Stats.sIP[k]},$ where Stats denotes the statistics (likewise below);

For the destination IP address:

$Prior[1][j] = \frac{Stats.dIP[j]}{\sum_{k=0}^{65535} Stats.dIP[k]},$

For the source port:

$Prior[2][j] = \frac{Stats.sPort[j]}{\sum_{k=0}^{65535} Stats.sPort[k]},$

For the destination port:

$Prior[3][j] = \frac{Stats.dPort[j]}{\sum_{k=0}^{65535} Stats.dPort[k]},$

For the protocol field:

$Prior[4][j] = \frac{Stats.protocol[j]}{\sum_{k=0}^{255} Stats.protocol[k]}$ for 0 ≤ j ≤ 255 (Stats.protocol has only 256 entries); when 256 ≤ j ≤ 65535, Prior[4][j] = 0;
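The normalization of Step 2.3 can be sketched as follows. This is only an illustration: the function name `normalize`, the dictionary layout of the counters, and the choice to return an all-zero row when a field has no samples are assumptions not stated in the text.

```python
from typing import Dict, List

# Row order of Prior: i = 0..4, as defined in Step 2.3.
FIELDS = ("sIP", "dIP", "sPort", "dPort", "protocol")
ROW_LEN = 65536


def normalize(stats: Dict[str, List[int]]) -> List[List[float]]:
    """Turn per-field counters into the prior distribution Prior[i][j]."""
    prior = []
    for name in FIELDS:
        counts = stats[name]
        total = sum(counts)
        row = [0.0] * ROW_LEN          # protocol entries 256..65535 stay 0
        if total > 0:                  # assumption: no samples -> all-zero row
            for j, count in enumerate(counts):
                row[j] = count / total
        prior.append(row)
    return prior
```

Each non-empty row sums to 1, so an entry can be read as the estimated probability that a packet carries the corresponding field value.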

Step 3. Classification rules are set in the calculation unit, and the updated classifier data structure and the normalized rule set are then output. Normalization here means describing rules that carry masks or wildcards by the intervals they cover in the five-tuple fields:

The calculation unit contains a processor and a high-speed storage device. The processor's input for the prior distribution of the network traffic statistics is connected to the corresponding output of the sampling unit, and the processor also has a rule-set input;

The classifier data structure is a decision tree in which each leaf node stores a subset of rules in a linear list. When updating the classifier data structure, the calculation unit partitions the search space level by level, steadily narrowing the search range and the number of rules, so that a search over the entire rule set becomes a search over a subset of it. When the number of rules in the subset being searched is less than a set threshold, the current subset is searched by linear search over the linear list to obtain the final classification result. The threshold is denoted Thresh and is set to 1 to 10;

The classifier data structure is updated by the following steps in order:

Step 3.1. Initialization:

The search space is set to U, the space of all possible values of the multi-domain packet header information. The initial search space of the five-tuple is, field by field: [0, 2³²-1], [0, 2³²-1], [0, 2¹⁶-1], [0, 2¹⁶-1], [0, 2⁸-1];

A partition of the space is defined as follows: dividing the value range of one or more specified fields into a set of sub-ranges yields a partition of the current search space;

Step 3.2. The rule set and system parameters are input to the calculation unit through computer peripherals. The system parameters include the update period, the sampling interval, and adjustable parameters of the classifier data structure such as Thresh, spfac_min and spfac_max;

A rule in the rule set is denoted R and contains:

The source IP address interval described by two 32-bit integers, represented as int32 sIP[2];

The destination IP address interval described by two 32-bit integers, represented as int32 dIP[2];

The source port interval described by two 16-bit integers, represented as int16 sPort[2];

The destination port interval described by two 16-bit integers, represented as int16 dPort[2];

The protocol interval described by two 8-bit integers, represented as int8 protocol[2];

The rule priority, represented by int32 priority;

The classification result, represented by int8 action;
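The rule layout above maps naturally onto a small record with one inclusive interval per field. The sketch below is illustrative only; the `ranges` and `matches` helpers are hypothetical additions that show how a five-tuple is range-matched against a rule.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Range = Tuple[int, int]  # inclusive [low, high]


@dataclass(frozen=True)
class Rule:
    sIP: Range
    dIP: Range
    sPort: Range
    dPort: Range
    protocol: Range
    priority: int
    action: int

    def ranges(self) -> Tuple[Range, ...]:
        """Intervals in five-tuple order."""
        return (self.sIP, self.dIP, self.sPort, self.dPort, self.protocol)

    def matches(self, header: Sequence[int]) -> bool:
        """True if every header field lies inside the rule's interval."""
        return all(low <= value <= high
                   for value, (low, high) in zip(header, self.ranges()))
```

A wildcard field is simply the full interval of that field, for example `(0, 65535)` for a port.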

Step 3.3. The decision tree node data structure, i.e. the data structure of the classifier, is set in the calculation unit as follows. Each node contains:

The split dimension,

The number of cuts,

Node information: 0 for an internal node; for a leaf node, the number of rules, a value from 1 to Thresh, where Thresh is the threshold on the number of rules in a leaf node;

Jump pointer: for an internal node, the pointer points to the first of its child nodes; for a leaf node, it points to the linear list holding the rule IDs of the corresponding rule subset;

Step 3.4. Create the root node, denoted v_root, and assign all rules to it;

Step 3.5. Put the root node into the current queue Q₁;

Step 3.6. Take a node v_c out of the current queue;

Step 3.7. Determine whether the number of rules in node v_c is greater than the threshold T (i.e. Thresh);

Step 3.8.

If the number of rules in node v_c does not exceed the threshold T, set v_c as a leaf node;

If the number of rules in node v_c is greater than the threshold T, proceed to the next step;
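Steps 3.3 to 3.8 describe a breadth-first construction driven by a queue. The following sketch shows that control flow under stated assumptions: `Node`, `build_tree` and the pluggable `split` callback are hypothetical names, the callback stands in for Steps 3.9 and 3.10, and the guard that stops when a split makes no progress is an addition of this sketch, not part of the text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

# split(rule_ids) -> (dimension, list of child rule-id subsets)
SplitFn = Callable[[List[int]], Tuple[int, List[List[int]]]]


@dataclass
class Node:
    rule_ids: List[int]
    dim: int = -1        # split dimension (internal nodes only)
    num_cuts: int = 0    # number of cuts (internal nodes only)
    info: int = 0        # 0 for internal nodes, rule count for leaves
    children: List["Node"] = field(default_factory=list)


def build_tree(rule_ids: Iterable[int], thresh: int, split: SplitFn) -> Node:
    root = Node(rule_ids=list(rule_ids))   # Step 3.4: root owns all rules
    queue = deque([root])                  # Step 3.5
    while queue:
        node = queue.popleft()             # Step 3.6
        count = len(node.rule_ids)
        if count <= thresh:                # Steps 3.7 and 3.8: small enough
            node.info = count
            continue
        dim, parts = split(node.rule_ids)  # placeholder for Steps 3.9 and 3.10
        # Assumption: stop if no child is strictly smaller than its parent.
        if not parts or max(len(p) for p in parts) >= count:
            node.info = count
            continue
        node.dim, node.num_cuts = dim, len(parts)
        node.children = [Node(rule_ids=list(p)) for p in parts]
        queue.extend(node.children)
    return root
```

Note that a leaf holding no rules would also carry `info == 0`; a real implementation would need another way to tell such a leaf from an internal node.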

Step 3.9. Based on the prior probability of the current node v_c and the number N of rules in the rule subset corresponding to that node, allocate memory space to the child nodes of v_c using either of the following two space allocation functions:

The first:

$SpaceAlloc(v_c) = N * spfac(v_c) = N * \left(spfac_{min} + \left(P_{v_c}^{l} - \min(P^{l})\right) * K\right)$

where $P_{v_c}^{l} = \frac{1}{F} \sum_{i=0}^{F-1} \sum_{j=a_i}^{b_i} Prior[i][j]$ is the prior probability of the traffic at node v_c;

When max(P^l) - min(P^l) > 0: K = (spfac_max - spfac_min) / (max(P^l) - min(P^l))

When max(P^l) - min(P^l) = 0: K = 0,

where

N is the number of rules of the current node;

l is the depth of the node in the decision tree;

[a_i, b_i] is the interval, in field i, of the search space corresponding to node v_c;

max(P^l) and min(P^l) are the maximum and minimum prior probabilities over all nodes at level l of the decision tree;

spfac_min and spfac_max bound the value of spfac(v_c) and are generally set to spfac_min = 1, spfac_max = 4; F = 5 denotes the 5 fields;
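The first allocation function scales the space factor linearly between spfac_min and spfac_max according to where the node's prior probability falls within its level. A minimal sketch, assuming the intervals passed to the hypothetical helper `node_prior` are already expressed as indices into Prior (for IP fields, that means the upper 16 bits):

```python
from typing import List, Sequence, Tuple


def node_prior(prior: List[List[float]],
               intervals: Sequence[Tuple[int, int]]) -> float:
    """Average over the fields of the prior mass inside [a_i, b_i]."""
    total = sum(sum(prior[i][a:b + 1]) for i, (a, b) in enumerate(intervals))
    return total / len(intervals)


def spfac_linear(p_node: float, p_min: float, p_max: float,
                 spfac_min: float = 1.0, spfac_max: float = 4.0) -> float:
    """Space factor of a node whose level has priors in [p_min, p_max]."""
    spread = p_max - p_min
    k = (spfac_max - spfac_min) / spread if spread > 0 else 0.0
    return spfac_min + (p_node - p_min) * k


def space_alloc(n_rules: int, p_node: float, p_min: float, p_max: float,
                spfac_min: float = 1.0, spfac_max: float = 4.0) -> float:
    """SpaceAlloc(v_c) = N * spfac(v_c), first variant."""
    return n_rules * spfac_linear(p_node, p_min, p_max, spfac_min, spfac_max)
```

The node with the least traffic on its level receives N * spfac_min and the busiest node receives N * spfac_max, so more memory, and hence finer cuts, goes to the regions that most packets traverse.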

The second:

$SpaceAlloc(v_c) = N * spfac(v_c) = N * BOUND\left(P_{v_c}^{l} * D^{l} * spfac_{avg},\ spfac_{min},\ spfac_{max}\right),$

$BOUND(a, b, c) = \begin{cases} a & b \leq a \leq c \\ b & a < b \\ c & a > c \end{cases}$

$spfac_{avg} = \frac{1}{2}\left(spfac_{min} + spfac_{max}\right)$

where D^l is the number of nodes of the decision tree at level l;
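The second allocation function can be sketched in a few lines. The function names `bound` and `space_alloc_bounded` are hypothetical; the arithmetic follows the formulas above.

```python
def bound(a: float, b: float, c: float) -> float:
    """BOUND(a, b, c): clamp a into the interval [b, c]."""
    return max(b, min(a, c))


def space_alloc_bounded(n_rules: int, p_node: float, nodes_at_level: int,
                        spfac_min: float = 1.0, spfac_max: float = 4.0) -> float:
    """SpaceAlloc(v_c) = N * BOUND(P * D^l * spfac_avg, spfac_min, spfac_max)."""
    spfac_avg = 0.5 * (spfac_min + spfac_max)
    spfac = bound(p_node * nodes_at_level * spfac_avg, spfac_min, spfac_max)
    return n_rules * spfac
```

A node whose prior probability equals the uniform share 1/D^l receives exactly spfac_avg; nodes with more or less than their share move toward spfac_max or spfac_min and are clamped there.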

Step 3.10. Determining the split dimension and the number of cuts:

First compute the memory usage of node v with the following formula,

$sm(v, f) = \sum_{i=1}^{numCuts(v, f)} numRules(v_i) + numCuts(v, f)$

where f is the dimension being cut, v_i is a child node of v, and numRules(v) is the number of rules belonging to a node. The optimal number of cuts numCuts(v, f) in field f is chosen by maximizing the memory estimate sm(v, f) subject to the constraint sm(v, f) ≤ SpaceAlloc(v);

Once the optimal number of cuts in each field is determined, the search space of field f can be divided evenly into numCuts(v, f) parts, f = 1, …, F. Supposing N_i rules fall into the i-th subspace, i = 1, …, numCuts(v, f), the f that minimizes $\frac{1}{numCuts(v, f)} \sum_{i=1}^{numCuts(v, f)} N_i$ is chosen as the split dimension of node v;
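A small sketch of Step 3.10 follows. It makes two assumptions that the text does not state: candidate cut counts are powers of two, and a rule is counted in every child interval it overlaps. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]  # inclusive [low, high]


def child_rule_counts(rules: Sequence[Sequence[Range]], dim: int,
                      space: Range, num_cuts: int) -> List[int]:
    """Number of rules overlapping each of num_cuts equal parts of `space`."""
    low, high = space
    size = high - low + 1
    counts = []
    for i in range(num_cuts):
        c_low = low + i * size // num_cuts
        c_high = low + (i + 1) * size // num_cuts - 1
        counts.append(sum(1 for r in rules
                          if r[dim][0] <= c_high and r[dim][1] >= c_low))
    return counts


def best_cuts(rules, dim: int, space: Range,
              alloc: float) -> Optional[Tuple[int, List[int]]]:
    """Cut count maximising sm(v, f) subject to sm(v, f) <= alloc."""
    best = None
    num_cuts = 2
    while num_cuts <= space[1] - space[0] + 1:
        counts = child_rule_counts(rules, dim, space, num_cuts)
        sm = sum(counts) + num_cuts            # memory estimate sm(v, f)
        if sm <= alloc and (best is None or sm > best[0]):
            best = (sm, num_cuts, counts)
        num_cuts *= 2                          # assumption: powers of two only
    return None if best is None else (best[1], best[2])


def choose_dimension(rules, spaces: Sequence[Range],
                     alloc: float) -> Optional[Tuple[int, int]]:
    """Return (dimension, num_cuts) minimising the mean rules per child."""
    choice = None
    for dim, space in enumerate(spaces):
        found = best_cuts(rules, dim, space, alloc)
        if found is None:
            continue
        num_cuts, counts = found
        mean = sum(counts) / num_cuts
        if choice is None or mean < choice[0]:
            choice = (mean, dim, num_cuts)
    return None if choice is None else (choice[1], choice[2])
```

In a two-field example where four rules occupy disjoint quarters of field 0 and are wildcards in field 1, an allocation of 8 selects four cuts on field 0, because every cut on field 1 copies all rules into every child and exceeds the budget.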

Step 4. Packet classification and classifier data structure update:

Step 4.1. The processor used for packet classification obtains the packet header information, i.e. the five-tuple of the packet header, from the DRAM;

Step 4.2. Read the current tree node v_curr and determine the split dimension and number of cuts;

Step 4.3. Partition the current search space according to the split field and number of cuts given in Step 4.2, then determine the subspace in each partitioned field from the value of the corresponding field in the packet header;

Step 4.4. From the relative position of the subspace of Step 4.3 within the search space, determine the index of the subspace, then read the child nodes of the current tree node in sequence to find the child node corresponding to that subspace;

Step 4.5. If the child node obtained in Step 4.4 is a leaf node, go to Step 4.6; otherwise update the current search space to the subspace and the current tree node to the child node, and repeat Steps 4.2 to 4.4;

Step 4.6. Read in sequence the rule list stored in the child node of Step 4.4 and perform range matching on every field; the highest-priority rule that matches all fields of the packet header is the result of the linear search, i.e. the rule the packet header finally matches;

Step 4.7. Perform the operation on the packet specified by the action of the rule found by the linear search.
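The lookup in Steps 4.1 to 4.7 can be sketched as a walk down the tree followed by a linear scan of the leaf. This is an illustration only: `Rule`, `Node` and `classify` are hypothetical names, the size of the current interval is assumed to be a multiple of the number of cuts, and a smaller `priority` value is assumed to mean higher priority.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]  # inclusive [low, high]


@dataclass
class Rule:
    ranges: Sequence[Range]  # one interval per header field
    priority: int            # assumption: smaller value = higher priority
    action: int


@dataclass
class Node:
    dim: int = -1
    num_cuts: int = 0
    children: List["Node"] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)  # used by leaves only


def classify(root: Node, header: Sequence[int],
             space: Sequence[Range]) -> Optional[Rule]:
    node, current = root, list(space)
    while node.children:                          # Steps 4.2 to 4.5
        low, high = current[node.dim]
        step = (high - low + 1) // node.num_cuts  # width of one subspace
        index = (header[node.dim] - low) // step  # subspace holding the value
        child_low = low + index * step
        current[node.dim] = (child_low, child_low + step - 1)
        node = node.children[index]
    best = None
    for rule in node.rules:                       # Step 4.6: linear search
        if all(a <= v <= b for v, (a, b) in zip(header, rule.ranges)):
            if best is None or rule.priority < best.priority:
                best = rule
    return best                                   # Step 4.7 applies best.action
```

Because children are stored sequentially, the child is found by index arithmetic rather than by comparing against each child's interval.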

The traffic-statistics-based packet classification method fully combines the statistical (dynamic) characteristics of network traffic with the structural (static) characteristics of the rule set, and provides a complete design method for packet classification systems suited to a variety of complex network environments. The method periodically samples the traffic of networks of different types and protocols, normalizes it and converts it to a prior, introducing the statistical characteristics of the traffic into the classifier data structure in the form of prior probabilities. This achieves a reasonable balance between space (memory requirements) and time (classification rate), further optimizes packet classification, and improves the performance of classification equipment across network environments and applications. System simulation results show that, compared with the best packet classification methods currently recognized at home and abroad, the invention improves the average classification rate by 80% to 400% and reduces memory requirements by 30% to 600%. More importantly, the method adapts by itself to different rule sets and network traffic and shows far better stability than other classification methods across many network environments and applications, giving packet classification equipment good flexibility and practicality.

Table 1 compares the memory usage of this method with that of the RFC [P. Gupta and N. McKeown, "Packet classification on multiple fields," Proc. ACM SIGCOMM 99, 1999], ABV [F. Baboescu and G. Varghese, "Scalable Packet Classification," Proc. ACM SIGCOMM, 2001] and HiCuts [P. Gupta and N. McKeown, "Packet classification using hierarchical intelligent cuttings," Proc. Hot Interconnects, 1999] methods. On all rule sets this method shows a clear advantage. In search time, this method is also of the same order of magnitude as the best current algorithms.

Table 1. Comparison of the memory usage of this method with the RFC, ABV and HiCuts methods (unit: KB)

Rule sets (rows, in order): FW1, FW2, CR1, CR2, SN1
RFC: 8169109662,220∞*
ABV: 6.234.81,0773,1572,435
HiCuts: 281291004,235789
This method: 2357602,031473

Note: in the table, FW1 and FW2 are real firewall rule sets, CR1 and CR2 are real core router rule sets, and SN1 is a synthetic rule set.

By further mining the data-structure characteristics of network control rule sets and combining several heuristics at different levels and stages of the classification problem, the method mitigates the problem that the performance of a single heuristic is strongly affected by changes in the data structure, thereby lowering the worst-case complexity of packet classification. Traffic statistics are introduced to improve the average classification efficiency. Traffic statistics can be viewed as a process of learning the structure and characteristics of a specific network. Introducing their results into the heuristics in the form of prior probabilities gives the results the character of a Bayesian classifier, so the method can adapt to the specific network structure and traffic characteristics and improve average classification efficiency.

At present, high-end policy routers at gigabit rates and above all use dedicated-chip solutions such as ASICs and FPGAs. Because equipment based on dedicated chips has a long development cycle, large size and power consumption, and is difficult to update, the cost-effectiveness of such products is very low, which limits the widespread use of high-performance packet classification equipment. The multi-domain packet classification method proposed by the invention, by contrast, can be implemented on multiple platforms, including general-purpose platforms based on a microprocessor (CPU) and dedicated platforms based on a network processor (NPU), while ensuring good performance and adaptability to different network applications. The complete software and hardware system can therefore be supplied to manufacturers as a core module for multi-domain packet classification, improving the performance of classification equipment built on general-purpose processors, greatly reducing the production cost of high-end policy routing and firewall products, and thereby promoting and accelerating the deployment and operation of the next-generation Internet.

Description of drawings

Figure 1. An example of the decision tree structure of the invention, in which:

————→ denotes a *next pointer

…………… denotes sequential storage

Figure 2. Schematic block diagram of the system of the invention.

Figure 3. Flow chart of the classification method.

Figure 4. Flow chart of the packet classification procedure.

Detailed description

The hardware structure of the present invention is described below with reference to Fig. 2:

A. Receiving unit

Main task: receive network packets, parse the packet header (five domains), and buffer the header information and the packet contents in the processing queue.

Related equipment: network card, buffer memory (DRAM).

Input and output: arriving network packets are input from the network card; header information and packet contents are output to the buffer queue.

B. Sampling unit

Main task: collect statistics on the header information of arriving network packets according to the sampling interval, normalize the statistics at each update time, and provide the computing unit with the prior distribution of the traffic. (Note: the sampling interval means that one sample is taken per N arriving packets, i.e. each sample records the header information of one of N consecutively arriving packets; N is typically 1 to 1000, depending on device throughput. The update time is the moment at which the rule base is modified or the update cycle elapses; the update cycle is typically set to one day or one week, depending on how the traffic characteristics change.)

Related equipment: hard disk, processor.

Input and output: packet header information is read from the buffer (according to the sampling interval); the traffic prior distribution is output to the computing unit (at the update time).

C. Computing unit

Main task: set the classification rules and update the classifier data structure.

Related equipment: processor (CPU), high-speed storage (SRAM or cache).

Input and output: the prior distribution of the network traffic statistics is input from the sampling unit and the rule set is input from the control unit; the classifier data structure and the normalized rule set are output to the classification unit. (Note: the classifier data structure is a decision tree whose leaf nodes hold linear lists storing rule subsets; rule normalization means describing rules that contain masks or wildcards as uniform five-domain intervals.)
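The rule normalization mentioned in the note above can be illustrated with a minimal sketch. It assumes rules arrive as IPv4 prefixes; the type `Range32` and the helper `prefix_to_range` are hypothetical names introduced here for illustration and do not appear in the patent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type (not from the patent): closed interval [lo, hi]. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} Range32;

/* Convert an IPv4 prefix (address plus prefix length 0..32) into an interval.
 * A length of 0 is the wildcard "any" and yields [0, 2^32 - 1]. */
static Range32 prefix_to_range(uint32_t addr, unsigned len)
{
    assert(len <= 32);
    /* Shifting a 32-bit value by 32 is undefined in C, so len == 0 is special-cased. */
    uint32_t mask = (len == 0) ? 0u : (uint32_t)(0xFFFFFFFFu << (32 - len));
    Range32 r;
    r.lo = addr & mask;
    r.hi = r.lo | (uint32_t)~mask;
    return r;
}
```

Under this sketch a rule field such as 166.111.* becomes the interval [166.111.0.0, 166.111.255.255], which is the form the Rule structure in section B1 stores.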

D. Classification unit

Main task: high-speed network packet classification. The classification result is obtained by searching the classifier with the packet header information (five-tuple).

Input and output: for packet classification, the classification unit obtains packet header information from the input buffer queue and sends the classification result to the output unit; for data updates, the classification unit reads the new classifier data structure and the new rule set from the computing unit.

E. Control unit

Main task: set the classification rules and the adjustable system parameters (such as the update cycle, the sampling interval, and the adjustable parameters of the classifier data structure).

Related equipment: computer peripherals such as monitor, keyboard and mouse.

Input and output: the administrator inputs the rule set and system parameters to the control unit through the peripherals; the control unit outputs the rule set and system parameters to the computing unit.

F. Sending unit

Main task: send the network packets in the output queue according to the classification results.

Input and output: the classification result is read from the computing unit and used to decide whether the corresponding packet in the output buffer queue is sent or discarded.

Next, the software method adopted by the present invention is described in detail:

The software method consists of three parts: the method for collecting network traffic statistics, the method for updating the classifier data structure, and the method for looking up network packets. The three methods are described in the order in which the system is built:

A. The network traffic statistics method is carried out in the sampling unit in three successive steps:

A1. Sampling:

According to the sampling interval Ns, the header information (five-tuple) of one packet is extracted from every Ns consecutively arriving packets in the input queue as one sample. The data structure of each sample is:

Header
{
    int32 sIP, dIP; // 32-bit source and destination addresses
    int16 sPort, dPort; // 16-bit source and destination ports
    int8 protocol; // 8-bit protocol
};
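Step A1 can be sketched as a simple counter over arriving packets. The patent does not say which of the Ns packets is recorded; this sketch assumes the last one of each window. `Sampler` and `sampler_offer` are hypothetical names, and fixed-width types from `<stdint.h>` stand in for the int32/int16/int8 notation above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sIP, dIP;      /* 32-bit source and destination addresses */
    uint16_t sPort, dPort;  /* 16-bit source and destination ports */
    uint8_t  protocol;      /* 8-bit protocol */
} Header;

/* Hypothetical sampler state: one sample per window of ns packets. */
typedef struct {
    unsigned ns;       /* sampling interval Ns (>= 1) */
    unsigned arrived;  /* packets seen in the current window */
} Sampler;

/* Offer one arriving header. Returns 1 and copies *h to *out when this packet
 * completes a window and is therefore taken as the sample; returns 0 otherwise. */
static int sampler_offer(Sampler *s, const Header *h, Header *out)
{
    assert(s->ns >= 1);
    s->arrived++;
    if (s->arrived < s->ns)
        return 0;
    s->arrived = 0;  /* window complete: its last packet is the sample */
    *out = *h;
    return 1;
}
```

With Ns = 1 every packet is sampled; larger values reduce the sampling cost on high-throughput devices at the price of coarser statistics.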

A2. Recording:

The statistics comprise the number of occurrences in the samples of each class B network segment (the first 16 bits of the 32-bit IP address) among the source and destination addresses, the number of occurrences of each source and destination port, and the number of occurrences of each protocol. The data structure of the statistics is:

Stats
{
    int32 sIP[65536], dIP[65536]; // counters for each class B network segment
    int32 sPort[65536], dPort[65536]; // counters for each port
    int32 protocol[256]; // counters for each protocol
};

Example: if a sample is {166.111.0.1 (destination address), 162.105.0.1 (source address), 80 (destination port), 3000 (source port), 17 (protocol)}, the statistics change as follows:

Stats.dIP[42607]++; // 166*256+111 = 42607
Stats.sIP[41577]++; // 162*256+105 = 41577
Stats.dPort[80]++;
Stats.sPort[3000]++;
Stats.protocol[17]++;
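The recording step can be written as one small function. This is a sketch under the assumption that addresses are held as host-order 32-bit integers, so that the class B segment index is simply the upper 16 bits; `stats_record` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sIP, dIP;
    uint16_t sPort, dPort;
    uint8_t  protocol;
} Header;

typedef struct {
    uint32_t sIP[65536], dIP[65536];      /* per class B segment */
    uint32_t sPort[65536], dPort[65536];  /* per port */
    uint32_t protocol[256];               /* per protocol */
} Stats;

/* Count one sampled header. The class B index is the top 16 bits of the
 * address, e.g. 166.111.x.x -> 166*256 + 111 = 42607. */
static void stats_record(Stats *st, const Header *h)
{
    assert(st != 0 && h != 0);
    st->sIP[h->sIP >> 16]++;
    st->dIP[h->dIP >> 16]++;
    st->sPort[h->sPort]++;
    st->dPort[h->dPort]++;
    st->protocol[h->protocol]++;
}
```

The Stats structure occupies about 1 MB, so in practice it would be allocated statically or on the heap rather than on the stack.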

A3. Normalization:

When the classifier data structure is updated, the Stats structure is normalized to obtain the prior distribution Prior, whose data structure is a two-dimensional array:

float Prior[5][65536];

where Prior[i][j] (i = 0, ..., 4; j = 0, ..., 65535) is the normalized statistic of each domain, and i = 0, ..., 4 correspond to the source address, destination address, source port, destination port and protocol respectively. For example, Prior[3][80] is the prior probability that the destination port is 80, computed as:

$Prior[3][80] = \frac{Stats.dPort[80]}{\sum_{k = 0}^{65535} Stats.dPort[k]}$

Note: for the protocol-domain prior Prior[4][j], Prior[4][j] = 0 when j >= 256.
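Normalizing one row of Prior from one counter array might look like the sketch below. The function name `normalize_row` is hypothetical, and the behaviour for a counter array that is all zero (the row is left at zero) is an assumption; the patent does not address that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill row[0..row_len-1] with counts[j] / sum(counts) for j < n and with 0
 * for j >= n. For the protocol domain n is 256 and row_len is 65536, which
 * reproduces Prior[4][j] = 0 for j >= 256. */
static void normalize_row(const uint32_t *counts, size_t n,
                          float *row, size_t row_len)
{
    assert(n <= row_len);
    uint64_t total = 0;  /* 64-bit sum so many 32-bit counters cannot overflow */
    for (size_t k = 0; k < n; k++)
        total += counts[k];
    for (size_t j = 0; j < row_len; j++) {
        if (j < n && total > 0)
            row[j] = (float)((double)counts[j] / (double)total);
        else
            row[j] = 0.0f;
    }
}
```

A caller would invoke it once per domain, for example with Stats.dPort and Prior[3] for the destination port.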

B. The classifier data structure update method is carried out in the computing unit:

The core of the classifier data structure is a decision tree. The decision tree partitions the search space level by level, progressively narrowing the search range and the number of candidate rules, so that a search of the whole rule set becomes a search of one of its subsets. When the number of rules in the subset being searched falls below the threshold Thresh (Thresh = 8 to 10), the subset is searched linearly to obtain the final classification result.

B1. Definitions:

Search space U: the space of all possible values of the multi-domain packet header. For five-tuple classification, the initial search space is {[0, 2³²-1], [0, 2³²-1], [0, 2¹⁶-1], [0, 2¹⁶-1], [0, 2⁸-1]};

Space partition: dividing the value range of one or more specified domains into a set of sub-ranges yields a partition of the current search space. For example, bisecting the initial search space on the first domain (source address) gives two search subspaces U1 = {[0, 2³¹-1], [0, 2³²-1], [0, 2¹⁶-1], [0, 2¹⁶-1], [0, 2⁸-1]} and U2 = {[2³¹, 2³²-1], [0, 2³²-1], [0, 2¹⁶-1], [0, 2¹⁶-1], [0, 2⁸-1]}.

Input rule set R: each input rule comprises the range description of the five domains, the rule priority (when a packet matches several rules, the rule with the highest priority is taken) and the classification result. The rule set is denoted R and each rule is denoted R_i. The rule data structure is defined as follows:

Rule {
    int32 sIP[2], dIP[2]; // each address interval is described by two 32-bit integers
    int16 sPort[2], dPort[2]; // each port interval is described by two 16-bit integers
    int8 protocol[2]; // each protocol interval is described by two 8-bit integers
    int32 priority; // rule priority
    int8 action; // classification result
};

Decision tree node data structure:

The data structure of each node of the decision tree is as follows:

TREENODE {
    unsigned char dimToCut; // dimension to cut
    unsigned char numCuts; // number of cuts
    char nodeInfo; // node information: 0 for an internal node, the number of rules (1 to Thresh) for a leaf node
    void *next; // jump pointer: for an internal node it points to the head of the lower-level nodes;
                // for a leaf node it points to the linear list of rule IDs
};

The decision tree structure is shown in Fig. 1.

B2. Space allocation function:

The space allocation function used in HiCuts is given first:

SpaceAlloc(v) = N * spfac

where v is the current node, N is the number of rules contained in the current node, and spfac is the space factor, a constant usually set to 1 to 4. A larger spfac means more memory is available, so the current search space can be decomposed more finely.

Two different forms of space allocation function can be used in this method; they are given below:

Form I:

$SpaceAlloc(v) = N * spfac(v) = N * (spfac_{min} + (P_{v}^{l} - \min(P^{l})) * K)$

$P_{v}^{l} = \frac{1}{F} \sum_{f = 0}^{F} \sum_{i = a_{f}}^{b_{f}} Prior[f][i]$

When max(P^l) - min(P^l) > 0: K = (spfac_max - spfac_min) / (max(P^l) - min(P^l)); when max(P^l) - min(P^l) = 0: K = 0,

where N is the number of rules of the current node; l is the depth of the node in the decision tree; [a_f, b_f] is the interval in domain f covered by the search space of node v (v denotes a decision tree node); F = 5 is the number of domains; P_v^l is called the prior probability of node v; max(P^l) and min(P^l) are the maximum and minimum prior probabilities over all nodes at level l of the decision tree; spfac_min and spfac_max bound the value of spfac(v) and are generally chosen within 1 to 4.
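Form I amounts to a linear map from the node prior onto the interval [spfac_min, spfac_max]. The sketch below assumes the node prior and the per-level minimum and maximum have already been computed; the function names are hypothetical.

```c
#include <assert.h>

/* Form I space factor: nodes with the smallest prior at their level get
 * spfac_min, nodes with the largest prior get spfac_max, and the rest are
 * interpolated linearly. K is 0 when all priors at the level are equal. */
static double spfac_form1(double p_v, double p_min, double p_max,
                          double spfac_min, double spfac_max)
{
    assert(p_min <= p_v && p_v <= p_max);
    double k = (p_max - p_min > 0.0)
                   ? (spfac_max - spfac_min) / (p_max - p_min)
                   : 0.0;
    return spfac_min + (p_v - p_min) * k;
}

/* SpaceAlloc(v) = N * spfac(v). */
static double space_alloc_form1(int n_rules, double p_v, double p_min,
                                double p_max, double spfac_min,
                                double spfac_max)
{
    return n_rules * spfac_form1(p_v, p_min, p_max, spfac_min, spfac_max);
}
```

For example, with spfac_min = 1, spfac_max = 4 and priors ranging from 0.25 to 0.75 at a level, a node with prior 0.5 receives a factor of 2.5, so a node holding 10 rules is allotted 25 units.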

Form II:

$SpaceAlloc(v) = N * spfac(v)$

$= N * BOUND(P_{v}^{l} * D^{l} * spfac_{avg}, spfac_{min}, spfac_{max})$

$spfac_{avg} = \frac{1}{2} (spfac_{min} + spfac_{max})$

where D^l is the number of nodes at level l of the decision tree, and BOUND(a, b, c) returns a when b ≤ a ≤ c, b when a < b, and c when a > c.

As the two definitions show, this method differs from HiCuts in that the space factor spfac is no longer a constant but is replaced by the function spfac(v). Both forms of the space allocation function are proportional to the number of rules of the current node; they differ only in how they constrain spfac(v) to the interval [spfac_min, spfac_max].
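Form II can be sketched with a clamp function. The names `bound` and `spfac_form2` are hypothetical, and the node prior and the node count of the level are assumed to be supplied by the caller.

```c
#include <assert.h>

/* BOUND(a, lo, hi): clamp a into the interval [lo, hi]. */
static double bound(double a, double lo, double hi)
{
    assert(lo <= hi);
    if (a < lo)
        return lo;
    if (a > hi)
        return hi;
    return a;
}

/* Form II space factor: scale the average factor by P_v * D, where D is the
 * number of nodes at the node's level, then clamp to [spfac_min, spfac_max]. */
static double spfac_form2(double p_v, int nodes_at_level,
                          double spfac_min, double spfac_max)
{
    double spfac_avg = 0.5 * (spfac_min + spfac_max);
    return bound(p_v * nodes_at_level * spfac_avg, spfac_min, spfac_max);
}
```

A node whose prior times the node count equals 1 receives exactly spfac_avg, while nodes with much more or much less traffic saturate at the upper or lower bound.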

B3. Selection of the cut dimension and the number of cuts

Once the space allocation function has determined the available memory, this method determines the number of cuts and the cut dimension in the same way as HiCuts: first the maximum number of cuts is determined for every candidate dimension (the five domains of the five-tuple), and then the cut dimension is chosen according to the dimension selection criterion.

First, the memory usage is estimated:

$sm(v, f) = \sum_{i = 1}^{numCuts(v, f)} numRules(v_{i}) + numCuts(v, f)$

where f is the dimension being cut, v_i is a child node of v, and numRules(v) is the number of rules belonging to node v. The optimal number of cuts numCuts(v, f) on domain f is obtained by maximizing the memory estimate sm(v, f) subject to the constraint sm(v, f) ≤ SpaceAlloc(v).

Once the optimal number of cuts on each domain has been determined, the search space of domain f can be divided evenly into numCuts(v, f) parts (f = 1, ..., F). Suppose N_i rules fall into the i-th subspace (i = 1, ..., numCuts(v, f)); the f that minimizes

$\frac{1}{numCuts(v, f)} \sum_{i = 1}^{numCuts(v, f)} N_{i}$

is then chosen as the cut dimension of node v.
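The two quantities used in B3 can be sketched as follows, assuming the per-child rule counts for each candidate dimension have already been computed. The function names are hypothetical, and breaking ties toward the lowest dimension index is an assumption not stated in the patent.

```c
#include <assert.h>
#include <stddef.h>

/* sm(v, f): rules summed over all children plus the number of cuts. */
static unsigned long memory_estimate(const unsigned *child_rules,
                                     size_t num_cuts)
{
    unsigned long sm = (unsigned long)num_cuts;
    for (size_t i = 0; i < num_cuts; i++)
        sm += child_rules[i];
    return sm;
}

/* Mean number of rules per child for one candidate dimension. */
static double mean_rules(const unsigned *child_rules, size_t num_cuts)
{
    assert(num_cuts > 0);
    unsigned long total = 0;
    for (size_t i = 0; i < num_cuts; i++)
        total += child_rules[i];
    return (double)total / (double)num_cuts;
}

/* Return the index of the dimension with the smallest mean rules per child.
 * child_rules[f] holds the N_i of dimension f; num_cuts[f] is numCuts(v, f). */
static size_t choose_dimension(const unsigned *child_rules[],
                               const size_t num_cuts[], size_t num_dims)
{
    assert(num_dims > 0);
    size_t best = 0;
    double best_mean = mean_rules(child_rules[0], num_cuts[0]);
    for (size_t f = 1; f < num_dims; f++) {
        double m = mean_rules(child_rules[f], num_cuts[f]);
        if (m < best_mean) {
            best_mean = m;
            best = f;
        }
    }
    return best;
}
```

A dimension that spreads the rules thinly over its children wins, since the lookup then continues with fewer candidate rules per child.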

C. The network packet lookup method is carried out in the classification unit, with the following steps:

Step 1: Read the five-tuple of the packet header H, for example {source address = 166.111.8.28, destination address = 162.105.38.12, source port = 2000, destination port = 80, protocol = 17}. Initialize the current search space S_curr to the global search space, for example S_curr = {{0, 2³²-1}, {0, 2³²-1}, {0, 2¹⁶-1}, {0, 2¹⁶-1}, {0, 2⁸-1}}.

Step 2: Read the current tree node v_curr (the first node is the root node v_root) and determine the cut domain DimToCut and the number of cuts NumToCut.

Step 3: Partition the current search space S_curr according to DimToCut and NumToCut. For example, if the destination port domain is cut 64 times, the search space is divided into 64 subspaces on the destination port domain, of which the first is {{0, 2³²-1}, {0, 2³²-1}, {0, 2¹⁶-1}, {0, 1023}, {0, 2⁸-1}}, the second is {{0, 2³²-1}, {0, 2³²-1}, {0, 2¹⁶-1}, {1024, 2047}, {0, 2⁸-1}}, ..., and the 64th is {{0, 2³²-1}, {0, 2³²-1}, {0, 2¹⁶-1}, {64512, 65535}, {0, 2⁸-1}}. Then determine the subspace S_next to which H belongs from the value of the corresponding domain in H. For example, the destination port in H is 80; since 0 <= 80 <= 1023, the packet belongs to the first subspace {{0, 2³²-1}, {0, 2³²-1}, {0, 2¹⁶-1}, {0, 1023}, {0, 2⁸-1}}.

Step 4: Determine the index of the subspace containing H from the relative position of S_next within S_curr (the subspace containing 80 has index 1), read the child nodes of the current node in order, and determine the child node v_next corresponding to H (here the first child of the current node).

Step 5: If v_next is a leaf node, go to step 6; otherwise set S_curr = S_next and v_curr = v_next, and repeat steps 2 to 4.

Step 6: Read the rule list RuleList stored in v_next and search it linearly for H (that is, perform range matching on every domain). The rule that matches H in all domains and has the highest priority (denoted R_opt) is the result of the linear search, i.e. the rule that H finally corresponds to.

Step 7: Perform the operation on the packet to which H belongs according to the Action attribute of R_opt, for example discarding or forwarding it.

Note on the linear list lookup: after a leaf node is reached, the rule ID list and the number of rules stored in the leaf node are read, and the rules are then read one by one and range-compared with the packet header. Since the rule ID represents the rule priority and the rule IDs are stored in the leaf node in order from highest to lowest priority, only the first matching rule needs to be returned. For example, the packet {source address = 166.111.8.28, destination address = 162.105.38.12, source port = 2000, destination port = 80, protocol = 17} matches the rule {166.111.*, 162.105.*, any, 80, UDP}.
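The child selection of steps 3 and 4 and the linear search of step 6 can be sketched as below. The helper names are hypothetical, the child index is 0-based (so the "first subspace" of the text is index 0), and the cut is assumed to divide the range evenly, as in the 64-way port example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t sIP, dIP;
    uint16_t sPort, dPort;
    uint8_t  protocol;
} Header;

typedef struct {
    uint32_t sIP[2], dIP[2];      /* [low, high] address intervals */
    uint16_t sPort[2], dPort[2];  /* [low, high] port intervals */
    uint8_t  protocol[2];         /* [low, high] protocol interval */
    int32_t  priority;
    int8_t   action;
} Rule;

/* 0-based index of the equal-width sub-range of [lo, hi] containing value. */
static unsigned child_index(uint64_t value, uint64_t lo, uint64_t hi,
                            unsigned num_cuts)
{
    assert(num_cuts > 0 && lo <= value && value <= hi);
    uint64_t width = (hi - lo + 1) / num_cuts;  /* assumes an even split */
    assert(width > 0);
    return (unsigned)((value - lo) / width);
}

static int in_range(uint64_t v, uint64_t lo, uint64_t hi)
{
    return lo <= v && v <= hi;
}

/* A rule matches only if every domain of the header lies in its interval. */
static int rule_matches(const Rule *r, const Header *h)
{
    return in_range(h->sIP, r->sIP[0], r->sIP[1]) &&
           in_range(h->dIP, r->dIP[0], r->dIP[1]) &&
           in_range(h->sPort, r->sPort[0], r->sPort[1]) &&
           in_range(h->dPort, r->dPort[0], r->dPort[1]) &&
           in_range(h->protocol, r->protocol[0], r->protocol[1]);
}

/* ids[] is the leaf's rule ID list, highest priority first, so the first
 * match is the answer. Returns NULL when no rule in the leaf matches. */
static const Rule *linear_search(const Rule *rules, const int *ids, int n,
                                 const Header *h)
{
    for (int i = 0; i < n; i++)
        if (rule_matches(&rules[ids[i]], h))
            return &rules[ids[i]];
    return NULL;
}
```

Because a leaf holds at most Thresh rules, the linear search touches only a handful of rules per packet.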

Conclusion

Application prospects

At present, high-end policy routers at gigabit rates and above all rely on dedicated-chip solutions such as ASICs and FPGAs. Devices built on dedicated chips have long development cycles, large size and power consumption, and are difficult to update, so their cost-effectiveness is low, which limits the wide deployment of high-performance network packet classification equipment. The multi-domain network packet classification method proposed in this patent can be implemented on multiple platforms, including general-purpose platforms based on a microprocessor (CPU) and dedicated platforms based on a network processor (NPU), while still delivering good performance and adapting to different network applications. The complete software and hardware system can therefore be supplied to vendors as the core packet classification module, improving the performance of classification equipment built on general-purpose processor platforms and substantially reducing the production cost of high-end policy routing and firewall products, thereby promoting and accelerating the deployment and operation of the next-generation Internet.

Claims

1. A multi-domain network packet classification method based on network flow, characterized in that the method is implemented on a general-purpose microprocessor platform or a dedicated network processor platform and comprises the following steps in sequence:

Step 1. The network packet receiving unit receives input network packets:

After the network card of this unit receives an arriving network packet and parses its header information, the header information and the packet contents are output to the buffer queue in the dynamic buffer memory (DRAM) of this unit;

Step 2. The sampling unit collects statistics on the network traffic characteristics of the packet headers output by the receiving unit at a set sampling interval, and normalizes the collected statistics at the update time to obtain the prior distribution of the traffic; step 2 comprises the following steps in sequence:

Step 2.1. Sampling:

According to the set sampling interval Ns, the processor in the sampling unit extracts one five-tuple header from every Ns consecutively arriving packets in the header queue of the input packets as one sample; each sample is a data set composed of the following five domains:

the 32-bit source IP address, denoted int32 sIP, where int denotes an integer type (the same below);

the 32-bit destination IP address, denoted int32 dIP;

the 16-bit source port, denoted int16 sPort;

the 16-bit destination port, denoted int16 dPort;

the 8-bit protocol, denoted int8 protocol;

Step 2.2. Recording statistics:

The statistics comprise:

the number of occurrences in the samples of each class B network segment among the source IP addresses, denoted int32 sIP[65536], where the class B network segment means the first 16 bits of the 32-bit IP address (the same below);

the number of occurrences in the samples of each class B network segment among the destination IP addresses, denoted int32 dIP[65536];

the number of occurrences in the samples of each source port, denoted int32 sPort[65536];

the number of occurrences in the samples of each destination port, denoted int32 dPort[65536];

the number of occurrences in the samples of each protocol, denoted int32 protocol[256];

Step 2.3. Normalization:

When the classifier data structure is updated, the statistics data structure is normalized to obtain the prior distribution, whose data structure is the following two-dimensional array:

float Prior[i][j],

where Prior[i][j] is the normalized statistic of each domain; i = 0, 1, 2, 3, 4 correspond respectively to the source IP address, destination IP address, source port, destination port and protocol domain; j = 0, ..., 65535;

For the source IP address:

$Prior[0][j] = \frac{Stats.sIP[j]}{\sum_{k = 0}^{65535} Stats.sIP[k]}$,

where Stats denotes the statistics (the same below);

For the destination IP address:

$Prior[1][j] = \frac{Stats.dIP[j]}{\sum_{k = 0}^{65535} Stats.dIP[k]}$,

For the source port:

$Prior[2][j] = \frac{Stats.sPort[j]}{\sum_{k = 0}^{65535} Stats.sPort[k]}$,

For the destination port:

$Prior[3][j] = \frac{Stats.dPort[j]}{\sum_{k = 0}^{65535} Stats.dPort[k]}$,

For the protocol domain:

$Prior[4][j] = \frac{Stats.protocol[j]}{\sum_{k = 0}^{65535} Stats.protocol[k]}$,

When 256 ≤ j ≤ 65535, Prior[4][j] = 0;

Step 3. The classification rules are set in the computing unit, which then outputs the updated classifier data structure and the normalized rule set; normalization means describing rules that contain masks or wildcards as intervals over the five-tuple:

The computing unit comprises a processor and a high-speed storage device; the input of this processor for the prior distribution of the network traffic statistics is connected to the corresponding output of the sampling unit, and the processor also has a rule set input;

The data structure of the classifier is a decision tree in which each leaf node stores a rule subset in a linear list. When updating the classifier data structure, the computing unit partitions the search space level by level, progressively narrowing the search range and the number of rules, so that a search of the whole rule set becomes a search of a subset of that rule set; when the number of rules in the searched subset is less than a set threshold, the current subset is searched by linear search over the linear list to obtain the final classification result; this threshold is denoted Thresh and is set to 1 to 10;

The classifier data structure update method comprises the following steps in sequence:

Step 3.1. Initialization:

The search space is set to U, i.e. the space of all possible values of the multi-domain packet header; the initial search space of the five-tuple is, in order: [0, 2³²-1], [0, 2³²-1], [0, 2¹⁶-1], [0, 2¹⁶-1], [0, 2⁸-1];

A space partition is defined as follows: dividing the value range of one or more specified domains into a set of sub-ranges yields a partition of the current search space;

Step 3.2. The rule set and the system parameters are input to the computing unit through computer peripherals; the system parameters comprise the update cycle, the sampling interval, and adjustable parameters of the classifier data structure such as Thresh, spfac_min and spfac_max;

A rule in the rule set is denoted R and comprises:

the source IP address interval described by two 32-bit integers, denoted int32 sIP[2];

the destination IP address interval described by two 32-bit integers, denoted int32 dIP[2];

the source port interval described by two 16-bit integers, denoted int16 sPort[2];

the destination port interval described by two 16-bit integers, denoted int16 dPort[2];

the protocol interval described by two 8-bit integers, denoted int8 protocol[2];

the rule priority, denoted int32 priority;

the classification result, denoted int8 action;

Step 3.3. The decision tree node data structure, i.e. the data structure of the classifier, is set in the computing unit as follows; each node comprises:

the cut dimension,

the number of cuts,

the node information, whose value is 0 for an internal node and the number of rules for a leaf node, in the range 1 to Thresh, where Thresh is the threshold on the number of rules in a leaf node;

the jump pointer: for an internal node, this pointer points to the head of the lower-level nodes; for a leaf node, this pointer points to the linear list holding the rule IDs of the corresponding rule subset;

Step 3.4. A root node is created and all rules are assigned to it; the root node is denoted v_root;

Step 3.5. The root node is placed in the current queue Q₁;

Step 3.6. A node v_c is taken out of the current queue;

Step 3.7. It is determined whether the number of rules in node v_c is greater than the threshold T (i.e. Thresh);

Step 3.8.

If the number of rules in node v_c is less than or equal to the threshold T, node v_c is set as a leaf node;

If the number of rules in node v_c is greater than the threshold T, the next step is carried out;

Step 3.9. According to the prior probability of the current node v_c and the number of rules N in the rule subset of this node, memory space is allocated for the child nodes of v_c using either of the following two space allocation functions:

First form:

$SpaceAlloc(v_{c}) = N * spfac(v_{c}) = N * (spfac_{min} + (P_{v_{c}}^{l} - \min(P^{l})) * K)$

where

$P_{v_{c}}^{l} = \frac{1}{F} \sum_{i = 0}^{F} \sum_{j = a_{i}}^{b_{i}} Prior[i][j]$

is the prior probability of the traffic at node v_c;

when max(P^l) - min(P^l) > 0: K = (spfac_max - spfac_min) / (max(P^l) - min(P^l));

when max(P^l) - min(P^l) = 0: K = 0,

where

N is the number of rules of the current node;

l is the depth of the node in the decision tree;

[a_i, b_i] is the interval in domain i covered by the search space of node v_c;

max(P^l) and min(P^l) are the maximum and minimum prior probabilities over all nodes at level l of the decision tree;

spfac_min and spfac_max bound the value of spfac(v_c) and are generally set to spfac_min = 1 and spfac_max = 4; F = 5 denotes the five domains;

Second form:

$SpaceAlloc(v_{c}) = N * spfac(v_{c}) = N * BOUND(P_{v_{c}}^{l} * D^{l} * spfac_{avg}, spfac_{min}, spfac_{max})$,

$BOUND(a, b, c) = \begin{cases} a & b \leq a \leq c \\ b & a < b \\ c & a > c \end{cases}$

$spfac_{avg} = \frac{1}{2} (spfac_{min} + spfac_{max})$

where D^l is the number of nodes at level l of the decision tree;

Step 3.10. Determining the cut dimension and the number of cuts:

First, the memory usage of node v is computed with the following formula:

$sm(v, f) = \sum_{i = 1}^{numCuts(v, f)} numRules(v_{i}) + numCuts(v, f)$

where f is the dimension being cut, v_i is a child node of v, and numRules(v) is the number of rules belonging to node v. The optimal number of cuts numCuts(v, f) on domain f is obtained by maximizing the memory estimate sm(v, f) subject to the constraint sm(v, f) ≤ SpaceAlloc(v);

Once the optimal number of cuts on each domain has been determined, the search space of domain f can be divided evenly into numCuts(v, f) parts, f = 1, ..., F; suppose N_i rules fall into the i-th subspace, i = 1, ..., numCuts(v, f); the f that minimizes

$\frac{1}{numCuts(v, f)} \sum_{i = 1}^{numCuts(v, f)} N_{i}$

is chosen as the cut dimension of node v;

Step 4. Network packet classification and classifier data structure update:

Step 4.1. The processor used for packet classification obtains the packet header information, i.e. the five-tuple of the packet header, from the DRAM;

Step 4.2. The current tree node v_curr is read, and the cut dimension and the number of cuts are determined;

Step 4.3. The current search space is partitioned according to the cut domain and the number of cuts given in step 4.2, and the subspace to which the packet belongs after partitioning is determined from the value of the corresponding domain in the packet header;

Step 4.4. The index of that subspace is determined from the relative position of the subspace of step 4.3 within the search space, the child nodes of the current tree node are read in order, and the child node corresponding to that subspace is determined;

Step 4.5. If the child node finally obtained in step 4.4 is a leaf node, go to step 4.6; otherwise, the current search space is updated to the subspace and the current tree node is updated to the child node, and steps 4.2 to 4.4 are repeated;

Step 4.6. The rule list stored in the child node of step 4.4 is read in order and range matching is performed on every domain; the rule that matches all domains of the packet header and has the highest priority is the result of the linear search, i.e. the rule finally matched by the packet header;

Step 4.7. The operation on this packet is performed according to the handling specified by the rule obtained from the linear search.