CN111581489B

CN111581489B - Storage space optimized sampling method based on shared counting tree

Info

Publication number: CN111581489B
Application number: CN202010438372.8A
Authority: CN
Inventors: 杨武; 玄世昌; 王巍; 苘大鹏; 吕继光; 唐德志
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2023-03-24
Anticipated expiration: 2040-05-22
Also published as: CN111581489A

Abstract

The invention belongs to the technical field of traffic sampling, and in particular relates to a storage space optimized sampling method based on a shared counting tree. The invention aims to save the storage space of sampling equipment. Specifically, the method comprises: determining whether to sample an incoming data packet according to a sampling judgment mechanism; if the packet is to be sampled, retrieving the flow node to which the packet belongs in a hash flow tracking table; if no such flow node is found, creating a new flow node for the packet in the flow tracking table; when sampling of a flow ends, restoring the characteristic values of the flow stored in its flow node and in the shared counting tree set, and importing them into an ordered flow characteristic record buffer; and writing the sampled flow characteristic records to a file when the buffer is full.

Description

Storage space optimized sampling method based on shared counting tree

Technical Field

The invention belongs to the technical field of flow sampling, and particularly relates to a storage space optimized sampling method based on a shared counting tree.

Background

In recent years, the variety and number of Internet applications have expanded significantly. To cope with the effect of changing applications on the network, network managers need to measure the application characteristics of traffic, which requires classifying the traffic during measurement. To support application classification, sampled traffic should retain sufficient application characteristics. A session of a modern application typically consists of multiple flows, which may share the same source IP but have different destination IPs. The more flows of an application session are sampled, the more application characteristics the sampled traffic retains, which in turn helps a machine learning algorithm identify the application of the sampled traffic accurately. RelSamp samples only the flows whose source IPs fall within a certain range. With the effective sampling ratio held constant, RelSamp can increase the number of flows captured per application session by raising the flow sampling probability and lowering the packet sampling probability, thereby preserving more of the application characteristics of the sampled traffic. However, for any statistical characteristic of a flow, such as the flow size, RelSamp needs to assign a counter to each flow. Every counter is allocated the same amount of space, and the counting range must cover the count of the largest flow. Network traffic follows a heavy-tailed distribution: a small proportion of large flows accounts for a large proportion of the traffic. Studies show that, when flows are ordered by size, the top 15% of flows account for 95% of the total traffic.
Recording the size of each flow with counters of identical width therefore inevitably wastes a great deal of storage on the flow sampling device. Allocating equally sized counters to record other statistical characteristics (e.g., the number of FIN, SYN, and ACK packets arriving in the flow during sampling) wastes storage in the same way. When RelSamp is deployed in a high-speed network with a very large number of concurrent flows, it places heavy storage pressure on the flow sampling device.

Disclosure of Invention

The invention aims to provide a storage space optimized sampling method based on a shared count tree, which saves the storage space of sampling equipment.

The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:

Step 1: according to the preconfigured effective sampling ratio p_e, the source IP sampling probability p_h and the target flow sampling probability, determine the output packet sampling probability p_p and the current flow sampling probability p_f;

Step 2: take a data packet from the packet buffer queue and assign it two random numbers r_f and r_p, each in the range [0,1];

Step 3: obtain the source IP of the packet and calculate its hash value;

Step 4: multiply the hash value of the source IP by the source IP sampling probability p_h to obtain a target value;

Step 5: if the target value is within the preset range, execute step 6; otherwise, discard the packet;

Step 6: search for the flow node to which the packet belongs;

if no flow node for the packet is found and r_f ≤ p_f, sample the packet, create a flow node and a flow characteristic storage unit for it, and update the flow characteristic storage unit of the flow;

if the flow node to which the packet belongs is found and r_p ≤ p_p, sample the packet and update the flow characteristic storage unit of the flow;

if no flow node for the packet is found and r_f > p_f, or the flow node is found and r_p > p_p, discard the packet;

Step 7: when sampling of a flow ends, restore the flow characteristics stored in the flow node and the flow statistical characteristics stored in the shared count tree into a complete flow record, and add the flow record to the flow record buffer queue; write the sampled flow characteristic records to a file when the buffer is full.

The present invention may further comprise:

The specific steps of step 1, in which the output packet sampling probability p_p and the current flow sampling probability p_f are determined from the preconfigured effective sampling ratio p_e, source IP sampling probability p_h and target flow sampling probability, are as follows:

Step 1.1: input the preconfigured effective sampling ratio p_e, source IP sampling probability p_h and target flow sampling probability;

Step 1.2: initialize the packet sampling probability p_p and the current flow sampling probability p_f:

p_p = p_e / p_h

p_f = p_e / p_h

Step 1.3: let

Step 1.4: obtain the current sampling ratio p_x;

Step 1.5: if

then output the packet sampling probability p_p and the current flow sampling probability p_f and end the calculation; otherwise, execute step 1.6; where α is the preset accuracy;

Step 1.6: if |p_x - p_e| ≤ α, return to step 1.3; otherwise, execute step 1.7;

Step 1.7: if the current sampling ratio p_x is greater than the preconfigured effective sampling ratio p_e, let p_p = 0.5 * (p_p + t), where t = 0.00001, and return to step 1.6; otherwise, let p_p = 0.5 * (p_p + 1) and return to step 1.6.
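The update rule of step 1.7 can be illustrated with a minimal Python sketch. It covers only that one step: how the current sampling ratio p_x is measured, and the updates of steps 1.3 and 1.5, are not shown. The function name `adjust_packet_probability` is hypothetical.

```python
def adjust_packet_probability(p_p: float, p_x: float, p_e: float,
                              t: float = 0.00001) -> float:
    """One halving update of the packet sampling probability (step 1.7)."""
    if p_x > p_e:
        # Measured ratio is above the configured ratio:
        # move p_p halfway toward the small lower bound t.
        return 0.5 * (p_p + t)
    # Otherwise move p_p halfway toward 1.
    return 0.5 * (p_p + 1.0)
```

Each call halves the distance between p_p and one end of the interval [t, 1], which is what makes the procedure behave like a binary search.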

The specific method for creating the flow characteristic storage unit for the packet in step 6 is as follows: parse the packet and write the five-tuple information, flow arrival time, latest update time, minimum payload length and maximum payload length into the flow node, where the latest update time and the arrival time are both the current time, and the minimum and maximum payload lengths are both the application-layer payload length of the packet; if the packet is a TCP packet, parse the TCP header and detect whether the ACK, FIN, SYN and RST flag bits are set; for each flag bit that is set, count the flow once in the shared count tree that records the number of packets of the flow carrying that flag; count the flow once in the shared count tree that stores the flow size; calculate the length len of the packet and, taking 32 B as one data block, obtain the number c of data blocks occupied by the packet,

and count the flow c times in the shared count tree that stores the flow length; if the packet is not a TCP packet, or is a TCP packet with none of these flag bits set, the numbers of ACK, SYN, FIN and RST packets of the flow do not need to be counted in the shared count trees.

The specific steps of updating the flow characteristic storage unit of the flow in step 6 are as follows:

Step 6.1: update the latest update time of the flow in the flow node;

Step 6.2: parse the packet and calculate its application-layer payload length;

Step 6.3: if the application-layer payload length of the packet is greater than the current maximum payload length, update the maximum payload length; if it is less than the current minimum payload length, update the minimum payload length;

Step 6.4: update the statistical characteristic values in the shared count tree.

The specific steps of updating a statistical characteristic value in the shared count tree in step 6.4 are as follows:

Step 6.4.1: extract the five-tuple information of the packet, hash it, locate the flow node of the packet in the flow table, and extract the flow label f_tuple of the flow from the flow node;

Step 6.4.2: generate a random number i, i ∈ [0, r);

Step 6.4.3: select S[i] from the random number set S and XOR it with the flow label f_tuple; pass the XOR value as the argument to the main hash function H, which yields the hash function h_i; compute the hash value

$$u = H(f_{tuple} \oplus S[i]) \bmod p$$

to locate the enhanced counter used for this count, where p is the number of enhanced counters;

Step 6.4.4: update the statistical characteristic value C[u]: C[u] ← C[u] + 1;

if C[u] does not overflow, the update of the statistical characteristic value is complete;

if C[u] overflows and the index of its parent node is greater than or equal to the index of the virtual root node, the enhanced counter L[u] has overflowed and the shared count tree is no longer used for counting;

if C[u] overflows and the index of its parent node is less than the index of the virtual root node, update u,

and re-execute step 6.4.4.

The specific steps of restoring the statistical characteristic values of a flow from the shared count tree in step 7 are as follows:

Step 7.1: extract the flow label f_tuple from the flow node of the flow whose sampling has ended;

Step 7.2: XOR each element S[i] of the random number set S in turn with the flow label of flow f, pass the XOR values as arguments to the main hash function H, and compute the r hash values

$$H(f_{tuple} \oplus S[i]) \bmod p, \quad i \in [0, r)$$

to locate the r enhanced counters and subtree counters that record the statistical characteristic value of the flow;

Step 7.3: calculate the statistical characteristic value s of the flow;

where X_i is the value of the subtree counter; l_i is the height of the subtree counter; n is the total count value of the current shared count tree; and k_i is the number of leaf nodes in the subtree counter,

The beneficial effects of the invention are as follows:

The storage space optimized sampling method based on a shared counting tree does not reduce the accuracy with which an SVM classifier and a C4.5 classifier identify the application of the sampled traffic, and it needs only a small amount of storage during sampling. Specifically, when an SVM classifier is used for application identification on traffic sampled by the invention, the average accuracy over the identified applications is 0.867; when a C4.5 classifier is used, the average precision over the identified applications is 0.891. With 9 GB of traffic as input, the invention needs at most 700 KB and at least 200 KB of storage during sampling.

Drawings

Fig. 1 is an overall framework diagram of the present invention.

Fig. 2 is a flow chart of determining the relevant sampling probabilities in the present invention.

Fig. 3 is a flow chart of the sampling judgment strategy in the present invention.

FIG. 4 is a diagram of an example of an update stream characteristics storage unit according to the present invention.

FIG. 5 is a three-level shared count tree storage structure diagram.

FIG. 6 is a three-level shared count tree logical structure diagram.

Fig. 7 is a diagram of a counter memory structure.

Fig. 8 is a diagram of an enhanced counter.

Fig. 9 is a diagram of an enhanced counter vector.

Fig. 10 is a diagram of an example of restoring the flow characteristic record of the flow whose five-tuple information is tuple3.

Fig. 11 is a diagram for explaining a selection method of a root node in a sub-tree counter.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention provides a storage space optimized sampling method based on a shared counting tree, called MiniSamp. As shown in Fig. 1, the technical scheme of MiniSamp is as follows:

(1) Determine whether to sample each incoming packet according to a sampling judgment mechanism.

(2) If the incoming packet is to be sampled, retrieve the flow node to which it belongs in the hash flow tracking table; if no such flow node is found, create a new flow node for the packet in the flow tracking table.

(3) The flow characteristic storage structure is composed of flow characteristic storage units, each of which consists of a sampled flow node and the counters in the set of shared counting trees that record the statistical characteristic values of the sampled flow. Once the flow node of a sampled packet has been located in the flow tracking table, the relevant characteristics of the packet are extracted, and the characteristic values of the sampled flow are updated in the flow node and in the set of shared counting trees respectively.

(4) When sampling of a flow ends, restore the characteristic values of the flow stored in its flow node and in the set of shared counting trees, and import them into an ordered flow characteristic record buffer; write the sampled flow characteristic records to a file when the buffer is full.

(5) Input the sampled flow characteristic record file into a pre-trained classifier to generate application classification results on a per-flow basis.

(1) Determining the relevant sampling probabilities

During sampling, MiniSamp evaluates each incoming packet in three stages and decides from the result whether to sample it. The three stages are the source IP selection stage, the flow selection stage and the packet selection stage, and they are assigned the probabilities p_h, p_f and p_p respectively. p_h controls the size of the set of sampled source IPs in the source IP selection stage; p_f controls the number of flows collected from the sampled source IPs in the flow selection stage; and p_p controls the number of packets sampled in each sampled flow.

Before formal sampling starts, the values of the sampling probabilities p_h, p_f and p_p are determined. The network operator typically preconfigures the effective sampling ratio p_e and p_h (p_h ≥ p_e), and specifies the target flow sampling probability according to the sampling purpose.

To determine the sampling probabilities, an iterative algorithm takes incoming packets as input and repeatedly performs a binary search on p_f and p_p so that, while the effective sampling ratio remains p_e, p_f keeps approaching the target flow sampling probability; the algorithm then outputs the corresponding value of p_p. Once the three sampling probabilities p_h, p_f and p_p have been determined, all sampled flow characteristic records generated during this stage are discarded and formal sampling begins. A flow chart for determining the sampling probabilities is shown in Fig. 2.

(2) Sampling mechanism

The packet capture thread captures packets in a loop at the network card of the flow sampling device and stores the captured TCP and UDP packets in the packet buffer queue. A status flag tag is set to 1 during the sampling phase.

The sampling judgment logic is as follows. First, a packet is taken from the packet buffer queue and assigned two random numbers in the range [0,1], denoted r_f and r_p. In the source IP selection stage, the source IP of the packet is hashed and the hash value is multiplied by p_h to obtain a target value; if the target value is within the preset range, the packet enters the flow selection stage, otherwise it is discarded. In the flow selection stage, the five-tuple information of the packet (source IP, source port, destination IP, destination port, transport layer protocol) is hashed, and the hash value is used to look up the flow node of the packet's flow in the hash flow table. If no flow node is found and r_f ≤ p_f, the packet is sampled and a flow node is created for it in the flow table. If the flow node is found, the packet enters the packet selection stage: if r_p ≤ p_p, the packet is sampled and the flow characteristic storage unit of its flow is updated. In all other cases the packet is discarded. The flow chart is shown in Fig. 3.
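The three-stage decision can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation. The text multiplies the source IP hash by p_h and tests a preset range that it does not specify, so the sketch substitutes a simpler, equivalent-in-spirit test that accepts a source IP when its hash, scaled to 64 bits, falls below p_h. The names `source_ip_selected` and `decide`, the dictionary packet layout and the use of SHA-256 are all hypothetical.

```python
import hashlib
import random


def source_ip_selected(src_ip: str, p_h: float) -> bool:
    # Assumption: the "preset range" is modelled as [0, p_h) on a hash
    # scaled to 64 bits; the source text does not define the range.
    digest = hashlib.sha256(src_ip.encode()).digest()
    return int.from_bytes(digest[:8], "big") < p_h * 2**64


def decide(packet: dict, flow_table: dict, p_h: float, p_f: float,
           p_p: float, rng=random) -> str:
    """Return 'sample_new_flow', 'sample' or 'drop' for one packet."""
    r_f, r_p = rng.random(), rng.random()

    # Stage 1: source IP selection.
    if not source_ip_selected(packet["src_ip"], p_h):
        return "drop"

    # Stage 2: flow selection, keyed by the five-tuple.
    key = (packet["src_ip"], packet["src_port"], packet["dst_ip"],
           packet["dst_port"], packet["proto"])
    if key not in flow_table:
        if r_f <= p_f:
            flow_table[key] = {"packets": 1}  # new flow node
            return "sample_new_flow"
        return "drop"

    # Stage 3: packet selection for an already tracked flow.
    if r_p <= p_p:
        flow_table[key]["packets"] += 1
        return "sample"
    return "drop"
```

Hashing the source IP rather than drawing a random number keeps the stage 1 decision stable for a given host, so all flows of a selected host remain eligible.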

(3) Flow feature update

MiniSamp is a sampling algorithm that supports application classification. So that a machine learning algorithm can identify the application type of the sampled traffic, MiniSamp records, during sampling, the flow statistical characteristics that can be used for application classification. Drawing on existing features that support traffic application classification and on its own setting, the unidirectional sampled flow record generated by MiniSamp consists of the following flow features: transport layer protocol, source port, destination port, minimum payload length, maximum payload length, number of packets, total data length, average segment size, number of ACK packets, number of SYN packets, number of FIN packets, and number of RST packets. Features such as the minimum payload length, maximum payload length and number of packets can be used to identify UDP applications; features such as the numbers of ACK, SYN, FIN and RST packets and the average segment size can be used to identify TCP applications; and features such as the transport layer protocol, source port, destination port and total data length can be used to identify both TCP and UDP applications. In a UDP flow record, the statistical characteristic values related to TCP flag bits are all set to 0.

MiniSamp stores the application characteristics of a unidirectional flow partly in a flow node and partly in a set of shared count trees. The flow node records the source IP, destination IP, source port, destination port, transport layer protocol, minimum payload length, maximum payload length, flow arrival time and latest update time of the flow; the number of packets, total data length, and numbers of ACK, SYN, FIN and RST packets are each recorded in their own shared count tree. A flow characteristic storage unit consists of the flow node and the counters in the shared count trees that record the statistical characteristic values of the flow.

When a flow characteristic storage unit is created for a sampled packet, the packet is parsed and the five-tuple information, flow arrival time, latest update time, minimum payload length and maximum payload length are written into the flow node; the latest update time and the arrival time are both set to the current time, and the minimum and maximum payload lengths are both set to the application-layer payload length of the packet. If the packet is a TCP packet, its TCP header is parsed to detect whether the ACK, FIN, SYN and RST flag bits are set; for each flag bit that is set, the flow is counted once in the corresponding shared count tree. The flow is counted once in the shared count tree that stores the flow size. The length len of the packet is calculated and, taking 32 B as one data block, the number c of data blocks occupied by the packet is obtained; the flow is then counted c times in the shared count tree that stores the flow length. The calculation formula of c is as follows:

When the flow characteristic storage unit of a packet's flow is updated, the latest update time in the flow node is updated first. The packet is then parsed, and its application-layer payload length is calculated and compared with the maximum and minimum payload lengths in the flow node: if it is greater than the current maximum payload length, the maximum is updated; if it is less than the current minimum payload length, the minimum is updated. The shared count trees are updated in the same way as when the flow characteristic storage unit is created. An example of MiniSamp updating a flow characteristic storage unit is shown in Fig. 4, where SCT stands for Shared Counter Tree.
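The flow node fields and the minimum and maximum payload update described above can be sketched as a small Python data class. The class and function names are hypothetical, and the shared count tree updates are left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class FlowNode:
    five_tuple: tuple      # (src IP, src port, dst IP, dst port, protocol)
    arrival_time: float    # time the first sampled packet arrived
    last_update: float     # latest update time of the flow
    min_payload: int       # minimum application-layer payload length
    max_payload: int       # maximum application-layer payload length


def new_flow_node(five_tuple: tuple, now: float, payload_len: int) -> FlowNode:
    # On creation both timestamps are the current time and both payload
    # bounds are the payload length of the first sampled packet.
    return FlowNode(five_tuple, now, now, payload_len, payload_len)


def update_flow_node(node: FlowNode, now: float, payload_len: int) -> None:
    node.last_update = now
    if payload_len > node.max_payload:
        node.max_payload = payload_len
    if payload_len < node.min_payload:
        node.min_payload = payload_len
```

Only these per-flow fields need fixed-width storage; the counts that grow with flow size are the ones moved into the shared count trees.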

The process of updating a statistical characteristic value of a sampled flow in a shared count tree is described next. To record a given statistical characteristic value, a number of shared counters are allocated for all sampled flows together. These counters are logically organized into an approximate binary tree, which forms the shared count tree in MiniSamp.

The shared count tree is described as follows. The memory allocated to the shared count tree is N bits and each counter is allocated a bits, so the tree contains N/a counters. The tree has height h, that is, h levels, with the bottom level being level 0 and the top level being level h-1. The number of leaf nodes in the count tree is p, and every non-leaf node has degree 2. If level h-1 contains more than one node, a virtual root node is set in the count tree. The storage structure of a three-level shared count tree is shown in Fig. 5 and its logical structure in Fig. 6.

In each counter, the most significant bit is used as the status bit and the remaining a-1 bits are used for counting. The storage structure of the counter is shown in Fig. 7.

A leaf node of the count tree is denoted C[i], i ∈ [0, p). The nodes on the path from C[i] to the root node constitute an enhanced counter, denoted L[i]. For example, L[0] = {C[0], C[8], C[12]} is an enhanced counter. Since the count tree has p leaf nodes, there are p enhanced counters. The set of enhanced counters in the count tree is represented by a vector L, with L = {L[1], L[2], ...}. An enhanced counter is shown in Fig. 8.
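The counter layout, with one status bit above a-1 counting bits, can be illustrated with a few bit operations in Python. The helper names are hypothetical, and packing each counter into a plain integer is an assumption of this sketch.

```python
def status_bit(counter: int, a: int) -> int:
    """Return the most significant of the a bits (the status bit)."""
    return (counter >> (a - 1)) & 1


def count_value(counter: int, a: int) -> int:
    """Return the value held in the low a-1 counting bits."""
    return counter & ((1 << (a - 1)) - 1)


def set_status(counter: int, a: int) -> int:
    """Return the counter with its status bit set to 1."""
    return counter | (1 << (a - 1))
```

With a = 4, for example, the counter 0b1011 has status bit 1 and count 3, and the largest count a single counter can hold is 7.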
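The path that makes up an enhanced counter can be enumerated with a short Python sketch. The parent index rule `p + u // 2` is an assumption inferred from the example L[0] = {C[0], C[8], C[12]} with p = 8 leaves; it is not stated in this form in the text. The function names are hypothetical, and p is assumed to be a power of two.

```python
def parent_index(u: int, p: int) -> int:
    # Assumed layout: leaves occupy 0..p-1, the next level p..p+p/2-1,
    # and so on, which reproduces C[0] -> C[8] -> C[12] for p = 8.
    return p + u // 2


def enhanced_counter(leaf: int, p: int, h: int) -> list:
    """Indices of the counters on the path from a leaf toward the root."""
    total = sum(p >> level for level in range(h))  # index of the virtual root
    path, u = [], leaf
    while u < total:
        path.append(u)
        u = parent_index(u, p)
    return path
```

For the three-level tree with 8 leaves, `enhanced_counter(0, 8, 3)` yields the indices 0, 8 and 12, matching the example above.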

For any sampled flow f, r (r < p) enhanced counters are selected from the p enhanced counters using the five-tuple information of the flow and r independent hash functions h_i. These r enhanced counters form the enhanced counter vector of flow f, denoted L_f, whose i-th element is denoted L_f[i]. L_f[i] is calculated as follows, where i ∈ [0, r) and the range of the hash function h_i is [0, p-1].

L_f[i] = L[h_i(f)] (2)

To simplify the design of the hash functions, r mutually independent hash functions are not actually designed. Instead, a single main hash function H is combined with a set S of r elements to simulate r mutually independent hash functions. The formula for hashing flow f with hash function h_i is as follows:

$$h_i(f) = H(f_{tuple} \oplus S[i]) \bmod p$$

Since r << p, the probability of randomly choosing r distinct counters from the p enhanced counters of the shared count tree is

$$\frac{p(p-1)\cdots(p-r+1)}{p^{r}} \approx 1$$

Thus the enhanced counters in vector L_f can be regarded as distinct from one another.
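The claim that r draws from p counters are almost always distinct when r is much smaller than p can be checked numerically. The sketch below assumes r independent uniform draws; the function name is hypothetical.

```python
from math import prod


def prob_all_distinct(p: int, r: int) -> float:
    """Probability that r independent uniform draws from p items all differ."""
    return prod((p - i) / p for i in range(r))
```

For the toy case p = 8 and r = 3 the probability is 0.65625, while for p = 1,000,000 and r = 3 it exceeds 0.99999, which is the regime the text has in mind.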

In Fig. 9, r = 3; the enhanced counter vector of flow f is L_f = {L[1], L[0], L[4]}, and that of flow g is L_g = {L[3], L[4], L[6]}. The enhanced counter L[4] is shared by flow f and flow g. Because different flows can share the same enhanced counter, the number of flows that the shared count tree can record is much larger than the number of its enhanced counters.
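Simulating r hash functions with one main hash function and a seed set can be sketched in Python as follows. SHA-256 truncated to 64 bits stands in for the main hash function H, the flow label is folded into a 64-bit integer by hashing its textual form, and the seed values are arbitrary; all of these are assumptions of the sketch, as are the function names.

```python
import hashlib


def main_hash(value: int) -> int:
    # Stand-in for the main hash function H (assumption: SHA-256, 64 bits).
    data = value.to_bytes(16, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def flow_label(five_tuple: tuple) -> int:
    # Assumption: the flow label is reduced to a 64-bit integer.
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def enhanced_counter_indices(label: int, seeds: list, p: int) -> list:
    """Leaf indices of the r enhanced counters selected for one flow."""
    return [main_hash(label ^ s) % p for s in seeds]
```

A possible use: with three 64-bit seeds and p = 8, `enhanced_counter_indices(flow_label(tuple_), seeds, 8)` returns three leaf indices in [0, 8), and the same flow always maps to the same indices.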

The steps for updating statistical characteristic values such as the numbers of packets, ACK packets, SYN packets, FIN packets and RST packets of a sampled flow in the shared count tree are as follows:

Step 1: extract the five-tuple information of the incoming packet, hash it, locate the flow node of the packet in the flow table, and extract the flow label of the flow from the flow node; the flow label contains the source IP, source port, destination IP, destination port, transport layer protocol, flow arrival time and similar information. Suppose the flow of the incoming packet is f and its flow label is f_tuple.

Step 2: generate a random number i, where i ∈ [0, r).

Step 3: H is the main hash function. Select S[i] from the random number set S, whose size is r, and XOR it with the flow label f_tuple; pass the XOR value as the argument to the main hash function H, which yields the hash function h_i, and compute the hash value

$$u = H(f_{tuple} \oplus S[i]) \bmod p$$

to locate the enhanced counter used for this count.

Step 4: C[u] ← C[u] + 1. If C[u] does not overflow, the procedure ends. If C[u] overflows and the index of its parent node is greater than or equal to the index of the virtual root node, the enhanced counter L[u] has overflowed and the shared count tree is no longer used for counting. If C[u] overflows and the index of its parent node is less than the index of the virtual root node, the status bit of C[u] is set to 1, u is updated according to equation (5), and step 4 is repeated.
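The overflow handling of step 4 can be sketched as a small Python class. Several details are assumptions of this sketch rather than statements from the text: the parent index rule `p + u // 2` (inferred from the example L[0] = {C[0], C[8], C[12]}), a counter wrapping to zero when it overflows, saturation of the topmost counter, and keeping the status bits in a separate list. The class name is hypothetical.

```python
class SharedCountTree:
    def __init__(self, p: int, h: int, a: int):
        self.p = p
        self.total = sum(p >> level for level in range(h))  # virtual root index
        self.count = [0] * self.total
        self.status = [0] * self.total
        self.max_count = (1 << (a - 1)) - 1  # a-1 counting bits per counter

    def increment(self, u: int) -> bool:
        """Add 1 at leaf u, carrying upward on overflow.

        Returns False when the whole enhanced counter has overflowed.
        """
        while True:
            self.count[u] += 1
            if self.count[u] <= self.max_count:
                return True
            parent = self.p + u // 2          # assumed parent index rule
            if parent >= self.total:
                self.count[u] = self.max_count  # saturate (assumption)
                return False
            self.count[u] = 0                 # wrap on overflow (assumption)
            self.status[u] = 1                # mark that the parent is in use
            u = parent
```

With p = 8, h = 3 and a = 3, four increments at leaf 4 leave C[4] at 0 with its status bit set and C[10] at 1, so the carry has moved into the shared parent.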

The steps for recording the total data length of a flow in the shared count tree during sampling are as follows:

Step 1: parse the incoming packet, calculate its length len and, taking 32 B as one data block, calculate the number c of data blocks of the packet.

Step 2: c ← c - 1.

Step 3: count the flow to which the packet belongs once in the shared count tree that records the number of data blocks.

Step 4: if c is less than 0, the counting of the total data length of the flow ends; otherwise, go to step 2.
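Recording the length of a packet in 32 B blocks can be sketched in Python as follows. The formula for c is not reproduced in this text, so rounding up to whole blocks is an assumption of the sketch. The loop counts the flow exactly c times, following the statement elsewhere in the description that the flow is counted c times. The names `data_blocks`, `record_length` and `count_once` are hypothetical.

```python
BLOCK_SIZE = 32  # bytes per data block, as stated in the text


def data_blocks(length: int) -> int:
    # Assumption: a partially filled block counts as one whole block.
    return (length + BLOCK_SIZE - 1) // BLOCK_SIZE


def record_length(length: int, count_once) -> int:
    """Call count_once() once per data block and return the block count."""
    c = data_blocks(length)
    for _ in range(c):
        count_once()
    return c
```

A 1500 B packet, for instance, occupies 47 blocks under this assumption, so the length tree receives 47 unit increments for it.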

(4) Flow characteristic restoration

The flow characteristic restoration thread is started every 10 s. After sampling of a flow has ended, the thread restores the flow characteristics stored in the flow node and the flow statistical characteristics stored in the shared count trees into a complete flow record and adds the record to the flow record buffer queue.

The flow characteristic restoration thread first sets the status flag tag to 0; no sampling takes place while the flag is 0, and captured packets are held in the packet buffer queue. So as not to interfere with sampling operations still in progress, the thread waits 0.25 s before acquiring the mutex of the flow table, and then obtains the current time. It scans the flow nodes in the flow table in sequence. For each flow node it calculates the interval between the latest update time of the flow and the current time; if the interval exceeds 16 seconds, the flow is considered no longer active in the network and its sampling ends. The thread then uses the five-tuple information and arrival time of the flow to restore from the set of shared count trees the statistical characteristic values of the sampled flow, such as the number of packets, total length, and numbers of ACK, SYN, FIN and RST packets. It multiplies the total length of the flow by 32 and divides the result by the number of packets to obtain the average length of the flow. The five-tuple information, flow arrival time, flow duration, minimum payload length, maximum payload length, the statistical characteristic values restored from the set of shared count trees and the average length are written into a flow record, and the flow record is added to the flow record buffer queue. If the number of flow records in the flow record buffer queue exceeds m (where m is far smaller than the size of the hash table), the thread acquires the mutex of the ordered flow record buffer queue, which keeps flow records ordered by source IP and flow arrival time.
After all flow records in the flow record buffer queue have been dequeued and added to the ordered flow record buffer queue, the mutex of the ordered flow record buffer queue is released. When all flow nodes in the flow table have been scanned and restored, the thread releases the mutex of the flow table and sets the status flag tag to 1; the flow characteristic restoration stage ends and sampling resumes. An example of restoring the flow characteristic record of the flow whose five-tuple information is tuple3 is shown in Fig. 10.
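Two of the checks made during the scan, the 16 s inactivity test and the average length computation, can be sketched in Python. The function names are hypothetical, the threading and locking are omitted, and returning 0.0 for a flow with no packets is an assumption of the sketch.

```python
IDLE_TIMEOUT = 16.0  # seconds without an update before a flow is finished
BLOCK_SIZE = 32      # bytes per data block


def flow_finished(last_update: float, now: float,
                  timeout: float = IDLE_TIMEOUT) -> bool:
    """True when the flow has been idle for longer than the timeout."""
    return now - last_update > timeout


def average_length(total_blocks: int, packets: int) -> float:
    """Average bytes per packet from the restored block and packet counts."""
    if packets == 0:
        return 0.0  # assumption: guard against an empty restored count
    return total_blocks * BLOCK_SIZE / packets
```

Multiplying by 32 converts the restored block count back to bytes before dividing, which is why the stored total length is expressed in blocks rather than bytes.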

MiniSamp uses the notion of a host active period as an approximation of an application session. A host active period is a set of flows with the same source IP; the flows in the set may belong to different types of applications, are ordered by arrival time, and each arrives less than t seconds after the flow immediately preceding it. The output thread writes the flow records in the ordered flow record buffer queue to the sampled flow record file one host active period at a time.
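Grouping flow records into host active periods can be sketched in Python as follows. Flow records are reduced here to (source IP, arrival time) pairs, which is a simplification for illustration, and the function name is hypothetical.

```python
from itertools import groupby


def host_active_periods(flows: list, t: float) -> list:
    """Split (src_ip, arrival_time) records into host active periods.

    Records are ordered by source IP and arrival time; a new period starts
    when a flow arrives t seconds or more after the previous flow of the
    same source IP.
    """
    periods = []
    ordered = sorted(flows, key=lambda f: (f[0], f[1]))
    for _, group in groupby(ordered, key=lambda f: f[0]):
        current = []
        for flow in group:
            if current and flow[1] - current[-1][1] >= t:
                periods.append(current)
                current = []
            current.append(flow)
        periods.append(current)
    return periods
```

Sorting by source IP and then arrival time mirrors the ordering of the ordered flow record buffer queue, so each period can be written out as one contiguous run.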

The process of restoring the statistical characteristic values of a sampled flow from the shared count tree is introduced below. First, the concept of a subtree counter is introduced. In FIG. 6, the subtree rooted at C[13] contains the nodes C[13], C[10], C[11], C[4], C[5], C[6] and C[7]. All nodes in a subtree together constitute a subtree counter whose value can be calculated; for example, the value of the subtree counter rooted at C[13] is Value(13) = 2^{2a}·C[13] + 2^{a}·(C[10] + C[11]) + (C[4] + C[5] + C[6] + C[7]). Whether a node in the shared count tree can serve as the root node of a subtree depends on the status bits of the node counters, and each enhancement counter corresponds to one subtree counter.

FIG. 11 takes the enhancement counter L[4] as an example to show how the root node of a subtree counter is selected.

In (a), the counter C[4] has not overflowed, so the status bit of C[4] is not set and C[4] is the root node of its subtree; there is no need to check whether an ancestor of C[4] can serve as the root node. The subtree contains a single leaf node, C[4], and its height is l = 1.

In (b), the counter C[4] has overflowed but the counter C[10] has not, so the status bit of C[4] is set and C[10] is the root node of the subtree; there is no need to check whether an ancestor of C[10] can serve as the root node. The subtree contains two leaf nodes, C[4] and C[5], and one non-leaf node, C[10], and its height is l = 2.

In (c), the counters C[4] and C[10] have both overflowed, so the status bits of C[4] and C[10] are set and C[13] is the root node of the subtree. The subtree contains four leaf nodes, C[4], C[5], C[6] and C[7], and three non-leaf nodes, C[10], C[11] and C[13], and its height is l = 3.
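The subtree counter value and the root selection rule can be sketched in Python as follows. The function names are hypothetical. The weighting follows the Value(13) example, with a node j levels above the leaves weighted by 2^(j·a); the reading of a as the per-node counter width in bits is an assumption. The leaf count formula 2^(l−1) is inferred from the three cases of FIG. 11 (1, 2 and 4 leaves) and is likewise an assumption.

```python
def subtree_counter_value(levels, a):
    """Value of a subtree counter.

    levels[0] holds the leaf counters and levels[-1] holds the single
    root counter; a node j levels above the leaves is weighted 2**(j*a).
    """
    return sum((1 << (j * a)) * sum(level) for j, level in enumerate(levels))


def subtree_height(status_bits):
    """Height of the subtree for one leaf.

    status_bits lists the status bits on the path from the leaf upward.
    Each set bit (an overflowed counter) moves the root one level up.
    """
    height = 1
    for bit in status_bits:
        if not bit:
            break
        height += 1
    return height


def leaf_count(height):
    # Assumption inferred from FIG. 11: heights 1, 2, 3 give 1, 2, 4 leaves.
    return 2 ** (height - 1)
```

For instance, with a = 4 and counters C[4..7] = 1, 2, 3, 4, C[10] = 5, C[11] = 6 and C[13] = 7, `subtree_counter_value([[1, 2, 3, 4], [5, 6], [7]], 4)` evaluates 10 + 16·11 + 256·7.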

The steps for restoring statistical characteristic values such as the number of data packets, the number of ACK packets, the number of SYN packets, the number of FIN packets, the number of RST packets and the total number of data blocks of a flow from the shared count tree are as follows:

Step 1: Extract the flow label f_tuple from the flow node of the flow f whose sampling has ended.

Step 2: Let H be the main hash function. XOR each element S[i], i ∈ [0, r), of the random number set S with the flow label of flow f in turn, pass each XOR value as a parameter to the main hash function H, and calculate r hash values. These locate the r enhancement counters and subtree counters that record the statistical characteristic values of the flow.

Step 3: Calculate the values X_i, i ∈ [0, r), of the r subtree counters.

Step 4: Calculate the heights l_i of the r subtree counters and the numbers of leaf nodes k_i in the r subtree counters.

Step 5: Calculate the total count value n of the current shared count tree.

Step 6: Calculate the statistical characteristic value of flow f using Equation 6. Equation 6 is derived from statistical principles; its result is an estimate of the statistical characteristic value to be restored, and this estimate can be shown to be unbiased.
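Step 2 of the procedure above can be illustrated with a short Python sketch. The main hash function H is not specified in this section, so CRC32 from the standard library is used purely as a stand-in; reducing the hash modulo the number of leaf counters m and representing the flow label as a 64-bit integer are further assumptions. Equation 6 is not reproduced in the text, so the estimate itself is not sketched.

```python
import zlib


def locate_counters(f_tuple, S, m):
    """Return r leaf indices, one per element of the random number set S.

    f_tuple: flow label as an unsigned 64-bit integer (assumption).
    S:       list of r unsigned 64-bit random numbers.
    m:       number of leaf counters (assumption: index = hash mod m).
    """
    def H(x):
        # Stand-in for the main hash function H (assumption).
        return zlib.crc32(x.to_bytes(8, "big"))

    return [H(f_tuple ^ s) % m for s in S]
```

Because every S[i] is fixed, the same flow label always maps to the same r counters, which is what allows the restoration thread to find the counters that were updated during sampling.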

When sampling of flow f ends, MiniSamp restores the total data length of the flow from the shared count tree as follows:

Step 1: Restore the total number of data blocks c of flow f.

Step 2: Calculate the length len of flow f, where len ← 32 × c.

The storage space optimized sampling method based on the shared count tree does not reduce the accuracy with which the SVM classifier and the C4.5 classifier identify applications in the sampled traffic, and it requires only a small amount of storage space during sampling. Specifically, when an SVM classifier is used for application identification on traffic sampled by the invention, the average precision across the identified applications is 0.867; when a C4.5 classifier is used, the average precision across the identified applications is 0.891. With 9 GB of traffic as input, the invention requires between 200 KB and 700 KB of storage space during sampling.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A storage space optimized sampling method based on a shared count tree, characterized by comprising the following steps:

p_p = p_e / p_h

p_f = p_e / p_h

Step 1.3: order to

Step 1.4: Obtain the current sampling ratio p_x;

Step 1.5: if it is

Step 1.6: If |p_x − p_e| ≤ α, return to step 1.3; otherwise, execute step 1.7;

Step 1.7: If the current sampling ratio p_x is greater than the preconfigured effective sampling ratio p_e, let p_p = 0.5 × (p_p + t), where t = 0.00001, and return to step 1.6; otherwise, let p_p = 0.5 × (p_p + 1) and return to step 1.6;

Step 2: Extract a data packet from the data packet buffer queue and assign it two random numbers r_f and r_p in the range [0, 1];

Step 3: Acquire the source IP of the data packet and calculate the hash value of the source IP;

Step 6: Search for the flow node to which the data packet belongs;

if the flow node of the data packet is found and r_p ≤ p_p, sample the data packet and update the flow characteristic storage unit of the flow;

if the flow node corresponding to the data packet is not found and r_f > p_f, or the flow node to which the data packet belongs is found and r_p > p_p, discard the data packet;
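The probability adjustment of steps 1.6 and 1.7 and the per-packet decision of step 6 can be sketched in Python as follows. The function names are hypothetical. Two assumptions are made: the caller re-measures the current sampling ratio p_x between successive adjustments, and a packet whose flow node is not found is used to create a new flow node when r_f ≤ p_f (that branch is not spelled out in the claim text as reproduced here).

```python
def adjust_packet_probability(p_p, p_x, p_e, alpha, t=0.00001):
    """One adjustment of the packet sampling probability p_p.

    If the current ratio p_x is within alpha of the target p_e, p_p is
    kept. Otherwise p_p is moved halfway toward t (sampling too much)
    or halfway toward 1 (sampling too little).
    """
    if abs(p_x - p_e) <= alpha:
        return p_p
    if p_x > p_e:
        return 0.5 * (p_p + t)
    return 0.5 * (p_p + 1.0)


def sampling_decision(found, r_f, r_p, p_f, p_p):
    """Decide what to do with one packet.

    Returns 'update' (sample and update the existing flow node),
    'create' (assumed branch: start tracking a new flow) or 'discard'.
    """
    if found:
        return "update" if r_p <= p_p else "discard"
    return "create" if r_f <= p_f else "discard"
```

Halving the distance to a bound on every call keeps p_p strictly between t and 1, so the packet sampling probability never reaches zero.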

2. The storage space optimized sampling method based on a shared count tree as claimed in claim 1, wherein the specific method for creating the flow characteristic storage unit for the data packet in step 6 is as follows: parse the data packet, and write the quintuple information, flow arrival time, flow latest update time, minimum payload length and maximum payload length into a flow node, wherein the flow latest update time and the flow arrival time are both the current time, and the minimum payload length and the maximum payload length are both the application-layer payload length of the data packet; if the data packet is a TCP data packet, parse the TCP header and detect whether the flag bits ACK, FIN, SYN and RST are set; for each flag bit that is set, count the data packet in the corresponding shared count tree, which records the number of packets of the flow carrying that flag; count the flow in the shared count tree that stores the flow size; calculate the length len of the data packet and, taking 32 B as one data block, obtain the number c of data blocks occupied by the data packet,
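The flag detection and block counting named in this claim can be illustrated with a small Python sketch. The flag masks are the standard TCP header bit positions. The formula for c is not reproduced in the text above, so rounding up to whole 32-byte blocks is an assumption, and the function names are hypothetical.

```python
# Standard bit masks of the TCP flags byte.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "ACK": 0x10}

BLOCK_BYTES = 32


def set_flags(flags_byte):
    """Names of the tracked flags that are set in the TCP flags byte."""
    return [name for name, mask in TCP_FLAGS.items() if flags_byte & mask]


def data_blocks(length, block=BLOCK_BYTES):
    """Number of data blocks occupied by a packet of the given length.

    Assumption: a partially filled block counts as a whole block.
    """
    return -(-length // block)  # ceiling division
```

Each name returned by `set_flags` would select the shared count tree in which the packet is counted, and `data_blocks` gives the increment for the tree that stores the flow size.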

3. The method for optimized sampling of storage space based on shared count tree as claimed in claim 1 or 2, wherein: the specific steps of updating the stream feature storage unit of the stream in the step 6 are as follows:

Step 6.1: Update the flow latest update time in the flow node;

Step 6.4: Update the statistical characteristic values in the shared count tree.

4. The method for optimized sampling of storage space based on shared count tree as claimed in claim 3, wherein: the specific steps of updating the statistical characteristic value in the shared count tree in the step 6.4 are as follows:

Step 6.4.2: Generate a random number i, i ∈ [0, r), where r is the number of hash functions h_i;

Step 6.4.3: Select S[i] from the random number set S and XOR it with the flow label f_tuple; pass the XOR value as a parameter to the main hash function H to generate the hash function h_i; compute the hash value

if CU does not overflow, the procedure ends and the update of the statistical characteristic value is complete;

step 6.4.4 is re-executed.
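Several steps of this claim are not reproduced in the text, so the following Python sketch is only one plausible reading of how an overflow propagates, chosen to be consistent with the subtree counter formula Value(13) = 2^{2a}·C[13] + 2^{a}·(C[10] + C[11]) + (C[4] + C[5] + C[6] + C[7]). The assumptions are: counters are a bits wide; an overflowing counter wraps to zero, sets its status bit and carries one unit to its parent; and the tree is stored level by level with the leaves first, which matches the parent relationships of FIG. 6 (C[4] and C[5] under C[10], C[10] and C[11] under C[13]). The name `add_one` is hypothetical.

```python
def add_one(counters, status, leaf, a, num_leaves):
    """Add one unit at a leaf and carry overflows toward the root.

    counters / status: flat lists for a complete binary tree stored
    level by level, leaves first (assumed layout).
    """
    limit = 1 << a                        # an a-bit counter holds 0 .. 2**a - 1
    start, width, idx = 0, num_leaves, leaf
    while True:
        counters[idx] += 1
        if counters[idx] < limit or width == 1:
            return                        # no overflow, or already at the tree root
        counters[idx] = 0                 # wrap around and carry upward
        status[idx] = True                # mark this counter as overflowed
        idx = start + width + (idx - start) // 2   # index of the parent node
        start, width = start + width, width // 2
```

With 8 leaves and a = 2, sixteen increments at leaf C[4] leave C[4] = 0, C[10] = 0 and C[13] = 1, and the subtree counter rooted at C[13] then evaluates to 2^{2a}·1 = 16, the number of increments.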

5. The method of claim 4, wherein the specific steps of restoring the flow characteristics from the shared count tree in step 7 are:

Step 7.2: XOR the elements S[i] of the random number set S with the flow label of flow f in turn, pass the XOR values as parameters to the main hash function H, and calculate r hash values respectively

Step 7.3: Calculate a statistical characteristic value s of the flow;

wherein X_i is the value of the subtree counter; l_i is the height of the subtree counter; n is the total count value of the current shared count tree; k_i is the number of leaf nodes in the subtree counter,